Platforms

Section 25.1

"Image API: dollars per call. Video API: dollars per second. Audio API: dollars per minute. Welcome to multimodal billing."

QuantQuant, Per-Token-Counter AI Agent
Big Picture

Multimodal platforms split the same way the text-LLM platforms in Section 14.1 do: closed APIs at the frontier, open weights behind them. This section maps the 2026 platform shelf (Midjourney, OpenAI Images, Imagen, FLUX, Stable Diffusion, Sora, Veo, Runway, Kling, ElevenLabs, Suno, Udio, Cartesia) and tells you which platform earns each call in an LLM-driven multimodal agent product.

Prerequisites

This section assumes the text-LLM API patterns from Section 11.1 and the closed-versus-open trade-off from Section 10.6. Familiarity with the multimodal architectures in Section 22.1 and Section 20.1 helps you read pricing.

Key Insight: Same split as text platforms, different latency

The split between closed APIs at the frontier and open weights behind them carries over from text, and so do the calling patterns from Section 11.1 (auth, retries, streaming) and the stack from Section 14.1, essentially unchanged. What changes is latency (images take seconds and video takes minutes, rather than the sub-second responses of text chat) and pricing (multimodal "tokens" are large; a single video generation can cost more than 10,000 chat completions).
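Minute-scale video latency usually means a request cannot simply block on a response the way a chat call does; a common pattern is to submit a job and poll for its status. The sketch below shows one way to do that with capped exponential backoff. The `fetch_status` callable and the status strings (`queued`, `running`, `succeeded`, `failed`) are assumptions made for illustration, not any specific vendor's API.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, str]],
    timeout_s: float = 600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll a long-running generation job until it reaches a terminal state.

    fetch_status is a hypothetical callable that returns a dict with a
    "status" key: "queued", "running", "succeeded", or "failed".
    """
    waited = 0.0  # counts time spent sleeping, not time spent in fetch_status
    delay = initial_delay_s
    while True:
        job = fetch_status()
        if job["status"] == "succeeded":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"generation failed: {job.get('error', 'unknown')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job still {job['status']} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped


# Simulated job: two non-terminal polls, then success. No network involved.
_states = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "succeeded", "url": "https://example.invalid/video.mp4"},
])
result = poll_until_done(lambda: next(_states), sleep=lambda seconds: None)
print(result["status"])
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production you would leave the default and wrap whatever status call your platform's client exposes.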

Each modality has its own platform shelf. Image and video generation in particular split sharply between closed APIs (where the SOTA lives) and open-weights (where flexibility and self-hosting live).

Grid of multimodal platforms organized by modality (image, video, audio) and access pattern (closed API frontier vs open weights), with latency and billing notes
Figure 25.1.1: The 2026 multimodal platform grid. Closed APIs (left) hold the quality frontier; open weights (right) give you self-hosting and fine-tuning. Latency and billing scale up sharply from image (seconds, cents) to video (minutes, dollars).
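Because each modality bills in a different unit (per call, per second, per minute), it can help to normalize costs before comparing platforms. The following is a minimal sketch of such an estimator. Every price in it is a placeholder chosen for illustration, not real vendor pricing, and the names `Rate`, `RATES`, `estimate_cost`, and `chat_equivalents` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    unit: str            # billing unit: "call", "second", or "minute"
    usd_per_unit: float  # price per billing unit in US dollars


# PLACEHOLDER prices for illustration only; substitute your vendor's rates.
RATES = {
    "image": Rate(unit="call", usd_per_unit=0.04),
    "video": Rate(unit="second", usd_per_unit=0.50),
    "audio": Rate(unit="minute", usd_per_unit=0.10),
}


def estimate_cost(modality: str, quantity: float) -> float:
    """Return the estimated cost in USD for `quantity` billing units."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    rate = RATES[modality]  # raises KeyError for an unknown modality
    return rate.usd_per_unit * quantity


def chat_equivalents(cost_usd: float, usd_per_chat_call: float) -> float:
    """Express a cost as a number of chat completions at a given price."""
    return cost_usd / usd_per_chat_call


# An 8-second clip under the placeholder video rate, compared with a
# placeholder chat-completion price of $0.002 per call.
clip_cost = estimate_cost("video", 8)
print(f"clip: ${clip_cost:.2f} per {RATES['video'].unit}-billed job")
print(f"chat-call equivalents: {chat_equivalents(clip_cost, 0.002):,.0f}")
```

Keeping the billing unit next to the price makes it harder to mix up seconds and minutes when you add a new platform to the table.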

25.1.1 Image platforms

Two image-editing capabilities became platform-level features in 2024-25 that the older "text-to-image" frame misses: inpainting and contextual editing. FLUX.1 Fill (2024) and FLUX.1 Kontext (2025), both from Black Forest Labs, cover them as first-class operations. Nano Banana (Gemini 2.5 Flash Image; Google, 2025) is Google's image-editing model. HiDream-I1 (2025-Q2) is an open image model competitive with FLUX. Image editing and inpainting are now their own platform row, not a sub-feature.

25.1.2 Video platforms

The 2024-25 video-platform shelf gained several entrants worth knowing alongside Sora and Veo. ByteDance's Seedream (image) and Seedance (video), both 2025, are a top-of-leaderboard image and video pair. Lightricks LTX-Video (2024-12) is an open model for real-time video generation; pair it with Wan 2.5 (Alibaba, 2025) to cover the "real-time / sub-second video" tier that most platform tables miss. Hailuo / MiniMax video (2024-25) sits alongside Kling among the Chinese-hosted video platforms. The Sora public release (Dec 2024) and the Veo 2 launch (Dec 2024) are the two adoption case studies that anchored the late-2024 platform race.

25.1.3 Audio and music platforms

Note: Aggregators are underrated

Replicate and fal.ai host most open-weights multimodal models behind a uniform API. They are the right answer when you want to A/B test image or video models without managing GPUs.
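A uniform API means an A/B comparison reduces to looping over model identifiers with the same call. The sketch below assumes a `generate(model_id, prompt)` callable that wraps whichever aggregator client you use; that signature, the model identifiers, and the stand-in backend are all hypothetical, chosen so the example runs without network access or credentials.

```python
import time
from typing import Callable, Dict, List

# Hypothetical uniform signature: (model_id, prompt) -> URL of the output.
GenerateFn = Callable[[str, str], str]


def ab_test(
    generate: GenerateFn,
    model_ids: List[str],
    prompts: List[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Dict[str, object]]:
    """Run every prompt through every model; record outputs and mean latency."""
    if not prompts:
        raise ValueError("need at least one prompt")
    results: Dict[str, Dict[str, object]] = {}
    for model_id in model_ids:
        outputs = []
        start = clock()
        for prompt in prompts:
            outputs.append(generate(model_id, prompt))
        elapsed = clock() - start
        results[model_id] = {
            "outputs": outputs,
            "mean_latency_s": elapsed / len(prompts),
        }
    return results


def fake_generate(model_id: str, prompt: str) -> str:
    """Stand-in backend; replace with a wrapper around a real client call."""
    return f"https://example.invalid/{model_id}/{len(prompt)}.png"


report = ab_test(
    fake_generate,
    model_ids=["vendor-a/image-model", "vendor-b/image-model"],  # made-up IDs
    prompts=["a red bicycle", "a lighthouse at dusk"],
)
for model_id, stats in report.items():
    print(model_id, f"{stats['mean_latency_s']:.4f}s", len(stats["outputs"]))
```

Latency is only half of an A/B test; the collected `outputs` are what you would hand to human raters or an automated quality metric.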

What's Next?

In the next section, Section 25.2: Libraries & Frameworks, we build on the material covered here.

Further Reading

Multimodal Platforms

OpenAI (2024). "GPT-4o System Card." openai.com/index/gpt-4o-system-card. Reference multimodal LLM platform.
Gemini Team, Google (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805. Reference native-multimodal LLM platform.
Anthropic (2024). "Claude 3.5 Sonnet." anthropic.com/news/claude-3-5-sonnet. Reference vision-capable LLM platform.