Section 25.1: Platforms

"Image API: dollars per call. Video API: dollars per second. Audio API: dollars per minute. Welcome to multimodal billing."
Quant, Per-Token-Counter AI Agent

Big Picture

Multimodal platforms split the same way the text-LLM platforms in Section 14.1 do: closed APIs at the frontier, open weights behind them. This section maps the 2026 platform shelf (Midjourney, OpenAI Images, Imagen, FLUX, SD, Sora, Veo, Runway, Kling, ElevenLabs, Suno, Udio, Cartesia) and tells you which platform earns each call in an LLM-driven multimodal agent product.

Prerequisites

This section assumes the text-LLM API patterns from Section 11.1 and the closed-versus-open trade-off from Section 10.6. Familiarity with the multimodal architectures in Section 22.1 and Section 20.1 helps you read pricing.

Key Insight: Same split as text platforms, different latency

Multimodal platforms split the same way the text-LLM platforms in Section 11.1 do: closed APIs at the frontier, open weights behind them. The calling patterns from Section 11.1 (auth, retries, streaming) and the stack from Section 14.1 transfer essentially unchanged. The only platform-specific surprises are the latency profiles (image: seconds; video: minutes; not the sub-second range of text chat) and pricing (multimodal "tokens" are large; a video generation can cost more than 10000 chat completions).

Each modality has its own platform shelf. Image and video generation in particular split sharply between closed APIs (where the SOTA lives) and open-weights (where flexibility and self-hosting live).

Grid of multimodal platforms organized by modality (image, video, audio) and access pattern (closed API frontier vs open weights), with latency and billing notes — **Figure 25.1.1:** The 2026 multimodal platform grid. Closed APIs (left) hold the quality frontier; open weights (right) give you self-hosting and fine-tuning. Latency and billing scale up sharply from image (seconds, cents) to video (minutes, dollars).

25.1.1 Image platforms

Two image-editing capabilities became platform-level features in 2024-25 that the older "text-to-image" frame misses. FLUX.1 Fill and FLUX Kontext (Black Forest Labs, 2025) cover image inpainting and contextual editing as first-class operations. NanoBanana (Google, 2025) is Google's image-editing model. HiDream-I1 (2025-Q2) is an open image model competitive with FLUX. Image editing and inpainting are now their own platform row, not a sub-feature.

Midjourney (Midjourney Inc., 2022; v7 in 2025) is the Discord-first image-generation service that defined the contemporary "AI art aesthetic" and remains the SOTA for stylized, painterly, and cinematic imagery. Its objective is to be the highest-quality artistic-image generator at the cost of API friction (the official API is gated and limited), which matters when output quality outweighs integration convenience. The core concept is heavily-curated training data plus their proprietary aesthetic-tuned models; prompts are short and stylistic. Pick Midjourney when artistic quality is the binding constraint; avoid when you need fine programmatic control or licensing for commercial use without their commercial tier.
OpenAI Images (DALL-E 3 / GPT-Image) (OpenAI, 2023+) is OpenAI's image generation API, distinguished by tight integration with GPT-4/5 for prompt interpretation and rewriting. Its objective is to handle complex multi-element prompts and text-in-image rendering better than competitors, which matters when your prompts are long and specific or when you need legible text in the image. The core concept is "GPT-as-image-prompter" where the LLM expands and refines your prompt before the diffusion step. Pick GPT-Image when you have long prompts or need text rendering; for pure aesthetic quality, Midjourney typically wins.
Google Imagen 4 (Google DeepMind, 2025) is Google's flagship text-to-image model, served via AI Studio and Vertex AI. Its objective is photorealistic rendering with strong prompt following, which matters when realism beats stylization. The core concept is large-scale latent-diffusion with Gemini-quality prompt encoding. Pick Imagen 4 for photorealistic imagery or for GCP-native workflows; outside GCP, Midjourney and FLUX have more open ecosystem support.
Black Forest Labs (FLUX.1) (Black Forest Labs, 2024) is the open-weights image-generation family from the team that previously built Stable Diffusion, and currently the strongest open competitor to Midjourney on quality. Its objective is to ship state-of-the-art open weights so the community can fine-tune and self-host without being locked into closed APIs, which matters when you need LoRAs for specific characters or styles. The core concept is a 12B-parameter rectified-flow transformer trained on a curated dataset; FLUX.1 [schnell] is Apache 2.0, [dev] is non-commercial. Pick FLUX as the open SOTA in 2026; check the license carefully because [dev] is research-only.
Stability AI (SD3 / SD3.5) (Stability AI, 2024) is the open Stable Diffusion family, the original open-weights image-generation ecosystem. Its objective is to provide open weights with a permissive community license at multiple sizes (Medium, Large), which matters as a self-hostable baseline for fine-tuning. The core concept is latent-diffusion at progressively larger scales; SD3.5 was the late-2024 update. Pick SD3 when you need lightweight or older-hardware-friendly options; for top quality, FLUX has overtaken it.
Replicate and fal.ai (2019 / 2023) are model-agnostic hosting platforms that resell most open-weights image and video models behind unified APIs. Their objective is to remove the GPU-management burden of hosting open models yourself, which matters when you want to A/B test FLUX vs SD3 vs Recraft vs Ideogram without provisioning a single GPU. The core concept is per-call billing on warm pools of hosted models. Pick Replicate or fal when you want multi-model agility; for one fixed model at high volume, self-hosting is cheaper.

25.1.2 Video platforms

The 2024-25 video-platform shelf grew several new entrants worth knowing alongside Sora and Veo. ByteDance Seedream / Seedance (2025) is the top-of-leaderboard image and video pair that emerged in 2025. Lightricks LTX-Video (2024-12) is open, real-time video generation; pair with Wan 2.5 (Alibaba, 2025) for the "real-time / sub-second video" tier that most platform tables miss. Hailuo / MiniMax video (2024-25) covers the Chinese-hosted video platforms alongside Kling. The Sora launch (Dec 2024 public release) and the Veo 2 launch (Dec 2024) are the two adoption case studies that anchored the late-2024 platform race.

OpenAI Sora (OpenAI, public release Dec 2024) is OpenAI's flagship video-generation model accessible via API and the ChatGPT product. Its objective is to be the SOTA in coherent, physically-consistent video (good motion, stable objects, plausible physics) up to one minute, which matters because most rival models still produce noticeable temporal artifacts after a few seconds. The core concept is a large transformer trained on video tokens with spatiotemporal attention. Pick Sora when quality and length are essential; for shorter cinematic shots, Runway and Veo are credible alternatives.
Google Veo 2 / 3 (Google DeepMind, 2024-2025) is Google's flagship video model, served via AI Studio and Vertex. Its objective is to compete with Sora on cinematic quality with first-class integration into Google's image-and-video creative tools, which matters in GCP-hosted creative workflows. The core concept is large-scale video diffusion plus a Gemini-class prompt encoder. Pick Veo for GCP-native pipelines or when you want a Sora alternative.
Runway Gen-4 (Runway, 2025) is the product-first video-generation platform, distinguished by deep integration with creative tools (Director Mode, multi-clip storyboards, motion brushes). Its objective is to be a usable post-production tool, not just a generator, which matters for filmmakers and ad agencies. Pick Runway when you need creative-tool integration beyond bare generation; for pure API access, Sora and Veo are simpler.
Kling (Kuaishou, 2024) is the Chinese video-generation product that briefly led the West on physical plausibility before Sora 2 caught up. Its objective is fluid, physically-coherent motion at competitive cost, which matters because Kling is significantly cheaper than Sora and Veo per second. Pick Kling for cost-sensitive workflows; check data-residency carefully as Kling is hosted in mainland China.
Luma Dream Machine (Luma AI, 2024) is Luma's video-generation product, distinguished by image-to-video animation strength (give it a still image, get a coherent motion clip). Its objective is to be the best-of-class for image-to-video, which matters for animating product photography or character illustrations.
Pika (Pika Labs, 2023) is the social-first video-generation product, optimized for short, viral, effects-heavy clips. Its objective is creative-effects breadth (lip-sync, in-paint, scene-extend) rather than raw quality, which matters for social-media content workflows.

25.1.3 Audio and music platforms

ElevenLabs (ElevenLabs, 2022) is the dominant text-to-speech and voice-cloning API, used by most podcast-generation, audiobook, and dubbing products. Its objective is to produce TTS indistinguishable from human speech across hundreds of voices and 30+ languages, which matters when synthetic-sounding audio breaks immersion. The core concept is large-scale TTS models trained on diverse speech plus a few-second voice-cloning capability. Pick ElevenLabs as the production-default TTS in 2026; for cost-sensitive bulk TTS, open models like F5-TTS are catching up. Modern competitors worth A/B testing: Cartesia Sonic (2025) for low-latency real-time speech, Sesame Maya (2025) for natural conversational speech, PlayHT, Hume, and OpenAI TTS round out the modern TTS bench.
Suno v4 / v4.5 (Suno, 2024-25) is the dominant text-to-music generation product, producing full songs (vocals + instruments + mix) from a single prompt. Its objective is to make music generation as easy as text generation, which matters for content-creator workflows where licensed music is expensive. The core concept is large multimodal audio transformers with separate vocal and instrumental conditioning. Pick Suno for full-song generation; check copyright carefully because of the ongoing RIAA lawsuits (filed 2024) against Suno and Udio about training-data provenance, which affect both platform availability and your downstream use rights.
Udio (Udio, 2024) is Suno's main competitor with a tighter focus on remix-style editing and stem export. Its objective is to be the music-production tool, not just a generator, which matters for musicians who want editable outputs rather than finished tracks. Pick Udio for stem-level control; for one-shot full-song generation, Suno is simpler.
Google Lyria (Google DeepMind, 2023; Lyria 2 in 2024) is Google's music-generation API, served through YouTube Music's "Dream Track" and Vertex AI. Its objective is to be the music-generation option for Google Cloud workflows with explicit attention to safety filters and content provenance via SynthID watermarking. Pick Lyria for GCP-native workflows; for general-purpose music, Suno or Udio have more momentum.
OpenAI Whisper API (OpenAI, 2022) is the automatic-speech-recognition API based on the open-weights Whisper model family. Its objective is to provide near-state-of-the-art multilingual ASR (transcription) via API or self-hosting, which matters as the inverse direction of TTS for transcription pipelines. The core concept is encoder-decoder transformer trained on 680K hours of weakly-supervised multilingual audio. Pick Whisper as the default ASR; for self-hosting, the open Whisper Large v3 weights are state-of-the-art.

Note: Aggregators are underrated

Replicate and fal.ai host most open-weights multimodal models behind a uniform API. They are the right answer when you want to A/B test image or video models without managing GPUs.

What's Next?

In the next section, Section 25.2: Libraries & Frameworks, we build on the material covered here.

Further Reading

Multimodal Platforms

OpenAI (2024). "GPT-4o System Card." openai.com/index/gpt-4o-system-card. Reference multimodal LLM platform.

Google DeepMind (2024). "Gemini." arXiv:2312.11805. Reference native-multimodal LLM platform.

Anthropic (2024). "Claude 3.5 Sonnet." anthropic.com/news/claude-3-5-sonnet. Reference vision-capable LLM platform.