Section 25.4: Models

"The model zoo grows three rows a quarter. The reading list grows one row a year. Plan your career accordingly."
Frontier, Model-Zoo-Wrangler AI Agent

Big Picture

The shape of this section mirrors the text-LLM model zoo in Module 14 and Section 12.1, but with image, audio, video, and vision-language-action (VLA) models added in. It catalogs the open-weights checkpoints (FLUX, SD3, MusicGen, Whisper, OpenVLA, Qwen2-VL, LLaVA-OneVision) and the closed-API equivalents that your agent or RAG pipeline calls in production.

Prerequisites

This section assumes familiarity with the multimodal architectures covered in Chapters 19 through 24 and the open-versus-closed licensing landscape from Section 10.6.

Key Insight

The multimodal model zoo mirrors the text model zoo

The shape of the table on this page is the shape of the text model-zoo discussion in Module 12 and Section 7.1. Frontier closed APIs lead on quality, open weights lead on flexibility, and "speed-focused" or distilled variants exist for both. The forecasting heuristic from those sections applies: today's SOTA is next quarter's commodity, so build behind an interface that lets you swap providers.
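
The swap-behind-an-interface advice can be sketched as follows. This is a minimal illustration under stated assumptions, not a production design: `ImageGenerator`, `FakeClosedProvider`, `FakeOpenProvider`, and `render_thumbnail` are hypothetical names, and the two providers return placeholder bytes rather than calling any real API.

```python
from typing import Protocol


class ImageGenerator(Protocol):
    """Hypothetical interface: anything that turns a prompt into image bytes."""

    def generate(self, prompt: str) -> bytes: ...


class FakeClosedProvider:
    """Stand-in for a closed API client; a real one would issue an HTTP call."""

    def generate(self, prompt: str) -> bytes:
        return f"closed:{prompt}".encode()


class FakeOpenProvider:
    """Stand-in for a self-hosted open-weights pipeline."""

    def generate(self, prompt: str) -> bytes:
        return f"open:{prompt}".encode()


def render_thumbnail(generator: ImageGenerator, prompt: str) -> bytes:
    # Application code depends only on the interface, so swapping providers
    # is a one-line change at the call site (or in configuration).
    return generator.generate(prompt)


if __name__ == "__main__":
    for provider in (FakeClosedProvider(), FakeOpenProvider()):
        print(render_thumbnail(provider, "a red bicycle"))
```

Because `Protocol` uses structural typing, a new provider only needs a matching `generate` method; it does not have to inherit from anything in your codebase.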

Key Insight: Same loss, different artifacts

Imagen 4, Veo 3, Suno, Whisper, OpenVLA: every model in this section is trained to maximize a likelihood (equivalently, to minimize a negative log-likelihood or a tractable surrogate for it) under some combination of text / audio / pixel / action data. Different artifacts come out the other end, but the training paradigm (maximize log-likelihood under a learned model with cross-entropy or score-matching variants, scale data and parameters, fine-tune for alignment) is identical to the recipe in Section 6.2. The model zoo looks varied; the engineering pattern that produces it is monotonous.
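
The cross-entropy and score-matching objectives can be written side by side. The notation here is an assumption of this sketch rather than a quotation from Section 6.2: x is a training example, c its conditioning (for example a text prompt), theta the parameters, and epsilon_theta a noise-prediction network with noise schedule (alpha_t, sigma_t). The second line is the common noise-prediction form of denoising score matching; individual models vary the schedule, weighting, and parameterization.

```latex
\begin{align}
\mathcal{L}_{\text{CE}}(\theta)
  &= -\,\mathbb{E}_{(x,c)}\Big[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, c)\Big] \\
\mathcal{L}_{\text{DSM}}(\theta)
  &= \mathbb{E}_{(x,c),\, t,\, \epsilon \sim \mathcal{N}(0, I)}
     \Big[\big\lVert \epsilon - \epsilon_\theta(\alpha_t x + \sigma_t \epsilon,\, t,\, c) \big\rVert_2^2\Big]
\end{align}
```

Minimizing the first is exactly maximizing log-likelihood over token sequences (text tokens, audio codes, action tokens). The second, under a suitable weighting over t, corresponds to a variational bound on the log-likelihood of a diffusion model.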

The multimodal model zoo is wider than the text-only zoo. This section enumerates the specific checkpoints worth knowing per modality.

25.4.1 Image generation models

Imagen 4 Ultra and Imagen 4 Fast (Google DeepMind, 2025) are Google's latest closed image-generation models in two tiers (Ultra for quality, Fast for latency), currently the SOTA on photorealism and prompt fidelity. Pick Imagen 4 Ultra for photorealism inside GCP workflows; for stylized work, Midjourney v7 leads.
GPT-image-1 (OpenAI, 2025) and DALL-E 3 (OpenAI, 2023): GPT-image-1 is OpenAI's native image model, available through the API since 2025; informal references to "GPT-Image" mean this model. Their objective is image generation tightly coupled with conversational prompting and multi-turn editing.
Nano Banana (Google, 2025): the public nickname for Gemini 2.5 Flash Image, an image-editing-specialized model in the Google family; the right pick for inpainting and contextual edits rather than full text-to-image.
Midjourney v7 (Midjourney, 2025) is the seventh-generation Midjourney model, retaining the community lead on aesthetic quality by user-preference votes. Its objective is to be the aesthetic-leader image model with a distinctive "Midjourney look", which matters when artistic quality beats prompt-fidelity. Pick Midjourney v7 for art-direction, illustration, and concept-design workflows.
FLUX.1 [dev / pro / schnell] (Black Forest Labs, 2024) is the open-weights image-generation family that took the open-SOTA crown from Stable Diffusion. Its objective is to provide truly competitive open weights at three speed/quality tiers (schnell = 4-step, dev = full quality, pro = API-only premium), which matters when you need to self-host LoRAs and custom fine-tunes. Pick FLUX as the open default in 2026; note dev's non-commercial license restricts some uses.
Stable Diffusion 3 / SD3.5 (Stability AI, 2024) is the latest Stable Diffusion family. Its objective is to be the open weights with the broadest ecosystem of LoRAs, ControlNets, and tutorials, which matters when ecosystem maturity counts for more than a top-line benchmark score. Pick SD3.5 when you need an existing LoRA ecosystem; for raw quality, FLUX has overtaken it.
SDXL Turbo and FLUX.1 Schnell: the speed-focused distilled variants. Schnell is a 4-step distillation of FLUX.1 [dev]; SDXL Turbo is a 1-4 step distillation of SDXL. The distillation lineage matters because output style inherits from the base.
HiDream-I1 (2025-Q2): open image model competitive with FLUX.
BAGEL, OmniGen, Janus-Pro: image-and-text generative unified models (emerging architectures, 2024-25) that produce both modalities from a single model. Worth knowing as a trend even if not production-ready in mid-2026.
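
To make the schnell-versus-dev tiers concrete, here is a minimal self-hosting sketch using the Hugging Face `diffusers` `FluxPipeline`. It is an illustration under assumptions, not a tuned recipe: the step counts and guidance values follow the examples on the public model cards, `TIER_SETTINGS` and `generate` are hypothetical names, and the dev checkpoint is gated behind a license acceptance on the Hub.

```python
# Sampler settings per tier. Values follow the public model-card examples;
# treat them as starting points, not tuned recommendations.
TIER_SETTINGS = {
    "schnell": {
        "repo": "black-forest-labs/FLUX.1-schnell",
        "num_inference_steps": 4,
        "guidance_scale": 0.0,  # model-card example runs schnell without guidance
    },
    "dev": {
        "repo": "black-forest-labs/FLUX.1-dev",
        "num_inference_steps": 50,
        "guidance_scale": 3.5,
    },
}


def generate(prompt: str, tier: str = "schnell", out_path: str = "out.png") -> str:
    """Generate one image with the chosen FLUX.1 tier and save it to disk."""
    # Heavy imports are deferred so the settings table can be inspected
    # on a machine without torch or diffusers installed.
    import torch
    from diffusers import FluxPipeline

    cfg = TIER_SETTINGS[tier]
    pipe = FluxPipeline.from_pretrained(cfg["repo"], torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use
    image = pipe(
        prompt,
        num_inference_steps=cfg["num_inference_steps"],
        guidance_scale=cfg["guidance_scale"],
    ).images[0]
    image.save(out_path)
    return out_path
```

Calling `generate("a lighthouse at dusk", tier="schnell")` downloads the weights on first use, so expect a long first run and a GPU with substantial memory.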

25.4.2 Video generation models

Sora (OpenAI, Dec 2024 public release) is OpenAI's flagship video model. It offers SOTA video quality, longer durations, plausible physics, and strong prompt adherence. The core concept is large-scale spatiotemporal transformer training with strong safety filters.
Veo 2 (Google DeepMind, 2024) is Google's flagship video model, competitive with Sora on quality. Its objective is high-resolution (4K-capable) cinematic video with extensive prompt-control vocabulary (camera moves, focus, lens specs), which matters for filmmaker workflows. Pick Veo for GCP-native or filmmaker-vocabulary workflows.
Runway Gen-4 (Runway, 2025) is Runway's fourth-generation video model. Its objective is creative-tool integration over raw quality (Director Mode, motion brush, multi-clip story editing), which matters when editing capability beats single-shot generation. Pick Runway Gen-4 for production workflows that need integrated editing.
Kling (Kuaishou, 2024) is the Chinese flagship video model, which briefly led on physics before Sora 2 caught up. Its objective is competitive video generation at significantly lower cost per second; pick it when cost-per-output matters and Chinese hosting is acceptable.
HunyuanVideo and HunyuanVideo 2 (Tencent, 2024-25): the open-weights video SOTA series; v2 (2025) is the current open competitor to Sora/Veo for self-hosted video generation. Expect to dedicate multiple high-end GPUs.
Wan 2.1 / Wan 2.5 (Alibaba, 2024-25): the Alibaba open video model series; Wan 2.5 (2025) is the current frontier, including a real-time tier.
ByteDance Seedance (2025): leaderboard-topping video model that emerged in 2025.
Lightricks LTX-Video (2024-12): open, real-time video generation; the right pick when sub-second generation is the binding constraint.

25.4.3 Audio and music models

ElevenLabs Multilingual v3 (ElevenLabs, 2024-2025) is the multilingual production TTS leader, covering 70+ languages with consistent voice identity across languages. Its objective is to be the high-quality multilingual TTS service for podcasts, audiobooks, and dubbing, which matters when you need one voice to speak many languages naturally. Pick ElevenLabs v3 for production TTS in 2026; for self-hosted, F5-TTS is the open alternative.
Whisper large-v3 (OpenAI, 2023) is the open SOTA ASR model, covering roughly 100 languages and supporting translation to English. Its objective is to provide near-state-of-the-art transcription with open weights, which matters when you want self-hostable ASR. Pick Whisper large-v3 as the default ASR; for high throughput, faster-whisper reimplements it on CTranslate2 and reports up to 4x faster inference at the same accuracy.
Suno v4 / v4.5 (Suno, 2024-25) is the SOTA closed music-generation model producing full songs with vocals, instruments, and mix. Pick Suno v4.5 for full-song generation; for open alternatives, MusicGen and Stable Audio Open lag in quality. Stable Audio 2.5 (2025) is the latest Stability AI audio release.
Native-audio LLMs: Gemini 2.5 Native Audio, GPT-4o native audio, and Sesame CSM represent the new multimodal audio-LLM tier where a single model handles text and audio without a separate TTS module. They are the right pick for real-time conversational audio applications.
MusicGen (Meta, 2023) is the open-weights music generation model from Meta's audiocraft library, generating instrumental tracks from text or melody prompts. Its objective is to provide open music-generation research weights, which matters for academic study; for production, Suno's quality is significantly higher. Pick MusicGen for research or for open instrumental-only workflows.
Stable Audio Open (Stability AI, 2024) is Stability AI's open audio generation model. Its objective is to provide open weights for short-form sound-effect and music generation, which matters for game-audio and sound-design pipelines. Pick Stable Audio Open for sound-effect generation; for full-song music, neither it nor MusicGen matches Suno.
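
As an illustration of the self-hosted ASR path, the sketch below transcribes a file with `faster-whisper` and prints timestamped lines. It assumes `pip install faster-whisper` and a local audio file; `format_timestamp` and `transcribe` are hypothetical helper names, and the device and precision settings are left at the library defaults rather than tuned.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm for transcript lines."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def transcribe(path: str, model_size: str = "large-v3") -> list[str]:
    """Transcribe an audio file and return one formatted line per segment."""
    # Deferred import so the formatting helper works without the library.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="auto", compute_type="default")
    # transcribe() returns a lazy generator of segments plus metadata;
    # decoding happens as the generator is consumed below.
    segments, info = model.transcribe(path, beam_size=5)
    print(f"detected language: {info.language}")
    return [
        f"[{format_timestamp(s.start)} -> {format_timestamp(s.end)}] {s.text.strip()}"
        for s in segments
    ]


if __name__ == "__main__":
    # "meeting.mp3" is a placeholder path for this example.
    for line in transcribe("meeting.mp3"):
        print(line)
```

Swapping `model_size` for a smaller checkpoint is the usual first lever when latency or memory is the constraint.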

25.4.4 Comparing the models

Table 25.4.1: Multimodal models (mid-2026).

Modality	Closed leader	Open leader	Speed-focused
Image	Imagen 4 / Midjourney v7	FLUX.1-dev	FLUX.1-schnell, SDXL Turbo
Video	Sora / Veo 2	HunyuanVideo 2, Wan 2.5	LTX-Video (real-time)
TTS	ElevenLabs v3	F5-TTS	XTTS-v2
ASR	Whisper API	Whisper large-v3	faster-whisper
Music	Suno v4.5 / Udio	MusicGen	Stable Audio Open
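
One way to act on the table is to keep it as data next to your provider interface, so a model swap becomes a configuration change. The sketch below copies the entries of Table 25.4.1 into a dictionary; `ZOO` and `candidates` are hypothetical names, and the strings are labels for humans, not API model identifiers.

```python
# Entries mirror Table 25.4.1: modality -> tier -> candidate models.
ZOO = {
    "image": {
        "closed": ["Imagen 4", "Midjourney v7"],
        "open": ["FLUX.1-dev"],
        "fast": ["FLUX.1-schnell", "SDXL Turbo"],
    },
    "video": {
        "closed": ["Sora", "Veo 2"],
        "open": ["HunyuanVideo 2", "Wan 2.5"],
        "fast": ["LTX-Video"],
    },
    "tts": {
        "closed": ["ElevenLabs v3"],
        "open": ["F5-TTS"],
        "fast": ["XTTS-v2"],
    },
    "asr": {
        "closed": ["Whisper API"],
        "open": ["Whisper large-v3"],
        "fast": ["faster-whisper"],
    },
    "music": {
        "closed": ["Suno v4.5", "Udio"],
        "open": ["MusicGen"],
        "fast": ["Stable Audio Open"],
    },
}


def candidates(modality: str, tier: str) -> list[str]:
    """Return the table's candidates for a modality and tier (case-insensitive)."""
    try:
        return list(ZOO[modality.lower()][tier.lower()])
    except KeyError:
        raise ValueError(f"unknown modality/tier: {modality!r}/{tier!r}") from None


if __name__ == "__main__":
    print(candidates("ASR", "open"))
    print(candidates("image", "fast"))
```

Keeping the table in one place also gives you a single diff to review each quarter when the leaders change.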

Note: Multimodal model lifespans are short

SOTA changes every 3 to 6 months in image and video. Build your pipeline to swap models behind an interface; the model you pick this quarter is probably not the model you ship next year.

What's Next?

In the next section, Section 25.5: External Reading & Communities, we build on the material covered here.

Further Reading

Vision-Language Models

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485. Reference open vision-language model.

Beyer, L., Steiner, A., Pinto, A. S., et al. (2024). "PaliGemma: A versatile 3B VLM for transfer." arXiv:2407.07726. Reference 2024 open VLM.

Chen, Z., Wu, J., Wang, W., et al. (2024). "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." CVPR 2024. arXiv:2312.14238. Reference for large open vision-language model.