Models

Section 25.4

"The model zoo grows three rows a quarter. The reading list grows one row a year. Plan your career accordingly."

Frontier, Model-Zoo-Wrangler AI Agent
Big Picture

The shape of this section mirrors the text-LLM model zoo in Module 14 and Section 12.1, but with image, audio, video, and VLA models added in. It catalogs the open-weights checkpoints (FLUX, SD3, MusicGen, Whisper, OpenVLA, Qwen2-VL, Llava-OneVision) and the closed-API equivalents that your agent or RAG pipeline calls in production.

Prerequisites

This section assumes the multimodal architectures across Chapters 19 through 24 and the open-versus-closed licensing landscape from Section 10.6.

Key Insight: The multimodal model zoo mirrors the text model zoo

The shape of the table on this page is the shape of the text model-zoo discussion in Module 12 and Section 7.1. Frontier closed APIs lead on quality, open weights lead on flexibility, and "speed-focused" or distilled variants exist for both. The forecasting heuristic from those sections applies: today's SOTA is next quarter's commodity, so build behind an interface that lets you swap providers.
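The "build behind an interface" advice can be sketched as follows. This is a minimal illustration, not a prescribed design: `ImageGenerator`, `ClosedApiGenerator`, `LocalCheckpointGenerator`, and `render_thumbnail` are hypothetical names, and both backends are stand-ins that return placeholder bytes where a real implementation would call a hosted API or run a local checkpoint.

```python
from dataclasses import dataclass
from typing import Protocol


class ImageGenerator(Protocol):
    """Hypothetical interface: anything that turns a prompt into image bytes."""

    name: str

    def generate(self, prompt: str) -> bytes: ...


@dataclass
class ClosedApiGenerator:
    """Stand-in for a hosted API client; a real one would make a network call."""

    name: str = "closed-api-placeholder"

    def generate(self, prompt: str) -> bytes:
        return f"[{self.name}] {prompt}".encode()


@dataclass
class LocalCheckpointGenerator:
    """Stand-in for a locally loaded open-weights checkpoint."""

    name: str = "open-weights-placeholder"

    def generate(self, prompt: str) -> bytes:
        return f"[{self.name}] {prompt}".encode()


def render_thumbnail(generator: ImageGenerator, prompt: str) -> bytes:
    # Application code depends only on the interface, never on a vendor SDK,
    # so replacing the provider does not touch this function.
    return generator.generate(prompt)
```

Because `render_thumbnail` accepts any object with a matching `generate` method, moving from a closed API to an open checkpoint means changing the object passed in, not the call sites.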

Key Insight: Same loss, different artifacts

Imagen 4, Veo 3, Suno, Whisper, OpenVLA: every model in this section is trained to maximize a likelihood (equivalently, to minimize a negative log-likelihood or a surrogate for it) over some combination of text, audio, pixel, and action data. Different artifacts come out the other end, but the training paradigm (maximize log-likelihood under a learned model with cross-entropy or score-matching variants, scale data and parameters, fine-tune for alignment) is identical to the recipe in Section 6.2. The model zoo looks varied; the engineering pattern that produces it is monotonous.
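To make the shared objective concrete, here is a small sketch of the cross-entropy case only, written with the standard library. It assumes the model emits logits over a discrete vocabulary; what the vocabulary stands for (text tokens, audio codes, discretized action bins) is an assumption of the example and does not change the loss. The function name `negative_log_likelihood` and the toy logits are illustrative, and the score-matching variant used by diffusion-style models is not shown.

```python
import math
from typing import Sequence


def negative_log_likelihood(logits: Sequence[float], target: int) -> float:
    """Cross-entropy for one discrete target: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


# The same loss applies whatever the vocabulary represents.
# Toy logits only; no real model produced these numbers.
text_loss = negative_log_likelihood([2.0, 0.5, -1.0], target=0)         # e.g. a text token
action_loss = negative_log_likelihood([0.0, 0.0, 0.0, 0.0], target=2)   # e.g. an action bin
```

Minimizing this quantity averaged over a dataset is the same as maximizing the log-likelihood of the data, which is the sense in which the models above share one objective.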

The multimodal model zoo is wider than the text-only zoo. This section enumerates the specific checkpoints worth knowing per modality.

25.4.1 Image generation models

25.4.2 Video generation models

25.4.3 Audio and music models

25.4.4 Comparing the models

Table 25.4.1: Multimodal models (mid-2026).

| Modality | Closed leader | Open leader | Speed-focused |
|----------|---------------|-------------|---------------|
| Image | Imagen 4 / Midjourney v7 | FLUX.1-dev | FLUX.1-schnell, SDXL Turbo |
| Video | Sora / Veo 2 | HunyuanVideo 2, Wan 2.5 | LTX-Video (real-time) |
| TTS | ElevenLabs v3 | F5-TTS | XTTS-v2 |
| ASR | Whisper API | Whisper large-v3 | faster-whisper |
| Music | Suno v4.5 / Udio | MusicGen | Stable Audio Open |

Note: Multimodal model lifespans are short

SOTA changes every 3 to 6 months in image and video. Build your pipeline to swap models behind an interface; the model you pick this quarter is probably not the model you ship next year.
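One way to keep model choice swappable is to treat the comparison table as data and not as hard-coded strings scattered through the pipeline. The sketch below assumes a hypothetical `MODEL_ZOO` registry and `pick_model` helper; the values are display labels copied from the table, not verified API or checkpoint identifiers, and the tier names `closed`, `open`, and `fast` are shorthand introduced here.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (modality, tier)

# Labels copied from the comparison table; first entry in each list is the default.
# These are display names, not verified API or checkpoint identifiers.
MODEL_ZOO: Dict[Key, List[str]] = {
    ("image", "closed"): ["Imagen 4", "Midjourney v7"],
    ("image", "open"): ["FLUX.1-dev"],
    ("image", "fast"): ["FLUX.1-schnell", "SDXL Turbo"],
    ("video", "closed"): ["Sora", "Veo 2"],
    ("video", "open"): ["HunyuanVideo 2", "Wan 2.5"],
    ("video", "fast"): ["LTX-Video"],
    ("tts", "closed"): ["ElevenLabs v3"],
    ("tts", "open"): ["F5-TTS"],
    ("tts", "fast"): ["XTTS-v2"],
    ("asr", "closed"): ["Whisper API"],
    ("asr", "open"): ["Whisper large-v3"],
    ("asr", "fast"): ["faster-whisper"],
    ("music", "closed"): ["Suno v4.5", "Udio"],
    ("music", "open"): ["MusicGen"],
    ("music", "fast"): ["Stable Audio Open"],
}


def pick_model(
    modality: str,
    tier: str,
    overrides: Optional[Dict[Key, str]] = None,
) -> str:
    """Return the configured model label, letting a config override win."""
    key = (modality.lower(), tier.lower())
    if overrides and key in overrides:
        return overrides[key]
    if key not in MODEL_ZOO:
        raise ValueError(f"no model registered for {key}")
    return MODEL_ZOO[key][0]
```

With this shape, next quarter's replacement is an entry in `overrides` (for example, loaded from a config file) and no pipeline code changes.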

What's Next?

In the next section, Section 25.5: External Reading & Communities, we build on the material covered here.

Further Reading

Vision-Language Models

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485. Reference open vision-language model.
Beyer, L., Steiner, A., Pinto, A. S., et al. (2024). "PaliGemma: A versatile 3B VLM for transfer." arXiv:2407.07726. Reference 2024 open VLM.
Chen, Z., Wu, J., Wang, W., et al. (2024). "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." CVPR 2024. arXiv:2312.14238. Reference large open vision-language model.