Chapter 25: Tools of the Trade: Multimodal Stack

Chapter opener illustration: Tools of the Trade: Multimodal Stack.

"Every modality is a tax on attention. The trick is making the user forget which one they are looking at."
Pip, Multimodal-Outfitted AI Agent

Looking Back

Chapters 20 through 24 walked the modalities one by one. This chapter is the multimodal toolkit: which library handles audio well, which one handles documents, which one is opinionated about VLMs, and how to wire them together without a YAML graveyard.

Big Picture

Part V covered four generative modalities: image (Imagen 4, FLUX.1, SD3, Midjourney, DALL-E), video (Veo 2, Sora 2, Runway Gen-4, Kling), speech (ElevenLabs, Whisper), and music (Suno v5, MusicGen). Each modality has its own platform shelf, library, dataset, and model. This chapter consolidates them.

Chapter Overview

Part V covered audio, video, document AI, VLMs, 3D, and VLA models. This chapter consolidates the multimodal toolchain: closed-API platforms for frontier models and open-weight platforms you run yourself, Hugging Face diffusers as the multimodal analog of transformers, the canonical pretraining datasets (LAION-5B, WebVid, AudioSet) and evaluation datasets (COCO, VQA, MMMU), the model zoo organized by modality, and the external venues that keep the multimodal toolbox current.

Multimodal Tools of the Trade mirrors the text-LLM toolbox of Part II and Part III, but the model count is larger and the cost-quality envelope changes faster. Use this chapter as the bookmarkable index for every recipe in Part V.

Note: Learning Objectives

Compare closed-API and open-weight multimodal platforms across modality coverage and cost.
Use Hugging Face diffusers and transformers together for a multimodal pipeline.
Pick the right teaching dataset (LAION-5B, WebVid, AudioSet, COCO, VQA, MMMU) for a given exercise.
Navigate the multimodal model zoo by modality, license, and inference footprint.
Identify the conferences, blogs, and communities that maintain the multimodal stack.
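
The fourth objective, navigating the model zoo by modality, license, and inference footprint, amounts to a filter over a catalog. The sketch below shows one way to express that filter. The ZooEntry type, the shortlist helper, and every row in ZOO are hypothetical placeholders invented for illustration; the names, licenses, and VRAM figures do not describe real models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZooEntry:
    """One row of a hypothetical model catalog."""
    name: str
    modality: str        # e.g. "image", "video", "speech", "music"
    license: str         # license identifier as the catalog records it
    min_vram_gb: float   # rough inference footprint (placeholder values)


def shortlist(zoo, modality, allowed_licenses, vram_budget_gb):
    """Return entries matching modality, license, and VRAM budget, smallest first."""
    hits = [
        e for e in zoo
        if e.modality == modality
        and e.license in allowed_licenses
        and e.min_vram_gb <= vram_budget_gb
    ]
    return sorted(hits, key=lambda e: e.min_vram_gb)


# Placeholder rows: invented names and numbers, not real models.
ZOO = [
    ZooEntry("example-image-small", "image", "apache-2.0", 8.0),
    ZooEntry("example-image-large", "image", "non-commercial", 24.0),
    ZooEntry("example-music", "music", "cc-by-nc-4.0", 6.0),
    ZooEntry("example-speech", "speech", "mit", 2.0),
]

picks = shortlist(ZOO, "image", {"apache-2.0", "mit"}, vram_budget_gb=16.0)
print([e.name for e in picks])  # prints ['example-image-small']
```

Keeping the license check as an explicit allow-list rather than a boolean "open" flag is a deliberate choice here: open-weight models often carry non-commercial or research-only terms, so the filter should name the licenses a project can actually accept.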

Library Shortcut

For all four modalities, two libraries cover the open-source side:

pip install diffusers transformers

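diffusers ships the major open diffusion architectures (Stable Diffusion, SD3, FLUX, AudioLDM) behind a uniform pipeline API. MusicGen is an autoregressive transformer rather than a diffusion model, so it loads through transformers instead. For the closed APIs you also need the providers' SDKs.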
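
As a minimal sketch of how the two libraries can sit behind one entry point, the registry below routes each modality to either diffusers or transformers. The OpenModel type, the REGISTRY table, and the load helper are hypothetical names introduced for illustration, and the Hub model ids are example choices, not recommendations from this chapter. The sketch assumes that DiffusionPipeline.from_pretrained resolves the concrete pipeline class for the diffusion checkpoints and that MusicGen and Whisper load through the transformers pipeline factory. The imports are deferred so that the registry can be inspected without either library installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenModel:
    library: str    # "diffusers" or "transformers"
    task: str       # transformers pipeline task; empty for diffusers entries
    model_id: str   # Hugging Face Hub id (illustrative choice)


# Illustrative registry: one open-weight example per modality.
REGISTRY = {
    "image": OpenModel("diffusers", "", "black-forest-labs/FLUX.1-schnell"),
    "audio": OpenModel("diffusers", "", "cvssp/audioldm2"),
    "music": OpenModel("transformers", "text-to-audio", "facebook/musicgen-small"),
    "speech": OpenModel(
        "transformers", "automatic-speech-recognition", "openai/whisper-small"
    ),
}


def load(modality: str):
    """Load the registered pipeline for a modality (downloads weights on first use)."""
    spec = REGISTRY[modality]  # raises KeyError for an unregistered modality
    if spec.library == "diffusers":
        # Deferred import: only needed when a diffusion model is requested.
        from diffusers import DiffusionPipeline
        return DiffusionPipeline.from_pretrained(spec.model_id)
    # Everything else goes through the transformers pipeline factory.
    from transformers import pipeline
    return pipeline(spec.task, model=spec.model_id)
```

Calling load("speech") would return a transformers pipeline that accepts a path to an audio file, while load("image") would return a diffusers pipeline that accepts a text prompt. Both calls download several gigabytes of weights, and the larger checkpoints need a GPU with enough memory, so treat this as a wiring pattern rather than something to run on a laptop unchanged.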

Prerequisites

At least one of Chapters 20 through 24
LLM API tooling from Chapter 14
Python and shell comfort

What Comes Next

Next: Chapter 26: AI Agent Foundations, opening Part VI. Parts I to V built models that perceive, reason, and produce, but each call is one-shot. Part VI is where the model gets agency: it plans, calls tools, remembers, and coordinates with other agents and humans. The shift is from "predict the next token" to "decide the next action".