Tools of the Trade: Multimodal Stack

Consolidated reference: platforms, libraries, datasets, models, and external resources for this part.

Chapter opener illustration: Tools of the Trade: Multimodal Stack.

"Every modality is a tax on attention. The trick is making the user forget which one they are looking at."

Pip, Multimodal-Outfitted AI Agent

Looking Back

Chapters 20 through 24 walked the modalities one by one. This chapter is the multimodal toolkit: which library handles audio well, which one handles documents, which one is opinionated about VLMs, and how to wire them together without a YAML graveyard.

Big Picture

Part V covered four generative modalities: image (Imagen 4, FLUX.1, SD3, Midjourney, DALL-E), video (Veo 2, Sora 2, Runway Gen-4, Kling), speech (ElevenLabs, Whisper), and music (Suno v5, MusicGen). Each modality has its own shelf of platforms, libraries, datasets, and models. This chapter consolidates them.

Chapter Overview

Part V covered audio, video, document AI, VLMs, 3D, and VLA models. This chapter consolidates the multimodal toolchain: closed-API platforms (where the frontier models live) and open-weight platforms (for models you run yourself), Hugging Face diffusers as the multimodal analog of transformers, the canonical pretraining datasets (LAION-5B, WebVid, AudioSet) and evaluation datasets (COCO, VQA, MMMU), the model zoo organized by modality, and the external venues that keep the multimodal toolbox current.

Multimodal Tools of the Trade mirrors the text-LLM toolbox of Part II and Part III, but the model count is larger and the cost-quality envelope changes faster. Use this chapter as the bookmarkable index for every recipe in Part V.

Note: Learning Objectives

Library Shortcut

For all four modalities, two Hugging Face libraries wrap the open-source side:

pip install diffusers transformers

diffusers ships most major open diffusion architectures (Stable Diffusion, SD3, FLUX, AudioLDM) behind a uniform pipeline API. Models that are not diffusion-based, such as Whisper and MusicGen, load through transformers instead. For the closed APIs you also need the providers' SDKs.
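One way to keep the wiring out of config files is a small in-code lookup table that records which library loads which open model. The sketch below is illustrative rather than a recommended design: the OpenModel class, the REGISTRY table, and the load helper are hypothetical names, and the Hub ids are example checkpoints chosen for illustration (any compatible checkpoint would do). It assumes torch is installed alongside the two libraries. Video entries would follow the same pattern and are left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenModel:
    """Hypothetical record describing how to load one open-weight model."""
    library: str  # "diffusers" or "transformers"
    task: str     # task label; for transformers entries, the pipeline task name
    repo_id: str  # Hugging Face Hub id (example choice, not prescribed by the text)


# Hypothetical registry: diffusion models go through diffusers,
# Whisper and MusicGen go through transformers.
REGISTRY = {
    "image": OpenModel("diffusers", "text-to-image",
                       "stabilityai/stable-diffusion-xl-base-1.0"),
    "speech": OpenModel("transformers", "automatic-speech-recognition",
                        "openai/whisper-small"),
    "music": OpenModel("transformers", "text-to-audio",
                       "facebook/musicgen-small"),
}


def load(modality: str):
    """Return a ready-to-call pipeline for the given modality.

    Imports are deferred so the registry can be inspected without
    either library installed. Weights are downloaded on first use.
    """
    spec = REGISTRY[modality]  # raises KeyError for an unregistered modality
    if spec.library == "diffusers":
        from diffusers import DiffusionPipeline
        return DiffusionPipeline.from_pretrained(spec.repo_id)
    from transformers import pipeline
    return pipeline(spec.task, model=spec.repo_id)
```

Calling load("speech") would return a transformers pipeline that accepts an audio file path, while load("image") would return a diffusers pipeline that accepts a text prompt. Both calls download model weights the first time they run, so expect a delay and several gigabytes of disk use.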

Sections in This Chapter

Prerequisites

What Comes Next

Next: Chapter 26: AI Agent Foundations, opening Part VI. Parts I to V built models that perceive, reason, and produce, but each call is one-shot. Part VI is where the model gets agency: it plans, calls tools, remembers, and coordinates with other agents and humans. The shift is from "predict the next token" to "decide the next action".