Chapter 27: Multimodal Generation | Building Conversational AI with LLMs and Agents

"The world is not made of words alone. To truly understand it, a machine must learn to see, hear, and read the messy, beautiful reality that surrounds us."
Pixel, Synesthetic AI Agent

**Figure 27.0.1**: Text, images, audio, and video converge in a single model that sees, reads, and creates across modalities. Welcome to the era of AI that perceives the full spectrum.

Chapter Overview

Large language models began as text processors, but the frontier of AI has moved toward multimodal systems that generate and understand images, audio, video, and structured documents alongside text. This chapter surveys that landscape: diffusion models that create photorealistic images, speech synthesis systems that clone voices from seconds of audio, video generators that produce cinematic content from text prompts, and document understanding pipelines that extract structured data from scanned pages.

The chapter begins with image generation and vision-language models, exploring how systems like Stable Diffusion, DALL-E, and Midjourney work at an architectural level, along with the vision encoders and multimodal LLMs (GPT-4V, LLaVA, Gemini) that let models see and reason about images. It then covers audio and video generation, including text-to-speech, music synthesis, and the emerging world of text-to-video models like Sora. Finally, it addresses document AI, where OCR, layout analysis, and language models combine to extract information from real-world documents, a capability that feeds directly into LLM-powered applications.

Beyond these core topics, the chapter explores unified omni-model architectures, embodied AI with Vision-Language-Action models, LLM-powered robotics, and 3D neural scene representation. By the end of this chapter, you will understand how modern generative models work across modalities, be able to build pipelines that combine text with images, audio, and video, and know how to choose the right approach for document understanding tasks. These skills complement the embedding and vector search techniques covered earlier, enabling you to retrieve and generate content across multiple modalities.

Big Picture

Modern LLMs increasingly process images, audio, and video alongside text. This chapter covers vision-language models, document AI, and cross-modal architectures, extending your understanding of Transformers (Chapter 4) into the multimodal domain. These capabilities unlock the application patterns surveyed in Chapter 28.

Learning Objectives

Explain the mechanics of diffusion models and flow matching for image generation, building on the transformer architecture foundations
Use Stable Diffusion, DALL-E, and Midjourney APIs for controlled image generation and editing
Understand vision encoders (ViT, CLIP, SigLIP) and how they connect to language models via learned embeddings
Build applications with vision-language models including GPT-4V, LLaVA, and Gemini
Implement text-to-speech pipelines using modern TTS models and voice cloning
Understand architectures behind music generation and text-to-video models
Design document understanding pipelines using OCR and layout-aware models
Compare multimodal approaches and select the right tools for specific LLM applications
Understand unified multimodal (omni) architectures and how they differ from pipeline approaches
Describe Vision-Language-Action (VLA) models for embodied agents, including RT-2, OpenVLA, and sim-to-real transfer
Explain how LLMs serve as planners and coordinators in robotic navigation and multi-robot systems
Understand 3D Gaussian Splatting for neural scene representation and its integration with LLM-guided editing

Prerequisites

Chapter 06: Inside LLMs (transformer architecture, attention mechanisms)
Chapter 07: Training LLMs (pre-training, fine-tuning concepts)
Chapter 10: LLM APIs (API usage patterns, streaming, structured outputs)
Chapter 11: Prompt Engineering (effective prompting strategies)
Basic understanding of neural network architectures (CNNs, encoders/decoders)
Familiarity with Python and pip/conda for installing ML libraries

Sections

What's Next?

In the next chapter, Chapter 28: LLM Applications, we survey major LLM application domains, from vibe-coding and AI-assisted writing to search and creative tools.