Part V: Multimodal LLMs

Part Overview

This part covers vision-language and omni models; image, video, and audio generation; document understanding; 3D; and embodied AI, including vision-language-action (VLA) models and robotics.

Chapters

What's Next?

This part begins with Chapter 20: Audio, Music, and Video Generation. Each chapter builds on the previous one, so we recommend reading Part V in order.