Front Matter
FM.2: Reading Pathways

Pathway 20: "I Want to Build Multimodal AI Applications" (Multimodal AI Developer)

Time estimate: 4 to 5 weeks
Difficulty: Intermediate

Target audience: Developers and engineers building applications that use multimodal LLMs to process images, audio, video, and documents alongside text.

Goal: Understand how multimodal models work (vision encoders, cross-attention, contrastive learning), when to use native multimodal models vs. pipeline approaches, and how to build production applications for document AI, image understanding, and audio processing.

Chapter Guide

Recommended Appendices

What Comes Next

Return to the Reading Pathways overview to explore other pathways, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types, then start reading.