Pathway 20: "I Want to Build Multimodal AI Applications" (Multimodal AI Developer)
Target audience: Developers and engineers building applications that process images, audio, video, and documents alongside text using multimodal LLMs
Goal: Understand how multimodal models work (vision encoders, cross-attention, contrastive learning), when to use native multimodal models vs. pipeline approaches, and how to build production applications for document AI, image understanding, and audio processing.
Chapter Guide
- Skim Ch 04: The Transformer Architecture – cross-attention and encoder-decoder structure
- Skim Ch 07: The Modern LLM Landscape – survey of multimodal model families
- Focus Ch 10: Working with LLM APIs – multimodal API patterns for image and audio inputs (see the API sketch after this list)
- Focus Ch 11: Prompt Engineering – prompting techniques for vision and audio inputs
- Skim Ch 14: Fine-Tuning Fundamentals – fine-tune vision-language models if needed
- Focus Ch 19: Embeddings and Vector Databases – multimodal embeddings and CLIP-based search (see the CLIP sketch after this list)
- Skim Ch 20: RAG – multimodal RAG with images and documents
- Focus Ch 27: Multimodal Models – your core chapter, covering vision-language, audio, and diffusion models
- Focus Ch 28: LLM Applications – document AI, code generation, and applied patterns
- Focus Ch 29: Evaluation – evaluate multimodal outputs systematically
- Skim Ch 31: Production Engineering – deploy multimodal pipelines at scale
- Skim Ch 34: Emerging Architectures – next-generation multimodal architectures
- Skim Ch 35: AI and Society – societal implications of multimodal AI
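To make the Ch 10 material concrete, here is a minimal sketch of sending an image to a multimodal chat API, using the OpenAI Python SDK as one example of the pattern covered there. The model name "gpt-4o" and the file "invoice.png" are illustrative assumptions, not requirements from the book.

```python
# Sketch: image input to a multimodal chat API (one example of the Ch 10 pattern).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the image as a base64 data URL so it can travel inside the request body.
with open("invoice.png", "rb") as f:  # "invoice.png" is a placeholder file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the vendor name and total from this invoice."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The same request shape extends to multiple images per message, which is the basis of many document AI workflows discussed in Ch 28.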
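And for the Ch 19 material, a minimal sketch of CLIP-based multimodal embedding search: encode an image and candidate captions into the same embedding space, then rank by cosine similarity. The checkpoint "openai/clip-vit-base-patch32" and the file "photo.jpg" are illustrative placeholders.

```python
# Sketch: CLIP embeddings for cross-modal (image-to-text) search.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["an invoice document", "a cat on a sofa", "a city skyline at night"]
image = Image.open("photo.jpg")  # placeholder image path

# Encode text and image into the shared embedding space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity: higher means the caption better matches the image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(0)
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.3f}  {caption}")
```

In practice you would store the image embeddings in a vector database and query them with text embeddings, which is the multimodal search setup Ch 19 and Ch 20 build on.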
Recommended Appendices
- Appendix K: HuggingFace: Transformers, Datasets, and Hub – access multimodal models and datasets on HuggingFace
- Appendix V: Tooling Ecosystem – survey the broader tooling ecosystem for multimodal apps
What Comes Next
Return to the Reading Pathways overview to explore other pathways, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types, then start reading.