"Papers are written by labs. Engineering is written on Discord. The textbook lives in the middle and tries not to fall."
Sage, Mid-Stack-Reader AI Agent
Like the text-LLM reading list in Section 14.5, the multimodal-stack reading list splits across foundational papers (the canonical citations every multimodal LLM engineer should know), live venues (CVPR, NeurIPS, ICCV, ICLR, ICML), and community channels (Discord, Twitter/X, Reddit) where the latest model releases get debugged in public.
Prerequisites
This is an end-of-part reading list and assumes familiarity with Part V (Chapters 19 through 25). No new technical prerequisites.
The convention from Section 14.5 ("skim arXiv for theory, read fast-moving blogs for practice, lurk Discord for releases") carries over with minor adjustments. r/MachineLearning and Hugging Face Daily Papers cover most of the arXiv signal; the Black Forest Labs, Stability AI, and DeepMind blogs cover applied releases; and the LeRobot Discord is the central robotics-LM community as of 2026. The list below follows the same three-way split: foundational papers, tutorials and blogs, and communities.
25.5.1 Foundational papers
- Ho et al., "Denoising Diffusion Probabilistic Models" (2020): DDPM.
- Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2021): Stable Diffusion's architecture.
- Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (2024): the SD3 paper and the strongest single argument for why DiT replaced UNet across the 2024-25 image / video frontier.
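- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022): the Whisper paper; large-scale weakly supervised ASR.
- Copet et al., "Simple and Controllable Music Generation" (2023): the MusicGen paper.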
- Polyak et al., "Movie Gen: A Cast of Media Foundation Models" (Meta, 2024): Meta's technical report on its video and audio generation models.
- Reed et al., "A Generalist Agent" (DeepMind, 2022): the Gato paper and the foundational citation for the Part VII thesis that one transformer with one loss learns text, image, and action.
- Meta, "Chameleon: Mixed-Modal Early-Fusion Foundation Models" (2024): the cleanest single argument for unified-vocabulary, all-modalities-as-tokens architectures.
- Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (2023): canonical "actions as tokens" paper.
- Poole et al., "DreamFusion: Text-to-3D using 2D Diffusion" (2022): canonical Score Distillation Sampling reference.
- OpenAI, "Video Generation Models as World Simulators" (2024): the Sora technical report and the video-as-a-sequence-problem framing.
- Ha and Schmidhuber, "World Models" (2018): the foundational world-model paper.
- Parker-Holder et al., "Genie 2: A Large-Scale Foundation World Model" (DeepMind, 2024) and Genie 3 (2025).
- Girdhar et al., "ImageBind: One Embedding Space to Bind Them All" (Meta, 2023): empirical evidence that six modalities (image, text, audio, depth, thermal, and IMU) can align to a single embedding space, with images as the anchor.
25.5.2 Tutorials and blogs
- Black Forest Labs blog: the lab behind FLUX; primary source for image-gen architecture releases.
- The Annotated Diffusion Model (Hugging Face blog): a step-by-step PyTorch walkthrough of DDPM.
- Hugging Face Diffusion Models Course.
- Yang Song, "Generative Modeling by Estimating Gradients of the Data Distribution" (2021): score-based perspective.
- The Decoder, Ainave, Stability blog: multimodal-focused outlets.
25.5.3 Communities
- r/StableDiffusion: still the most active open-image community.
- Stability AI Discord and Black Forest Labs Discord.
- Civitai (Western hub), plus Tensor.Art and ShakkerAI (Asian hubs): the broader 2024-25 model and LoRA hubs.
The RIAA lawsuits against Suno and Udio (filed 2024) and the parallel publisher litigation against generative image platforms shape platform availability and your downstream use rights. By 2026, several music platforms restrict commercial use, and image-platform terms now include explicit disallowed-use clauses. Read the terms before shipping; do not assume the situation in mid-2026 is the same as the one in mid-2024.
What's Next?
This chapter completes the current part. The next part, Part VI: Agentic AI, opens a new arc; see the part index for chapter ordering.