Section 25.5: External Reading & Communities

"Papers are written by labs. Engineering is written on Discord. The textbook lives in the middle and tries not to fall."
Sage, Mid-Stack-Reader AI Agent

Big Picture

Like the text-LLM reading list in Section 14.5, the multimodal-stack reading list splits across foundational papers (the canonical citations every multimodal LLM engineer should know), live venues (CVPR, NeurIPS, ICCV, ICLR, ICML), and community channels (Discord, Twitter/X, Reddit) where the latest model releases get debugged in public.

Prerequisites

This is an end-of-part reading list and assumes familiarity with Part V (Chapters 19 through 25). No new technical prerequisites.

Key Insight

The reading-list strategy from Section 14.5 applies

Like the text-LLM reading list in Section 14.5, the multimodal stack splits across foundational papers, applied tutorials, and online communities. The convention from that section ("skim arXiv for theory, read fast-moving blogs for practice, lurk Discord for releases") carries over with minor adjustments: r/MachineLearning and Hugging Face daily papers cover most arXiv signal; Black Forest Labs / Stability / DeepMind blogs cover applied releases; the LeRobot Discord is the central robotics-LM community as of 2026.

25.5.1 Foundational papers

Ho et al., "Denoising Diffusion Probabilistic Models" (2020): DDPM.
Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2021): the latent diffusion architecture behind Stable Diffusion.
Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (2024): the SD3 paper and the strongest single argument for why DiT replaced UNet across the 2024-25 image / video frontier.
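Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022): the Whisper paper; large-scale weakly supervised ASR.
Copet et al., "Simple and Controllable Music Generation" (2023): the MusicGen paper.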
Polyak et al., "Movie Gen: A Cast of Media Foundation Models" (Meta, 2024): the technical report for Meta's Movie Gen video and audio generation models.
Reed et al., "A Generalist Agent" (DeepMind, 2022): the Gato paper and the foundational citation for the Part VII thesis that one transformer with one loss learns text, image, and action.
Meta, "Chameleon: Mixed-Modal Early-Fusion Foundation Models" (2024): the cleanest single argument for unified-vocabulary, all-modalities-as-tokens architectures.
Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (2023): canonical "actions as tokens" paper.
Poole et al., "DreamFusion: Text-to-3D using 2D Diffusion" (2022): canonical Score Distillation Sampling reference.
OpenAI, "Video Generation Models as World Simulators" (2024): the Sora technical report and the video-as-a-sequence-problem framing.
Ha and Schmidhuber, "World Models" (2018): the foundational world-model paper.
Parker-Holder et al., "Genie 2: A Large-Scale Foundation World Model" (DeepMind, 2024) and Genie 3 (2025).
Girdhar et al., "ImageBind: One Embedding Space to Bind Them All" (Meta, 2023): empirical evidence that six modalities (images, text, audio, depth, thermal, and IMU) can be aligned to a single embedding space.

25.5.2 Tutorials and blogs

Black Forest Labs blog: the lab behind FLUX; primary source for image-gen architecture releases.
The Annotated Diffusion Model: paper-style walkthrough.
Hugging Face Diffusion Models Course.
Yang Song, "Generative Modeling by Estimating Gradients of the Data Distribution" (2021): the score-based perspective on diffusion.
The Decoder, Ainave, Stability blog: multimodal-focused outlets.

25.5.3 Communities

r/StableDiffusion: still the most active open-image community.
Stability AI Discord and Black Forest Labs Discord.
Civitai (Western), plus Tensor.Art and ShakkerAI (Asian): the main 2024-25 hubs for image models and LoRAs.

Warning: AI music regulation moved fast in 2024-25

The record-label lawsuits against Suno and Udio (coordinated by the RIAA, filed 2024) and the parallel publisher litigation against generative image platforms shape platform availability and your downstream use rights. By 2026, several music platforms restrict commercial use, and image-platform terms now include explicit disallowed-use clauses. Read the terms before shipping; do not assume the situation in mid-2026 matches that of mid-2024.

What's Next?

This chapter completes the current part. The next part, Part VI: Agentic AI, opens a new arc; see the part index for chapter ordering.

Further Reading

Vision-Language Platforms and Foundation Models

Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML 2021. arXiv:2103.00020

Liu, H., et al. (2023). "Visual Instruction Tuning (LLaVA)." NeurIPS 2023. arXiv:2304.08485

Wang, P., et al. (2024). "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." Alibaba. arXiv:2409.12191

Beyer, L., et al. (2024). "PaliGemma: A versatile 3B VLM for transfer." Google DeepMind. arXiv:2407.07726

Dataset Hubs and Curation

Schuhmann, C., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS Datasets 2022. arXiv:2210.08402

Gadre, S. Y., et al. (2023). "DataComp: In search of the next generation of multimodal datasets." NeurIPS Datasets 2023. arXiv:2304.14108

Lhoest, Q., et al. (2021). "Datasets: A Community Library for Natural Language Processing." EMNLP 2021 Demonstrations. arXiv:2109.02846

Evaluation Suites

Yue, X., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." CVPR 2024. arXiv:2311.16502

Liu, Y., et al. (2024). "MMBench: Is Your Multi-modal Model an All-around Player?" ECCV 2024. arXiv:2307.06281

Heusel, M., et al. (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)." NeurIPS 2017. arXiv:1706.08500

Open-Source Generative Models

Esser, P., et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3)." ICML 2024. arXiv:2403.03206

Black Forest Labs. (2024). "FLUX.1: Open-source image generation flow models." Black Forest Labs. blackforestlabs.ai/announcing-black-forest-labs

Polyak, A., et al. (2024). "Movie Gen: A Cast of Media Foundation Models." Meta AI. arXiv:2410.13720