External Reading & Communities

Section 25.5

"Papers are written by labs. Engineering is written on Discord. The textbook lives in the middle and tries not to fall."

Sage, Mid-Stack-Reader AI Agent
Big Picture

Like the text-LLM reading list in Section 14.5, the multimodal-stack reading list splits across foundational papers (the canonical citations every multimodal LLM engineer should know), live venues (CVPR, NeurIPS, ICCV, ICLR, ICML), and community channels (Discord, Twitter/X, Reddit) where the latest model releases get debugged in public.

Prerequisites

This is an end-of-part reading list and assumes familiarity with Part V (Chapters 19 through 25). It introduces no new technical prerequisites.

Key Insight
The reading-list strategy from Section 14.5 applies

The multimodal stack uses the same three-way split as the text-LLM reading list: foundational papers, applied tutorials, and online communities. The convention from Section 14.5 ("skim arXiv for theory, read fast-moving blogs for practice, lurk Discord for releases") carries over with minor adjustments: r/MachineLearning and Hugging Face Daily Papers cover most of the arXiv signal; the Black Forest Labs, Stability AI, and Google DeepMind blogs cover applied releases; and the LeRobot Discord is the central robotics-LM community as of 2026.

25.5.1 Foundational papers

25.5.2 Tutorials and blogs

25.5.3 Communities

Warning: AI music regulation moved fast in 2024-25

The record-label lawsuits against Suno and Udio (coordinated by the RIAA and filed in June 2024) and the parallel rights-holder litigation against generative image platforms shape both platform availability and your downstream use rights. By 2026, several music platforms restrict commercial use, and image-platform terms now include explicit disallowed-use clauses. Read the terms before shipping, and do not assume that the situation in mid-2026 matches the one in mid-2024.

What's Next?

This chapter completes the current part. The next part, Part VI: Agentic AI, opens a new arc; see the part index for chapter ordering.

Further Reading

Vision-Language Platforms and Foundation Models

Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML 2021. arXiv:2103.00020
Liu, H., et al. (2023). "Visual Instruction Tuning (LLaVA)." NeurIPS 2023. arXiv:2304.08485
Wang, P., et al. (2024). "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." Alibaba. arXiv:2409.12191
Beyer, L., et al. (2024). "PaliGemma: A versatile 3B VLM for transfer." Google DeepMind. arXiv:2407.07726

Dataset Hubs and Curation

Schuhmann, C., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022 Datasets and Benchmarks Track. arXiv:2210.08402
Gadre, S. Y., et al. (2023). "DataComp: In search of the next generation of multimodal datasets." NeurIPS 2023 Datasets and Benchmarks Track. arXiv:2304.14108
Lhoest, Q., et al. (2021). "Datasets: A Community Library for Natural Language Processing." EMNLP 2021 Demonstrations. arXiv:2109.02846

Evaluation Suites

Yue, X., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." CVPR 2024. arXiv:2311.16502
Liu, Y., et al. (2024). "MMBench: Is Your Multi-modal Model an All-around Player?" ECCV 2024. arXiv:2307.06281
Heusel, M., et al. (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)." NeurIPS 2017. arXiv:1706.08500

Open-Source Generative Models

Esser, P., et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3)." ICML 2024. arXiv:2403.03206
Black Forest Labs (2024). "FLUX.1: Open-source image generation flow models." Company announcement. blackforestlabs.ai/announcing-black-forest-labs
Polyak, A., et al. (2024). "Movie Gen: A Cast of Media Foundation Models." Meta AI. arXiv:2410.13720