Datasets & Benchmarks

Section 25.3

"LAION came first. Then everyone learned to filter LAION. Now we filter the filter, and the cycle continues."

Census, Filter-Genealogist AI Agent

Big Picture

The multimodal-dataset shelf mirrors the text-pretraining-dataset shape from Section 7.4: a noisy web-scale corpus for pretraining (LAION, COYO, WebVid, CC12M) and a curated instruction or eval set for downstream work (MM-Vet, MMBench, VBench, MS COCO). This section maps both halves and tells you which dataset is safe to mention out loud in 2026.

Prerequisites

This section assumes the text-pretraining corpus discussion from Section 10.9 and the multimodal-architecture vocabulary from Section 22.1. Data-licensing and watermarking background is covered in detail later in the book.

Key Insight
Multimodal datasets are text datasets with a different alphabet

The structural shape mirrors text pretraining datasets from Section 7.4: a noisy web-scale corpus for pretraining (LAION-5B, WebVid, AudioSet) and curated evaluation sets for benchmarking (MS COCO, VBench, MMMU, LibriSpeech). The quality / scale / contamination trade-offs from that section carry over without modification: deduplicate aggressively, filter for license cleanliness, and hold out evaluation sets so that the model never sees them during pretraining. The same observability and benchmark-saturation patterns from Section 42.2 apply to multimodal benchmarks as well; only the modality changes.
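The deduplicate-and-hold-out advice can be sketched in a few lines. This is a minimal illustration, not how any particular dataset was built: it assumes samples arrive as (URL, caption) pairs, it only catches exact matches after light caption normalization, and the function names and example URLs are hypothetical. Production pipelines typically add near-duplicate detection on the images themselves, which this sketch does not attempt.

```python
import hashlib
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (image URL, caption); an assumed record shape


def normalize_caption(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def pair_key(url: str, caption: str) -> str:
    """Stable fingerprint for one (url, caption) pair."""
    payload = url.strip() + "\n" + normalize_caption(caption)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedup_and_decontaminate(train: Iterable[Pair],
                            eval_urls: Iterable[str]) -> List[Pair]:
    """Drop exact duplicates and any sample whose URL is in the eval set."""
    held_out = {u.strip() for u in eval_urls}
    seen = set()
    kept = []
    for url, caption in train:
        if url.strip() in held_out:
            continue  # contamination: this image belongs to a benchmark
        key = pair_key(url, caption)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append((url, caption))
    return kept


# Hypothetical toy data for illustration only.
train = [
    ("https://example.com/a.jpg", "A dog on a beach"),
    ("https://example.com/a.jpg", "a dog  on a beach"),
    ("https://example.com/b.jpg", "Two cats"),
    ("https://example.com/eval.jpg", "A benchmark image"),
]
eval_urls = ["https://example.com/eval.jpg"]
clean = dedup_and_decontaminate(train, eval_urls)
```

Matching on URL alone is a deliberately weak contamination check: the same benchmark image re-hosted under another URL would slip through, which is one reason content-based fingerprints are usually preferred in practice.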

Multimodal datasets come in two flavors: the training-scale corpora behind the open models, and the benchmark datasets used for evaluation.

25.3.1 Image training and eval

25.3.2 Video training and eval

25.3.3 Audio and music training and eval

25.3.4 Comparing the datasets

Table 25.3.1: Multimodal datasets (2026).

Dataset     | Modality      | Scale        | Use
LAION-5B    | Image-text    | 5.85B pairs  | Pretraining
MS COCO     | Image-caption | ~330K images | Captioning eval
WebVid      | Video-text    | 10M clips    | Video pretraining
AudioSet    | Audio         | 2M+ clips    | Audio classification
LibriSpeech | Speech        | ~1,000 hours | ASR eval

Warning: Licensing and consent

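LAION-5B has been criticized for including copyrighted images and images collected without their subjects' consent. The dataset distributes URL-caption pairs rather than the images themselves, so each image remains under its original terms. Use it for research; review licensing terms carefully before any commercial training. Synthetic-data alternatives (generated by paid commercial models) sidestep the issue but raise their own license-propagation questions.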
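One way to make "filter for license cleanliness" concrete is an allowlist over per-sample license metadata. The sketch below is illustrative only: the `Sample` record, its `license` field, and the allowlist contents are assumptions, not the schema of LAION-5B or any other dataset, and which licenses are acceptable for a given use is a legal decision rather than a coding one.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple

# Hypothetical allowlist; the real policy must come from legal review.
ALLOWED_LICENSES: FrozenSet[str] = frozenset({"cc0-1.0", "cc-by-4.0"})


@dataclass(frozen=True)
class Sample:
    url: str
    caption: str
    license: Optional[str]  # None when the crawl recorded no license


def is_allowed(sample: Sample,
               allowed: FrozenSet[str] = ALLOWED_LICENSES) -> bool:
    """Treat a missing license as not allowed (fail closed)."""
    if sample.license is None:
        return False
    return sample.license.strip().lower() in allowed


def split_by_license(samples: Iterable[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, rejected) so rejected samples can be audited later."""
    kept: List[Sample] = []
    rejected: List[Sample] = []
    for s in samples:
        (kept if is_allowed(s) else rejected).append(s)
    return kept, rejected


# Hypothetical toy records for illustration only.
samples = [
    Sample("https://example.com/1.jpg", "a bridge", "CC-BY-4.0"),
    Sample("https://example.com/2.jpg", "a portrait", None),
    Sample("https://example.com/3.jpg", "a poster", "all-rights-reserved"),
]
kept, rejected = split_by_license(samples)
```

Failing closed on unknown licenses discards a lot of web data, since most crawled images carry no machine-readable license at all; that cost is the trade-off the warning above describes.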

What's Next?

In the next section, Section 25.4: Models, we build on the material covered here.

Further Reading

Multimodal Datasets

Schuhmann, C., Beaumont, R., Vencu, R., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022. arXiv:2210.08402. The reference large-scale image-text dataset.
Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312. The reference vision-grounding dataset.

Multimodal Benchmarks

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485. Reference vision-instruction model and benchmark.