Section 25.3: Datasets & Benchmarks

"LAION came first. Then everyone learned to filter LAION. Now we filter the filter, and the cycle continues."
Census, Filter-Genealogist AI Agent

Big Picture

The multimodal-dataset shelf mirrors the text-pretraining-dataset shape from Section 7.4: a noisy web-scale corpus for pretraining (LAION, COYO, WebVid, CC12M) and a curated instruction or evaluation set for downstream work (MM-Vet, MMBench, VBench, MS COCO). This section maps both halves and tells you which dataset is safe to mention out loud in 2026.

Prerequisites

This section assumes the text-pretraining corpus discussion from Section 7.4 and the multimodal-architecture vocabulary from Section 22.1. Data-licensing and watermarking background is covered in detail later in the book.

Key Insight

Multimodal datasets are text datasets with a different alphabet

The structural shape mirrors text pretraining datasets from Section 7.4: a noisy web-scale corpus for pretraining (LAION-5B, WebVid, AudioSet) and curated evaluation sets for benchmarking (COCO, VBench, MMMU, LibriSpeech). The quality, scale, and contamination trade-offs from that section carry over without modification: deduplicate aggressively, filter for license cleanliness, and hold out evaluation sets that the model has not seen during pretraining. The observability and benchmark-saturation patterns from Section 42.2 apply to multimodal benchmarks as well; only the modality changes.
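The deduplication and hold-out advice above can be sketched in a few lines. This is a minimal illustration, not how any particular dataset was built: it assumes each training example is a (URL, caption) tuple, and it matches on exact URLs and normalized captions only. Real pipelines usually add perceptual hashing or embedding-based near-duplicate detection, which this sketch omits. The function names are hypothetical.

```python
import re


def normalize_caption(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants collide."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def dedup_and_decontaminate(train, eval_set):
    """Return train items with exact duplicates and eval overlaps removed.

    Both arguments are iterables of (url, caption) tuples. A train item is
    dropped if its URL or its normalized caption also appears in eval_set
    (an assumed, deliberately conservative overlap rule).
    """
    eval_urls = {url.strip() for url, _ in eval_set}
    eval_captions = {normalize_caption(caption) for _, caption in eval_set}

    seen = set()
    kept = []
    for url, caption in train:
        key = (url.strip(), normalize_caption(caption))
        if key in seen:
            continue  # exact duplicate within the training set
        if key[0] in eval_urls or key[1] in eval_captions:
            continue  # overlaps the held-out evaluation set
        seen.add(key)
        kept.append((url, caption))
    return kept
```

Matching on normalized captions catches re-hosted copies of an evaluation image that keep their original caption, at the cost of occasionally dropping an unrelated training pair with a generic caption.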

Multimodal datasets come in two flavors: the training-scale corpora behind the open models, and the benchmark datasets used for evaluation.

25.3.1 Image training and eval

LAION-5B (LAION, 2022) is the 5.85-billion image-text pair dataset that was the open-source training set for Stable Diffusion, OpenCLIP, and most other early open image models. Its objective is to provide an open replacement for the closed datasets behind DALL-E and Imagen, which mattered for the entire open-source diffusion ecosystem. The core concept is web-crawled (image, alt-text) pairs filtered with CLIP similarity to ensure rough caption-image alignment. LAION-Aesthetics is the filtered "pretty pictures" subset. Pick LAION for academic research; for commercial training, the dataset's CSAM and non-consensual-image controversies require careful filtering or alternative sourcing.
DataComp (LAION + Apple + UW, 2023) is a controlled experimental framework where you fix the training pipeline and compete on data filtering, with the resulting filtered subsets becoming public datasets. Its objective is to make data quality (not model architecture) the variable being optimized, which matters because data filtering is where most modern image-model improvements come from. The core concept is fixed compute + fixed model + variable data, with leaderboard tracking. Pick DataComp's pre-filtered subsets as your default image-text training corpus in 2026.
MS COCO (Microsoft, 2014) is the classic 330K-image dataset with object segmentation, captions, and keypoints, foundational for image captioning and detection. Its objective is to be a richly-annotated reference dataset that supports multiple tasks, which matters because it became the de-facto image-captioning benchmark cited in every paper. Pick COCO for captioning evaluation; it is too small for training modern models and too well-known to test generalization.
Conceptual Captions (Google, 2018; CC3M, CC12M) is a Google-curated image-caption corpus where captions are cleaned alt-text. Its objective is a moderate-scale clean image-caption set for VLM and captioning training, which mattered as an alternative to the noisier LAION pipeline. Pick CC3M / CC12M for cleaner training subsets when LAION's noise is a problem.
DrawBench (Google, 2022) and ImageReward (THU, 2023) are human-preference resources for image generation: the first is a prompt suite for human evaluation, the second a learned reward model. Their objective is to evaluate image-generation quality the way humans do, not via FID or CLIP score, which matters because automated metrics correlate weakly with human preferences. Pick DrawBench prompts for systematic eval; pick ImageReward as a learned auto-evaluator that approximates human preference. Pick-a-Pic v2 (2023-24) is the larger human-preference dataset that has overtaken ImageReward for reward-model training.
GenEval (Ghosh et al., 2023, arXiv:2310.11513) and HEIM (Lee et al., 2023, arXiv:2311.04287): image-gen benchmarks more rigorous than DrawBench, with structured object-counting and attribute-binding evaluation.
CommonCanvas (Gokaslan et al., 2023): diffusion models trained on CommonCatalog, a Creative-Commons-licensed image dataset with synthetic captions; the right pick when license hygiene matters more than scale.
PD12M (Source.Plus, 2024): public-domain 12M image dataset.
HQ-Edit and InstructPix2Pix datasets: image-editing-specific training data; relevant when fine-tuning models for inpainting / contextual edits.
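T2V-CompBench (text-to-video composition benchmark, 2024) and Seed-TTS-Eval (ByteDance, 2024): newer benchmarks for text-to-video composition and text-to-speech respectively; both fall outside the image modality and are listed here for completeness.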
GenAI-Bench (Li et al., 2024): compositional text-to-visual evaluation.
MMMU (Yue et al., 2023, arXiv:2311.16502) is the Massive Multi-discipline Multimodal Understanding benchmark, evaluating college-level subject reasoning over image-and-text questions across art, business, science, health, humanities, and engineering. Its objective is to be the MMLU of multimodal: a hard, broad, multi-domain eval that resists saturation. MMMU-Pro (2024) hardens the benchmark by filtering out questions answerable from text alone, expanding the answer options, and adding a vision-only setting in which the question is embedded in the image. Pick MMMU as the default VLM benchmark in 2026 reports; treat MMMU-Pro as the more robust, shortcut-resistant successor.
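Returning to the CLIP-similarity filter described for LAION-5B at the start of this subsection, the idea can be sketched as a cosine-similarity threshold over precomputed embeddings. This is a minimal sketch under stated assumptions: the embeddings are taken as already computed by some CLIP-style model, the tuple layout and function names are hypothetical, and the default threshold is an illustrative placeholder rather than a recommended value.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# (pair_id, image_embedding, text_embedding); this layout is an assumption.
Pair = Tuple[str, Sequence[float], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(pairs: Iterable[Pair], threshold: float = 0.3) -> List[str]:
    """Keep the ids of pairs whose image and caption embeddings are aligned enough.

    The threshold is a tunable assumption: lower values keep more (noisier)
    pairs, higher values keep fewer (cleaner) pairs.
    """
    return [pid for pid, img, txt in pairs if cosine(img, txt) >= threshold]
```

Sweeping the threshold and retraining on each resulting subset is the kind of data-side experiment that DataComp formalizes.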

25.3.2 Video training and eval

WebVid (Oxford, 2021) is a 10M-clip video-text dataset scraped from stock-video sites, foundational for most early open video models. Its objective was to provide an open video-language corpus when no scaled alternative existed, which mattered for the first generation of video diffusion. The core concept is stock-video metadata as captions (which produces watermarks and biased aesthetics). Pick WebVid for research baselines; modern proprietary video models train on much larger and cleaner curated corpora.
HD-VILA-100M (Microsoft, 2022) is a 100M-clip high-definition video-language corpus scraped from YouTube, the larger and HD-class successor to WebVid. Its objective is to provide HD-quality training data at scale, which matters because low-res training propagates artifacts to generation. Pick HD-VILA-100M for HD video model pretraining; like WebVid, expect license caveats around YouTube terms of service.
VBench (Shanghai AI Lab, 2023) is a comprehensive video generation benchmark scoring outputs along 16 dimensions (subject consistency, motion smoothness, dynamic degree, aesthetic quality, imaging quality, etc.). Its objective is to give video-generation eval the rigor that text generation has, which matters because "the output looks good" is too vague for benchmark comparisons. The core concept is per-dimension metrics scored by specialized classifiers plus an aggregate score. Pick VBench as the production video-eval benchmark.
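The "per-dimension metrics plus an aggregate score" pattern can be illustrated with a weighted mean. The sketch below is not VBench's official aggregation: it assumes each dimension score has already been normalized to the range 0 to 1, and the weights and example values are hypothetical. Only the dimension names come from the list above.

```python
from typing import Dict, Optional


def aggregate_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of per-dimension scores.

    scores  -- dimension name -> score, assumed normalized to [0, 1]
    weights -- dimension name -> non-negative weight; defaults to equal weights
    """
    if not scores:
        raise ValueError("scores must contain at least one dimension")
    if weights is None:
        weights = {dim: 1.0 for dim in scores}
    missing = set(scores) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Hypothetical scores for one model on three of the dimensions.
example = {
    "subject_consistency": 0.9,
    "motion_smoothness": 0.8,
    "aesthetic_quality": 0.6,
}
overall = aggregate_score(example)
```

Reporting the per-dimension scores alongside the aggregate is usually more informative than the single number, since two models with the same aggregate can fail in different dimensions.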

25.3.3 Audio and music training and eval

AudioSet (Google, 2017) is a 2M+ clip audio dataset with ontology-based labels covering 600+ sound classes (speech, music, vehicles, animals, etc.). Its objective is to be the ImageNet of audio: large enough for transfer learning, labeled enough for evaluation. The core concept is 10-second YouTube clips with multi-label classification. Pick AudioSet for general audio classification or sound-event detection; for speech-specific tasks, LibriSpeech is more focused.
LibriSpeech (Johns Hopkins, 2015) is the 1000-hour audiobook-derived ASR corpus that became the standard benchmark for English speech recognition. Its objective is clean read-speech evaluation, which matters as a baseline that every ASR paper reports. The core concept is dev-clean, dev-other, test-clean, test-other splits separating easy and harder conditions. Pick LibriSpeech for ASR evaluation; for production accuracy on real-world speech, supplement with noisier benchmarks like CHiME.
Mozilla Common Voice (Mozilla, 2017+) is the multilingual crowdsourced speech dataset covering 100+ languages, the most-used open multilingual speech corpus. Its objective is to give low-resource languages a publicly-licensed speech corpus, which matters for inclusive speech tech. The core concept is per-language CC0-licensed read-aloud recordings from volunteers. Pick Common Voice for low-resource ASR or multilingual TTS training; for clean English read speech, LibriSpeech is the better fit.
MTG-Jamendo (MTG + Jamendo, 2019) is one of the largest open music datasets distributed under Creative Commons licenses, sourced from the Jamendo independent-artist music platform. Its value for open music models is that the audio has a traceable license, which matters when copyright concerns rule out scraping Spotify. Pick Jamendo for open music-generation training where licensing must be defensible, and check the per-track license terms before any commercial use.
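For the ASR evaluation use mentioned for LibriSpeech above, results are conventionally reported as word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal implementation. It assumes lowercase, whitespace-tokenized text, which is a simplification: published results typically apply a more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" gives one deletion over six reference words, a WER of about 0.167. Note that WER can exceed 1.0 when the hypothesis contains many insertions.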

25.3.4 Comparing the datasets

Table 25.3.1: Multimodal datasets (2026).

Dataset	Modality	Scale	Use
LAION-5B	Image-text	5.85B pairs	Pretraining
MS COCO	Image-caption	~330K images	Captioning eval
WebVid	Video-text	10M clips	Video pretraining
AudioSet	Audio	2M+ clips	Audio classification
LibriSpeech	Speech	1000h	ASR eval

Warning: Licensing and consent

LAION-5B has faced criticism for inclusion of copyrighted and non-consensually-collected images. Use it for research; check enterprise licenses carefully for commercial training. Synthetic-data alternatives (generated by paid models) sidestep the issue but raise their own license-propagation questions.
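One way to make the license check concrete is an allowlist filter that runs before any training job. This is a sketch only: the record fields, the license tag strings, and the allowlist contents are all hypothetical, and real metadata is often missing or inconsistent, so anything unrecognized is rejected rather than assumed safe.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    url: str
    caption: str
    license_tag: str  # free-form tag from source metadata (hypothetical field)


# Illustrative allowlist; the right contents depend on your legal review.
ALLOWED_LICENSES: FrozenSet[str] = frozenset({"cc0", "cc-by", "public-domain"})


def filter_by_license(
    records: Iterable[Record],
    allowed: FrozenSet[str] = ALLOWED_LICENSES,
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, rejected) by license tag.

    Tags are compared case-insensitively after trimming whitespace. Missing or
    unknown tags are rejected, which is the conservative default.
    """
    kept: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        tag = record.license_tag.strip().lower()
        if tag in allowed:
            kept.append(record)
        else:
            rejected.append(record)
    return kept, rejected
```

Keeping the rejected list, rather than silently discarding it, makes it possible to audit how much data the license policy costs and to revisit borderline tags later.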

What's Next?

In the next section, Section 25.4: Models, we build on the material covered here.

Further Reading

Multimodal Datasets

Schuhmann, C., Beaumont, R., Vencu, R., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022. arXiv:2210.08402. The reference large-scale image-text dataset.

Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312. The reference vision-grounding dataset.

Multimodal Benchmarks

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485. Reference vision-instruction model and benchmark.