Section 10.9: Datasets & Benchmarks

"Trillions of tokens go in, sometimes a model comes out, and then we argue for three years about which thousand of those tokens mattered."
Census, Data-Census-Pedant AI Agent

Big Picture

Pretraining corpora and evaluation benchmarks are the two halves of an LLM's empirical identity. This section maps the open corpora used for pretraining (FineWeb, Dolma, RedPajama-V2, The Stack v2) and the benchmarks that gate every release (MMLU, GPQA, Arena-Hard, LM Arena, LiveCodeBench, SimpleQA, Humanity's Last Exam), so that when an LLM paper claims state-of-the-art you know what the benchmark measures and where its known contamination leaks live.

Prerequisites

This section assumes the pretraining objective from Section 6.2 and the deduplication-and-quality-filter pipeline from Section 6.4. The eval-leakage discussion is revisited later in the book.

Part II's datasets split into two roles: pretraining corpora (the trillions of tokens a frontier model ingested) and evaluation benchmarks (the suites researchers run to compare models). The two have very different requirements. Pretraining datasets run from tens of terabytes to petabytes, and you almost never load them locally; benchmarks are measured in megabytes and live in your test loop.

This section catalogues the corpora that frontier 2026 models were trained on (or claim to have been trained on), the public pretraining corpora that academic teams actually use, and the evaluation suites that have replaced GLUE as the de facto comparison.

**Figure 10.9.1:** Pretraining corpora vs evaluation benchmarks, the two halves of an LLM's empirical identity. The corpora column lists the canonical 2024 to 2026 open releases used in academic and industrial pretraining: FineWeb (15T deduped Common Crawl), FineWeb-Edu (1.3T educational filter), Dolma (3T, behind OLMo), RedPajama-V2 (30T web), The Stack v2 (67 TB code), Nemotron-CC (6.3T NVIDIA filter), and SmolLM3 plus FineWeb 2 for multilingual work. The benchmark column lists the suites that gate every release: MMLU (saturated by 2024), GPQA, Arena-Hard, LM Arena human Elo, LiveCodeBench, the AIME / MATH-500 / GPQA-D reasoning cluster, and HLE plus SimpleQA at the frontier.

10.9.1 Pretraining corpora

Fun Fact

MMLU was introduced in 2020 as a 57-subject test that no model could pass; by 2024 several frontier models scored above the human expert baseline. The community's response was not to celebrate but to immediately build MMLU-Pro, GPQA-Diamond, and harder successors, on the working assumption that any benchmark a model saturates is a benchmark that has already been memorized.

FineWeb: Hugging Face's 15-trillion-token deduplicated Common Crawl extract. The canonical open pretraining corpus of 2024-26. Ships with 10BT, 100BT, and 350BT-token samples for academic-scale work.
FineWeb-Edu: educational-content filter of FineWeb. Smaller (1.3T tokens) but consistently outperforms FineWeb at equal compute budgets on knowledge and reasoning benchmarks such as MMLU and ARC.
Dolma (AllenAI): 3 trillion tokens; the corpus behind OLMo. Fully documented data sheet; the cleanest replication target.
RedPajama-V2: 30 trillion tokens of deduplicated web data. Used in training the RedPajama and Snowflake Arctic model families.
Common Crawl: the raw substrate of all the above. ~250B web pages, monthly snapshots since 2008. Almost never used directly; you want a curated derivative.
The Stack v2 (BigCode): 67 TB of permissively-licensed source code; the canonical code-pretraining corpus.
SmolLM3 corpus (Hugging Face, 2025) and FineWeb 2 (2024-12, multilingual): the canonical post-FineWeb releases for academic small-model and multilingual pretraining.
Nemotron-CC (NVIDIA, 2024): a 6.3T-token filtered Common Crawl extract that competes with FineWeb on small-model pretraining quality.
DataComp-LM (Li et al., 2024): the methodology paper behind modern corpus curation, with public DCLM-Baseline as a reproducible recipe.
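
Because corpora at this scale are impractical to download for exploratory work, a common pattern is to stream a small slice and inspect it. The following is a minimal sketch, assuming the Hugging Face `datasets` library and the `HuggingFaceFW/fineweb` repository with its `sample-10BT` configuration. The helper names `open_fineweb_stream` and `peek` are hypothetical, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def open_fineweb_stream() -> Iterator[dict]:
    """Lazily iterate over FineWeb's 10BT sample without downloading it.

    Assumes the `datasets` package is installed and the Hub is reachable.
    """
    from datasets import load_dataset  # imported here so the rest runs offline

    ds = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,
    )
    return iter(ds)


def peek(stream: Iterable[dict], n: int = 5, text_key: str = "text") -> List[Dict]:
    """Take the first n records and attach rough size statistics."""
    rows = []
    for record in islice(stream, n):
        text = record[text_key]
        rows.append(
            {"chars": len(text), "words": len(text.split()), "preview": text[:80]}
        )
    return rows


if __name__ == "__main__":
    # Offline stand-in so the sketch runs without network access.
    fake_stream = [
        {"text": "Common Crawl is the raw substrate."},
        {"text": "Curated derivatives filter and deduplicate it."},
    ]
    for row in peek(fake_stream, n=2):
        print(row)
    # With network access: peek(open_fineweb_stream(), n=5)
```

Streaming keeps memory use proportional to the slice you inspect rather than to the corpus, which is usually what you want before committing cluster storage to a full copy.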

10.9.2 Evaluation benchmarks

MMLU: 57-subject multiple-choice exam, ~14k questions. The most-cited 2020-2024 LM benchmark; saturated by 2024 frontier models (~90% accuracy).
GPQA: 448 graduate-level science questions, Google-proof. The successor to MMLU at higher difficulty.
Arena-Hard: 500 high-difficulty open-ended prompts scored by LLM-as-judge. Correlates strongly with LM Arena Elo.
LM Arena: blind pairwise human comparison; updated weekly. The de facto ranking of chat models.
lm-evaluation-harness: EleutherAI's standardized harness; runs MMLU, GPQA, ARC, HellaSwag, TruthfulQA, and ~150 other benchmarks with consistent prompting.
OpenAI simple-evals: OpenAI's lightweight harness, used in GPT model cards.
Humanity's Last Exam (Phan et al. 2025): 2,500 multimodal expert questions; the current frontier-model benchmark.
LiveCodeBench (Jain et al., 2024): contamination-resistant code benchmark with monthly-refreshed problems; the right code companion to SWE-bench.
ZeroEval (Lin, 2024): a unified zero-shot evaluation framework that paper authors increasingly use as a sanity check before reporting MMLU.
Reasoning benchmark suite: AIME 2024 / 2025, MATH-500, Codeforces Elo, and GPQA-Diamond together form the single most-cited benchmark cluster of 2025, particularly for reasoning-fine-tuned models like DeepSeek-R1 and o3.
SimpleQA (OpenAI, 2024): the canonical "how often does the model hallucinate facts" benchmark of 2025; ~4,000 short factual questions with verified answers.
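
The question counts in this list determine how small a score difference each benchmark can resolve. The following is a rough sketch, assuming every question is an independent Bernoulli trial (a simplification, since questions within a subject are correlated) and using the approximate sizes quoted above. The function name `accuracy_margin` and the example accuracies are illustrative, not measured results.

```python
import math


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval, in accuracy points."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return 100.0 * z * standard_error


# Approximate sizes from the list above; the accuracies are hypothetical.
benchmarks = {
    "MMLU": (0.90, 14_000),
    "GPQA": (0.50, 448),
    "Humanity's Last Exam": (0.20, 2_500),
    "SimpleQA": (0.40, 4_000),
}

for name, (acc, n) in benchmarks.items():
    margin = accuracy_margin(acc, n)
    print(f"{name:22s} n={n:6d}  acc={acc:.2f}  +/- {margin:.1f} points")
```

Under these assumptions the margin on a roughly 14k-question suite is about half a point, while on a 448-question suite it is more than four points, so the same 2-point gain can be meaningful on one benchmark and noise on another.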

10.9.3 Comparing the pretraining corpora

Table 10.9.1a: Pretraining corpora used in 2026.

Corpus	Size	Source	Best for
FineWeb	15T tokens	Common Crawl	Reproducing frontier pretraining
FineWeb-Edu	1.3T tokens	FineWeb filter	Academic-scale pretraining
Dolma	3T tokens	Mixed sources	Reproducible recipe
RedPajama-V2	30T tokens	Common Crawl	Multilingual experiments
The Stack v2	67 TB code	Software Heritage archive (permissively licensed)	Code-pretraining

Warning: benchmark contamination is universal in 2026

MMLU, GPQA, HumanEval, ARC, and HellaSwag all predate the current frontier training cycles, so their questions have had years to spread across the public web, and any frontier model trained on a recent crawl has plausibly seen most of them. The benchmarks where contamination is least likely are post-2024 releases such as Humanity's Last Exam and ARC-AGI-2, plus the held-out splits of Arena-Hard. Treat single-number MMLU comparisons as a rough sanity check, not a ranking.
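
One common heuristic for spotting this kind of leakage is to look for long word n-grams shared between a benchmark item and training documents. The following is a minimal sketch, assuming lowercase normalization, punctuation stripping, and a window of 13 words. The function names and example strings are illustrative, and a shared n-gram is evidence of overlap, not proof that a model memorized the item.

```python
import re
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase, drop punctuation, and return the set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(item: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the item's n-grams that appear in any corpus document."""
    item_grams = word_ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: no evidence either way
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= item_grams & word_ngrams(doc, n)
    return len(seen) / len(item_grams)


# Invented strings for illustration only.
question = (
    "Which of the following best describes the role of the mitochondria "
    "in a eukaryotic cell during aerobic respiration?"
)
leaky_doc = "Quiz dump: " + question + " Answer: B"
clean_doc = "Mitochondria produce most of the cell's ATP through oxidative phosphorylation."

print(overlap_fraction(question, [clean_doc]))  # no shared 13-grams
print(overlap_fraction(question, [leaky_doc]))  # every 13-gram is shared
```

Exact n-gram matching misses paraphrased or translated copies, so a clean result from a check like this lowers suspicion without ruling contamination out.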

Tip: prefer LM Arena over any single-number benchmark

LM Arena's human-judged pairwise comparisons are the noisiest benchmark to read but the hardest to contaminate. When a 2026 paper reports a 2-point MMLU bump, check the LM Arena Elo: if it did not move, the MMLU bump is probably training-data spillover.
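
To make the Elo figure concrete, pairwise votes can be turned into ratings with the classic online Elo update. The following is a minimal sketch, assuming a starting rating of 1000 and a K-factor of 32 (both arbitrary choices here). The model names and votes are invented, and this is not LM Arena's own rating code, which may use a different estimator.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(
    votes: Iterable[Tuple[str, str, float]],
    k: float = 32.0,
    initial: float = 1000.0,
) -> Dict[str, float]:
    """Apply online Elo updates in vote order.

    Each vote is (model_a, model_b, score_a), where score_a is 1.0 if A wins,
    0.0 if B wins, and 0.5 for a tie.
    """
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for a, b, score_a in votes:
        exp_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - exp_a)
        ratings[a] += delta  # winner gains what the loser gives up
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical votes between two hypothetical models.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-x", 0.5),
]
for name, rating in sorted(elo_from_votes(votes).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which votes arrive, which is one reason the ratings are noisy to read; fitting all votes at once with a Bradley-Terry style model is a common order-independent alternative.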

Real-World Scenario: What a real corpus pipeline costs

The DeepSeek-V3 technical report (Dec 2024) disclosed a total training cost of approximately $5.6M (pretraining, context extension, and post-training) on 2048 H800s, with pretraining on 14.8T tokens of curated corpus. The corpus pipeline itself (deduplication, language ID, quality filtering) takes weeks of preprocessing before the first GPU sees a token. This numeric anchor is useful when reading "this lab pretrained from scratch" claims: the data pipeline can cost as much wall-clock time as the GPU run, or more.
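
The cost figure can be reproduced from two numbers in the same report: roughly 2.788M H800 GPU-hours in total and an assumed rental price of $2 per GPU-hour. The following is a back-of-the-envelope sketch that treats those two values as quoted from the report rather than from this section and assumes all 2048 GPUs run continuously. The variable names are illustrative.

```python
# Figures quoted from the DeepSeek-V3 technical report (not derived here).
GPU_HOURS_TOTAL = 2.788e6    # total H800 GPU-hours across all training stages
PRICE_PER_GPU_HOUR = 2.00    # USD, the rental price the report assumes
NUM_GPUS = 2048
TOKENS = 14.8e12             # pretraining tokens

cost_usd = GPU_HOURS_TOTAL * PRICE_PER_GPU_HOUR
# Idealized wall clock: every GPU busy for the whole run, no restarts.
wall_clock_days = GPU_HOURS_TOTAL / NUM_GPUS / 24
cost_per_billion_tokens = cost_usd / (TOKENS / 1e9)

print(f"cost:        ${cost_usd / 1e6:.2f}M")
print(f"wall clock:  {wall_clock_days:.0f} days on {NUM_GPUS} GPUs")
print(f"per 1B tok:  ${cost_per_billion_tokens:.0f}")
```

The result is a little under two months of cluster time, which puts the weeks of preprocessing in proportion. The report also notes that the figure covers only the final training run and excludes prior research and ablation experiments.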

10.9.4 Mech-interp inspection datasets

Mech-interp work in Chapter 10 uses smaller, purpose-built datasets: Anthropic's feature exploration sets, auto-circuit's IOI prompts, and the activation-patching benchmark Easy-Transformer ships. These datasets are tiny (hundreds to thousands of prompts) and are not for measuring model capability; they are for measuring whether a specific circuit you found is the circuit the model uses.
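
To give a sense of scale, an indirect-object-identification (IOI) style prompt set can be generated from one template and a handful of names. The following is a toy sketch, assuming a single hand-written template. The names, places, and objects are invented for illustration, and this is not the prompt set that auto-circuit or any other library ships.

```python
from itertools import permutations
from typing import Dict, List

TEMPLATE = "When {a} and {b} went to the {place}, {giver} gave a {obj} to"


def make_ioi_prompts(
    names: List[str], places: List[str], objects: List[str]
) -> List[Dict[str, str]]:
    """Build clean/corrupted prompt pairs for activation patching.

    clean:     the giver repeats name b, so the expected completion is a
    corrupted: the giver is a third name c, removing the repeated-name cue
    """
    examples = []
    for a, b, c in permutations(names, 3):
        for place in places:
            for obj in objects:
                examples.append(
                    {
                        "clean": TEMPLATE.format(a=a, b=b, place=place, giver=b, obj=obj),
                        "corrupted": TEMPLATE.format(a=a, b=b, place=place, giver=c, obj=obj),
                        "answer": " " + a,
                        "distractor": " " + b,
                    }
                )
    return examples


prompts = make_ioi_prompts(
    ["Mary", "John", "Alice", "Tom"], ["store", "park"], ["drink", "book"]
)
print(len(prompts))
print(prompts[0]["clean"])
print(prompts[0]["corrupted"])
```

Four names, two places, and two objects already yield 96 prompt pairs, which is the right order of magnitude for a circuit-verification set and several orders smaller than any capability benchmark.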

Key Takeaways

Pretraining corpora and evaluation benchmarks serve opposite roles: corpora live on a cluster filesystem at terabyte-to-petabyte scale, benchmarks live in megabytes in your test loop, and confusing the two leads to expensive mistakes.
FineWeb and FineWeb-Edu are the 2026 default corpora: 15T deduplicated Common Crawl tokens plus a 1.3T educational-content filter that consistently outperforms FineWeb at equal compute, with Dolma, RedPajama-V2, and Nemotron-CC as alternates.
The Stack v2 and FineWeb 2 fill the specialty roles: 67 TB of permissively-licensed code drives code pretraining, and FineWeb 2 anchors multilingual work, with DCLM-Baseline as the reproducible curation recipe.
The benchmark roster has moved past GLUE-era single numbers: MMLU-Pro, GPQA, AIME, MATH-500, Codeforces, LiveCodeBench, Arena-Hard, and SimpleQA together form the 2025-26 evaluation cluster, with LM Arena pairwise Elo as the human-judged anchor.
Contamination is pervasive on pre-2024 benchmarks: frontier models have plausibly seen MMLU, GPQA, HumanEval, ARC, and HellaSwag, so single-number comparisons are sanity checks, and Humanity's Last Exam, ARC-AGI-2, and LM Arena are the more trustworthy frontier rankings.
Mech-interp datasets are tiny and purpose-built: Anthropic feature sets, auto-circuit IOI prompts, and Easy-Transformer benchmarks measure whether a discovered circuit is the model's actual circuit, not capability ranking.

What's Next?

In the next section, Section 10.10: Models, we build on the material covered here.

Further Reading

Datasets

Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR 2021. arXiv:2009.03300. The standard general-knowledge LLM benchmark.

Gao, L., Biderman, S., Black, S., et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027. The canonical open pretraining corpus that powers Pythia and most interpretability checkpoints; essential context for comparing model behavior across data slices.

Penedo, G., Malartic, Q., Hesslow, D., et al. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Alone." NeurIPS 2023. arXiv:2306.01116. Shows that aggressively deduplicated and filtered Common Crawl can match curated mixes; the reference for modern web-pretraining data.

Suzgun, M., Scales, N., Schärli, N., et al. (2022). "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." arXiv:2210.09261. BIG-Bench Hard, the standard reasoning benchmark suite paired with MMLU in capability evaluations.

Interpretability Models and Datasets

Biderman, S., Schoelkopf, H., Anthony, Q. G., et al. (2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." ICML 2023. arXiv:2304.01373. The Pythia model suite that powers most interpretability research; intermediate checkpoints enable training-dynamics studies.

Marks, S., & Tegmark, M. (2023). "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets." arXiv:2310.06824. Reference truth-probe dataset used in interpretability work.