Datasets & Benchmarks

Section 10.9

"Trillions of tokens go in, sometimes a model comes out, and then we argue for three years about which thousand of those tokens mattered."

Census, Data-Census-Pedant AI Agent
Big Picture

Pretraining corpora and evaluation benchmarks are the two halves of an LLM's empirical identity. This section maps the corpora that frontier labs use (RedPajama, FineWeb, Dolma, The Stack) and the benchmarks that gate every release (MMLU, MMLU-Pro, GSM8K, HumanEval, MATH, HellaSwag, ARC, BBH, IFEval, AGIEval), so when an LLM paper claims state-of-the-art you know what that benchmark measures and where its known contamination leaks live.

Prerequisites

This section assumes the pretraining objective from Section 6.2 and the deduplication-and-quality-filter pipeline from Section 6.4. The eval-leakage discussion is revisited later in the book.

Part II's datasets split into two roles: pretraining corpora (the trillions of tokens a frontier model ingested) and evaluation benchmarks (the suites researchers run to compare models). The two have very different requirements. Pretraining corpora are measured in terabytes to petabytes, and you almost never load them locally; benchmarks are measured in megabytes and live in your test loop.
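The "never load them locally" point can be sketched with streaming access. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that FineWeb is hosted under the repository id `HuggingFaceFW/fineweb` with a `sample-10BT` config; both identifiers are assumptions about the hosting, not something this section specifies. The helper names `peek` and `stream_fineweb_sample` are illustrative.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def peek(records: Iterable[dict], n: int) -> List[dict]:
    """Materialize only the first n records of a (possibly huge) iterable."""
    return list(islice(records, n))


def stream_fineweb_sample() -> Iterator[dict]:
    """Lazily iterate over a pretraining corpus without downloading it in full.

    Assumption: the repo id and config name below match the Hub listing.
    """
    # Imported lazily so the helper above stays usable without the library.
    from datasets import load_dataset

    dataset = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # yield records on demand instead of caching to disk
    )
    return iter(dataset)
```

Calling `peek(stream_fineweb_sample(), 3)` would pull only the first three documents over the network, whereas a benchmark of a few megabytes can simply be loaded whole inside the test loop.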

This section catalogues the corpora that frontier 2026 models were trained on (or claim to have been trained on), the public pretraining corpora that academic teams actually use, and the evaluation suites that have replaced GLUE as the de facto comparison.

Figure 10.9.1: Pretraining corpora vs evaluation benchmarks, the two halves of an LLM's empirical identity. The corpora column lists 2024 to 2026 canonical open releases used in academic and industrial pretraining: FineWeb (15T deduped Common Crawl), FineWeb-Edu (1.3T educational filter), Dolma (3T, behind OLMo), RedPajama-V2 (30T web), The Stack v2 (67 TB code), Nemotron-CC (6.3T NVIDIA filter), and SmolLM3 plus FineWeb 2 for multilingual work. The benchmark column lists the suites that gate every release: MMLU saturated by 2024, GPQA, Arena-Hard, LM Arena human Elo, LiveCodeBench, the AIME / MATH-500 / GPQA-D reasoning cluster, and HLE plus SimpleQA at the frontier.

10.9.1 Pretraining corpora

Fun Fact

MMLU was introduced in 2020 as a 57-subject test that no model could pass; by 2024 several frontier models scored above the human expert baseline. The community's response was not to celebrate but to immediately build MMLU-Pro, GPQA-Diamond, and harder successors, on the working assumption that any benchmark a model saturates is a benchmark that has already been memorized.

10.9.2 Evaluation benchmarks

10.9.3 Comparing the pretraining corpora

Table 10.9.1a: Pretraining corpora used in 2026.
| Corpus | Size | Source | Best for |
|---|---|---|---|
| FineWeb | 15T tokens | Common Crawl | Reproducing frontier pretraining |
| FineWeb-Edu | 1.3T tokens | FineWeb filter | Academic-scale pretraining |
| Dolma | 3T tokens | Mixed sources | Reproducible recipe |
| RedPajama-V2 | 30T tokens | Common Crawl | Multilingual experiments |
| The Stack v2 | 67 TB code | Permissively licensed GitHub | Code pretraining |

Warning: benchmark contamination is universal in 2026

MMLU, GPQA, HumanEval, ARC, HellaSwag: every widely used public benchmark predates the current frontier training cycles, which means every frontier model has very likely seen most of them. The only benchmarks where contamination can still plausibly be ruled out are post-2024 benchmarks like Humanity's Last Exam, ARC-AGI-2, and the held-out splits of Arena-Hard. Treat single-number MMLU comparisons as a rough sanity check, not a ranking.
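One simple way to probe for this kind of leakage is word-level n-gram overlap between a benchmark item and the documents in a corpus sample. The following is a minimal sketch, not any lab's actual decontamination pipeline: the function names are illustrative, and the default window of 13 words is an assumption (a size that appears in some published decontamination reports).

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high fraction flags verbatim leakage. A low fraction does not prove a benchmark is clean, because paraphrased or translated copies share few exact n-grams.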

Tip: prefer LM Arena over any single-number benchmark

LM Arena's human-judged pairwise comparisons are the noisiest benchmark to read but the hardest to contaminate. When a 2026 paper reports a 2-point MMLU bump, check the LM Arena Elo: if it did not move, the MMLU bump is probably training-data spillover.
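To see how pairwise human votes turn into a rating, here is a minimal sketch of the classic online Elo update. It is a simplification for intuition only: the leaderboard's actual rating procedure may differ, and the constants `k=32`, the starting rating of 1000, and the function names are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Tuple[str, str]],
    k: float = 32.0,
    initial: float = 1000.0,
) -> Dict[str, float]:
    """Apply one Elo update per (winner, loser) battle and return new ratings."""
    updated = dict(ratings)  # leave the caller's dictionary untouched
    for winner, loser in battles:
        r_win = updated.get(winner, initial)
        r_lose = updated.get(loser, initial)
        # The winner gains more when the win was unexpected.
        gain = k * (1.0 - expected_score(r_win, r_lose))
        updated[winner] = r_win + gain
        updated[loser] = r_lose - gain
    return updated
```

Because each vote moves a rating by at most `k` points, a handful of battles says little; that is the sense in which the signal is noisy to read but hard to game by memorizing a fixed test set.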

Real-World Scenario: What a real corpus pipeline costs

The DeepSeek-V3 technical report (Dec 2024) disclosed a training cost of approximately $5.6M for the V3 base model on 2048 H800s, training on 14.8T tokens of curated corpus. The corpus pipeline itself (deduplication, language ID, quality filtering) takes weeks of preprocessing before the first GPU sees a token. This numeric anchor is useful when reading "this lab pretrained from scratch" claims: the data pipeline often costs more wall-clock time than the GPU run.
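The headline figure can be unpacked with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the GPU count and token count come from the paragraph above, while the GPU-hour total (about 2.788M H800 hours) and the $2 per GPU-hour rental rate are taken from the report itself rather than from this section and should be checked against it. The function name is illustrative.

```python
from typing import Dict


def training_run_summary(
    gpu_hours: float = 2.788e6,   # assumption: total H800 GPU-hours from the report
    usd_per_gpu_hour: float = 2.0,  # assumption: rental rate used in the report
    num_gpus: int = 2048,
    tokens: float = 14.8e12,
) -> Dict[str, float]:
    """Derive cost, wall-clock time, and throughput from the disclosed totals."""
    return {
        "cost_usd": gpu_hours * usd_per_gpu_hour,
        # Lower bound on elapsed time: assumes all GPUs are busy the whole run.
        "wall_clock_days": gpu_hours / num_gpus / 24.0,
        # Slight underestimate if the GPU-hour total includes non-pretraining stages.
        "tokens_per_gpu_hour": tokens / gpu_hours,
    }


summary = training_run_summary()
print(f"cost: ${summary['cost_usd'] / 1e6:.3f}M")
print(f"wall clock: {summary['wall_clock_days']:.1f} days")
print(f"throughput: {summary['tokens_per_gpu_hour'] / 1e6:.2f}M tokens per GPU-hour")
```

Under these assumptions the GPU run occupies roughly eight weeks of wall-clock time, which is the scale against which the "weeks of preprocessing" claim should be read.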

10.9.4 Mech-interp inspection datasets

Mech-interp work in Chapter 10 uses smaller, purpose-built datasets: Anthropic's feature exploration sets, auto-circuit's IOI prompts, and the activation-patching benchmark Easy-Transformer ships. These datasets are tiny (hundreds to thousands of prompts) and are not for measuring model capability; they are for measuring whether a specific circuit you found is the circuit the model uses.
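To give a sense of how small and templated such datasets are, here is a minimal sketch that generates indirect-object-identification (IOI) style prompts. The template paraphrases the style of IOI prompts from the interpretability literature; it is not the dataset that auto-circuit or any other library ships, and the names `TEMPLATE` and `make_ioi_prompts` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence

# Illustrative template: name `b` is repeated (the subject), name `a` appears once.
TEMPLATE = "When {a} and {b} went to the store, {b} gave a drink to"


def make_ioi_prompts(names: Sequence[str]) -> List[Dict[str, str]]:
    """Build one prompt per ordered pair of distinct names."""
    prompts = []
    for a, b in permutations(names, 2):
        prompts.append({
            "prompt": TEMPLATE.format(a=a, b=b),
            "correct": a,     # indirect object: the completion the model should prefer
            "distractor": b,  # repeated subject: the tempting wrong completion
        })
    return prompts


for row in make_ioi_prompts(["Mary", "John", "Alice"])[:2]:
    print(row["prompt"], "->", row["correct"])
```

A circuit hypothesis is then tested by comparing the model's preference for `correct` over `distractor` with and without the candidate components intact, which is a question about mechanism rather than capability.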

Key Takeaways

What's Next?

In the next section, Section 10.10: Models, we build on the material covered here.

Further Reading

Datasets

Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR 2021. arXiv:2009.03300. The standard general-knowledge LLM benchmark.
Gao, L., Biderman, S., Black, S., et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027. The canonical open pretraining corpus that powers Pythia and most interpretability checkpoints; essential context for comparing model behavior across data slices.
Penedo, G., Malartic, Q., Hesslow, D., et al. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Alone." NeurIPS 2023. arXiv:2306.01116. Shows that aggressively deduplicated and filtered Common Crawl can match curated mixes; the reference for modern web-pretraining data.
Suzgun, M., Scales, N., Schaerli, N., et al. (2022). "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." arXiv:2210.09261. BIG-Bench Hard, the standard reasoning benchmark suite paired with MMLU in capability evaluations.

Interpretability Models and Datasets

Biderman, S., Schoelkopf, H., Anthony, Q. G., et al. (2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." ICML 2023. arXiv:2304.01373. The Pythia model suite that powers most interpretability research; intermediate checkpoints enable training-dynamics studies.
Marks, S., & Tegmark, M. (2023). "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets." arXiv:2310.06824. Reference truth-probe dataset used in interpretability work.