"Trillions of tokens go in, sometimes a model comes out, and then we argue for three years about which thousand of those tokens mattered."
Census, Data-Census-Pedant AI Agent
Pretraining corpora and evaluation benchmarks are the two halves of an LLM's empirical identity. This section maps the corpora that frontier labs use (RedPajama, FineWeb, Dolma, The Stack) and the benchmarks that gate every release (MMLU, MMLU-Pro, GSM8K, HumanEval, MATH, HellaSwag, ARC, BBH, IFEval, AGIEval), so when an LLM paper claims state-of-the-art you know what that benchmark measures and where its known contamination leaks live.
Prerequisites
This section assumes the pretraining objective from Section 6.2 and the deduplication-and-quality-filter pipeline from Section 6.4. The eval-leakage discussion is revisited later in the book.
Part II's datasets split into two roles: pretraining corpora (the trillions of tokens a frontier model ingested) and evaluation benchmarks (the suites researchers run to compare models). The two have very different requirements. Pretraining datasets are measured in petabytes and you almost never load them locally; benchmarks are measured in megabytes and they live in your test loop.
This section catalogues the corpora that frontier 2026 models were trained on (or claim to have been trained on), the public pretraining corpora that academic teams actually use, and the evaluation suites that have replaced GLUE as the de facto comparison.
10.9.1 Pretraining corpora
MMLU was introduced in 2020 as a 57-subject test that no model could pass; by 2024 several frontier models scored above the human expert baseline. The community's response was not to celebrate but to immediately build MMLU-Pro, GPQA-Diamond, and harder successors, on the working assumption that any benchmark a model saturates is a benchmark that has already been memorized.
- FineWeb: Hugging Face's 15 trillion-token deduplicated Common Crawl extract. The canonical open pretraining corpus of 2024-26. Comes with a 10BT, 100BT, and 1T-token sample for academic-scale work.
- FineWeb-Edu: educational-content filter of FineWeb. Smaller (1.3T tokens) but consistently outperforms FineWeb at equal compute budgets.
- Dolma (AllenAI): 3 trillion tokens; the corpus behind OLMo. Fully documented data sheet; the cleanest replication target.
- RedPajama-V2: 30 trillion tokens of web data, deduplicated. Used in training of RedPajama / Snowflake Arctic family.
- Common Crawl: the raw substrate of all the above. ~250B web pages, monthly snapshots since 2008. Almost never used directly; you want a curated derivative.
- The Stack v2 (BigCode): 67 TB of permissively-licensed source code; the canonical code-pretraining corpus.
- SmolLM3 corpus (Hugging Face, 2025) and FineWeb 2 (2024-12, multilingual): the canonical post-FineWeb releases for academic small-model and multilingual pretraining.
- Nemotron-CC (NVIDIA, 2024): a 6.3T-token filtered Common Crawl extract that competes with FineWeb on small-model pretraining quality.
- DataComp-LM (Li et al., 2024): the methodology paper behind modern corpus curation, with public DCLM-Baseline as a reproducible recipe.
10.9.2 Evaluation benchmarks
- MMLU: 57-subject multiple-choice exam, ~14k questions. The most-cited 2020-2024 LM benchmark; saturated by 2024 frontier models (~90% accuracy).
- GPQA: 448 graduate-level science questions, Google-proof. The successor to MMLU at higher difficulty.
- Arena-Hard: 500 high-difficulty open-ended prompts scored by LLM-as-judge. Correlates strongly with LM Arena Elo.
- LM Arena: blind pairwise human comparison; updated weekly. The de facto ranking of chat models.
- lm-evaluation-harness: EleutherAI's standardized harness; runs MMLU, GPQA, ARC, HellaSwag, TruthfulQA, and ~150 other benchmarks with consistent prompting.
- OpenAI simple-evals: OpenAI's lightweight harness, used in GPT model cards.
- Humanity's Last Exam (Phan et al. 2025): 2,500 multimodal expert questions; the current frontier-model benchmark.
- LiveCodeBench (Jain et al., 2024): contamination-resistant code benchmark with monthly-refreshed problems; the right code companion to SWE-bench.
- ZeroEval (Tianle Li, 2024): a small contamination-resistant suite that paper authors increasingly use as a sanity check before reporting MMLU.
- Reasoning benchmark suite: AIME 2024 / 2025, MATH-500, Codeforces Elo, and GPQA-Diamond together form the single most-cited benchmark cluster of 2025, particularly for reasoning-fine-tuned models like DeepSeek-R1 and o3.
- SimpleQA (OpenAI, 2024): the canonical "how often does the model hallucinate facts" benchmark of 2025; ~4,000 short factual questions with verified answers.
10.9.3 Comparing the pretraining corpora
| Corpus | Size | Source | Best for |
|---|---|---|---|
| FineWeb | 15T tokens | Common Crawl | Reproducing frontier pretraining |
| FineWeb-Edu | 1.3T tokens | FineWeb filter | Academic-scale pretraining |
| Dolma | 3T tokens | Mixed sources | Reproducible recipe |
| RedPajama-V2 | 30T tokens | Common Crawl + GitHub | Multilingual experiments |
| The Stack v2 | 67 TB code | Permissively-licensed GitHub | Code-pretraining |
MMLU, GPQA, HumanEval, ARC, HellaSwag: every public benchmark predates the current frontier training cycles, which means every frontier model has seen most of them. The only benchmarks where contamination can still be ruled out are post-2024 benchmarks like Humanity's Last Exam, ARC-AGI-2, and the held-out splits of Arena-Hard. Treat single-number MMLU comparisons as a rough sanity check, not a ranking.
LM Arena's human-judged pairwise comparisons are the noisiest benchmark to read but the hardest to contaminate. When a 2026 paper reports a 2-point MMLU bump, check the LM Arena Elo: if it did not move, the MMLU bump is probably training-data spillover.
The DeepSeek-V3 technical report (Dec 2024) disclosed a base-model training cost of approximately $5.6M for the V3 base on 2048 H800s, training on 14.8T tokens of curated corpus. The corpus pipeline itself, deduplication, language ID, quality filtering, takes weeks of preprocessing before the first GPU sees a token. The numeric anchor is useful when reading "this lab pretrained from scratch" claims: the data pipeline often costs more wall-clock time than the GPU run.
10.9.4 Mech-interp inspection datasets
Mech-interp work in Chapter 10 uses smaller, purpose-built datasets: Anthropic's feature exploration sets, auto-circuit's IOI prompts, and the activation-patching benchmark Easy-Transformer ships. These datasets are tiny (hundreds to thousands of prompts) and are not for measuring model capability; they are for measuring whether a specific circuit you found is the circuit the model uses.
- Pretraining corpora and evaluation benchmarks serve opposite roles: corpora live in petabytes on a cluster filesystem, benchmarks live in megabytes in your test loop, and confusing the two leads to expensive mistakes.
- FineWeb and FineWeb-Edu are the 2026 default corpora: 15T deduplicated Common Crawl tokens plus a 1.3T educational-content filter that consistently outperforms FineWeb at equal compute, with Dolma, RedPajama-V2, and Nemotron-CC as alternates.
- The Stack v2 and FineWeb 2 fill the specialty roles: 67 TB of permissively-licensed code drives code pretraining, and FineWeb 2 anchors multilingual work, with DCLM-Baseline as the reproducible curation recipe.
- The benchmark roster split from GLUE-era single-numbers: MMLU-Pro, GPQA, AIME, MATH-500, Codeforces, LiveCodeBench, Arena-Hard, and SimpleQA together form the 2025-26 evaluation cluster, with LM Arena pairwise Elo as the human-judged anchor.
- Contamination is universal on pre-2024 benchmarks: every frontier model has seen MMLU, GPQA, HumanEval, ARC, and HellaSwag, so single-number comparisons are sanity checks and Humanity's Last Exam, ARC-AGI-2, and LM Arena are the trustworthy frontier rankings.
- Mech-interp datasets are tiny and purpose-built: Anthropic feature sets, auto-circuit IOI prompts, and Easy-Transformer benchmarks measure whether a discovered circuit is the model's actual circuit, not capability ranking.
What's Next?
In the next section, Section 10.10: Models, we build on the material covered here.