Section 61.3

Datasets and Benchmarks

"Fifteen trillion tokens, deduplicated three ways, mixed by intuition, and still the validation loss insists you should have added more code."

Lexica, Corpus-Curating AI Agent
Big Picture

Training-scale datasets and the benchmarks that measure training performance sit at a different layer of the stack from the eval benchmarks covered earlier in the book. Datasets in this section mean pretraining corpora that reach into the trillions of tokens (FineWeb, FineWeb-Edu, RedPajama-v2, The Pile, C4, mC4, OSCAR, CulturaX, raw CommonCrawl), instruction-tuning and preference corpora (ShareGPT, UltraFeedback, OpenAssistant, Alpaca, FLAN, OpenHermes), and multimodal corpora (LAION-5B, DataComp). Benchmarks here mean training-system benchmarks (MLPerf Training), evaluation harnesses run at scale (EleutherAI lm-evaluation-harness), and the compute-throughput profilers (MFU calculators, NCCL test suites, FLOPS profilers) without which "is my training run efficient?" is unanswerable. The eval-of-models benchmarks (MMLU, GSM8K, SWE-bench) are covered in their model-evaluation chapters; this section covers the systems and pretraining-side data and benchmarks.

At training scale the dataset is the second-largest engineering artifact after the model itself. A 15-trillion-token deduplicated, filtered, mixed pretraining corpus takes months to assemble and is rarely fully open. The 2023-2026 trend has been toward more openness at the corpus level: FineWeb (2024) and FineWeb-Edu (2024) are 15T-token and 1.3T-token open releases respectively, sufficient for a serious open-weights pretraining run. The instruction-tuning side is more fragmented: it is spread across many medium-sized datasets, each contributing a particular style of data. The benchmark side is where production teams gate releases: MLPerf Training measures the throughput of the training system across submissions, while lm-evaluation-harness measures the trained model.

61.3.1 Open pretraining corpora

Open pretraining corpora are the foundation of open-weight model development. The 2024-2026 standards are FineWeb, FineWeb-Edu, RedPajama-v2, and (as the historical reference) The Pile.

61.3.2 Instruction-tuning and alignment datasets

After pretraining comes the supervised fine-tuning (SFT) and preference-tuning (DPO / PPO / GRPO) stages. The 2024-2026 datasets that anchor this layer:

61.3.3 Multimodal corpora

Vision-language and multimodal pretraining has its own canonical datasets, distinct from the text corpora above.

61.3.4 Training and systems benchmarks

Where evaluation benchmarks measure trained models, training benchmarks measure the training system. The 2026 standards are MLPerf Training, the lm-evaluation-harness scale subsets, and the compute-side profilers.

61.3.5 Compute profilers and MFU tooling

Profiling tools are the bridge between "training works" and "training is efficient." The 2026 stack:

Warning
Pretraining-data contamination is a real and growing problem

Benchmark contamination, the presence of test-set examples in pretraining data, has become unmistakable in 2024-2026 as benchmarks like MMLU, GSM8K, and HumanEval are widely repeated in web data. The 2024 paper "Investigating Data Contamination for Pretraining Language Models" (Jiang et al.) examined how contamination in pretraining data affects benchmark performance. The practical implications: (a) on public benchmarks, expect rankings to be partially contamination-driven and verify with held-out benchmarks; (b) when building pretraining corpora, run contamination checks against the benchmarks you care about and report the decontamination procedure in your dataset documentation; (c) when releasing benchmarks, design them with contamination resistance in mind (held-out partitions, canary strings, or examples dated after the training cutoff, which the pretraining data cannot have seen).
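A contamination check of the kind recommended in item (b) can be sketched as word-level n-gram overlap between benchmark examples and corpus documents. This is a minimal illustration, not a specific tool's method: the function names are hypothetical, the tokenization is a simple lowercase word split, and the window size `n` is a tunable assumption (13-word windows are a common choice for long documents; the demo uses a shorter window so the example fits on one line).

```python
import re
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]

def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Lowercase word n-grams of a text (simple regex tokenization)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 13) -> Set[NGram]:
    """Union of all n-grams that occur in any benchmark example."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(document: str, index: Set[NGram], n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)

# Hypothetical benchmark item and two corpus documents.
benchmark = ["Natalia sold clips to 48 of her friends in April and then half as many in May"]
index = build_benchmark_index(benchmark, n=5)
leaked = "blog post: natalia sold clips to 48 of her friends in april!"
clean = "a totally unrelated sentence about web crawling and filtering pipelines"
print(is_contaminated(leaked, index, n=5), is_contaminated(clean, index, n=5))
```

Exact n-gram matching misses paraphrased or translated leaks, so a negative result here is weak evidence of a clean corpus rather than proof.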

61.3.6 Dataset mixing and data-pipeline considerations

A frontier-scale pretraining corpus is rarely a single dataset; it is a mixture with carefully chosen proportions of web (FineWeb / FineWeb-Edu), code (The Stack), math (specialized corpora like Proof-Pile-2, OpenWebMath), academic text (arXiv, S2ORC), books (Books3 historically, replacements in newer training), and reference (Wikipedia). The 2024 Llama-3 disclosure indicated roughly 50 percent general knowledge, 25 percent math and reasoning, 17 percent code, and 8 percent multilingual, with deduplication and quality-filter passes. The Falcon team's 2024 report documented similar mixing. The relevant data-pipeline tools:

Library Shortcut
datasets streaming for trillion-token corpora

The Hugging Face datasets library (v3.0+, 2024 to 2026) is the default Python entry point to every open pretraining corpus. Pass streaming=True and you get an IterableDataset backed by HTTP range reads against the Hub, so you can iterate FineWeb's 15 trillion tokens without ever materializing the 30+ TB locally. Compose .shuffle(buffer_size=...), .filter(...), and .map(tokenizer, batched=True) for an end-to-end streaming pretraining loader, then split by rank with .shard(num_shards=world_size, index=rank).

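# Install first (shell): pip install -U datasets transformers
import os

from datasets import load_dataset
from transformers import AutoTokenizer

# Rank and world size as exported by launchers such as torchrun;
# the defaults cover a single-process run.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Stream the 100B-token sample of FineWeb-Edu without downloading the corpus.
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-100BT",
    split="train",
    streaming=True,
)

# Gated repository: accept the license and log in to the Hub before loading.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def tokenize(batch):
    # Truncate each document to at most 8,192 tokens.
    return tok(batch["text"], truncation=True, max_length=8192)

ds = (
    ds
    .shuffle(seed=42, buffer_size=10_000)  # approximate shuffle within a 10k-example buffer
    .map(tokenize, batched=True, remove_columns=["text"])
    .shard(num_shards=world_size, index=rank)  # each rank reads a disjoint subset
)

for example in ds:
    input_ids = example["input_ids"]
    # feed into your trainer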
Code Fragment 61.3.6.1: Streaming pretraining loader over the FineWeb-Edu 100B-token sample.
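The mixture proportions quoted for Llama-3 in this section can be turned into a sampler over several streaming sources. The sketch below is a pure-Python illustration under stated assumptions: `mix_streams` and `LLAMA3_MIX` are hypothetical names, each source is any Python iterable, and an exhausted source is simply dropped while the remaining weights are renormalized.

```python
import itertools
import random
from typing import Dict, Iterable, Iterator, Tuple

# Approximate proportions from the Llama-3 disclosure cited in this section.
LLAMA3_MIX = {"general": 0.50, "math_reasoning": 0.25, "code": 0.17, "multilingual": 0.08}

def mix_streams(
    streams: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, example) pairs, picking the source by weight."""
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    while iterators:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # Source exhausted: drop it; the rest are renormalized implicitly.
            del iterators[name]

# Demo with endless dummy sources that just emit their own name.
demo_streams = {name: itertools.repeat(name) for name in LLAMA3_MIX}
sample = list(itertools.islice(mix_streams(demo_streams, LLAMA3_MIX, seed=0), 20_000))
general_share = sum(1 for name, _ in sample if name == "general") / len(sample)
print(f"general share: {general_share:.3f}")
```

With Hugging Face `datasets`, `interleave_datasets(..., probabilities=[...], seed=...)` plays a similar role for `IterableDataset` objects; the hand-rolled version is shown only to make the sampling logic explicit.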

61.3.7 Mapping the data and benchmark stack

Figure 61.3.1: The 2026 LLM data and benchmark stack, spanning crawl-scale corpora, streaming pretraining loaders, evaluation suites, and benchmark hubs that anchor reproducible scaling work. The figure's lanes cover pretraining corpora, instruction and preference datasets, evaluation suites, and metadata or hub services, with representative providers per lane.

61.3.8 Eval suites for production training decisions

The standard practice in 2026 production training is to maintain three layers of evaluation:

The cost gradient is roughly 1:10:100 across the three layers: cheap streaming evals, moderate checkpoint evals, expensive release evals. Production teams budget compute for each layer separately.

Key Insight
"Compute the test loss on the same shard you trained on" is the most common bug

A surprising number of training runs are silently broken because the held-out validation shard overlaps with training data, producing artificially low validation loss that masks overfitting. The defensive practice: deterministically partition the dataset by document hash into train / val / test before any training and never let documents cross partitions; keep the val set small enough to evaluate frequently, but do not use it to make hyperparameter decisions (use a separate dev set for that); use test only at the end. This is dataset hygiene 101, but at the scale of trillion-token corpora distributed across many shards on multiple object stores, it is also where bugs hide.
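Partitioning by document hash can be sketched in a few lines. The version below is one possible scheme, not a prescribed one: `doc_key`, `assign_split`, the salt, and the default fractions are all illustrative assumptions. It hashes normalized document text so that trivially different copies of the same document land in the same split.

```python
import hashlib
import re
from typing import Sequence, Tuple

def doc_key(text: str) -> str:
    """Normalize whitespace and case so near-identical copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()

def assign_split(
    text: str,
    fractions: Sequence[Tuple[str, float]] = (("test", 0.001), ("val", 0.001), ("dev", 0.001)),
    salt: str = "split-v1",
) -> str:
    """Map a document to a split using a salted SHA-256 of its normalized text."""
    digest = hashlib.sha256(f"{salt}\x00{doc_key(text)}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return "train"

# The same document always gets the same split, on any worker or shard.
print(assign_split("Hello   world") == assign_split("hello world\n"))
```

Because the assignment depends only on the document and the salt, it can run independently on every shard and object store without coordination, and changing the salt yields a fresh partition.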

61.3.9 Dataset licensing and attribution

Dataset licensing has become much more important in 2024-2026 with the rise of training-data litigation (NYT v. OpenAI, Bartz v. Anthropic, the various class actions). The practical 2026 landscape:

Document your dataset provenance and licensing decisions; the legal landscape is evolving and the record-keeping is now part of the engineering work.

61.3.10 Scaling laws and data-budget planning

The training-data side intersects with scaling laws in concrete ways. The Chinchilla scaling law (Hoffmann et al., 2022) found that compute-optimal training scales parameters and tokens together at roughly 20 tokens per parameter (a 70B model is compute-optimal at ~1.4T tokens). Subsequent work has shown that overtraining past Chinchilla-optimal continues to improve performance, just with diminishing returns per FLOP. The 2024-2026 frontier models train substantially past Chinchilla-optimal: Llama-3.1 405B at ~15T tokens is ~37 tokens per parameter; Llama-3.1 8B at 15T is ~1875 tokens per parameter. The reasoning is that serving cost scales with parameter count, not with the number of training tokens (a 405B model is expensive to serve, regardless of how it was trained), so training the same parameters longer is a cheap insurance policy.
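The tokens-per-parameter figures above are easy to reproduce. The sketch below uses the 20-tokens-per-parameter rule of thumb from this section; the function names are illustrative only.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted in this section

def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params

def chinchilla_optimal_tokens(params: float) -> float:
    """Approximate compute-optimal token count for a given parameter count."""
    return CHINCHILLA_TOKENS_PER_PARAM * params

def overtraining_factor(params: float, tokens: float) -> float:
    """How many times past the compute-optimal token count a run goes."""
    return tokens / chinchilla_optimal_tokens(params)

for name, params, tokens in [
    ("70B at Chinchilla-optimal", 70e9, 1.4e12),
    ("Llama-3.1 405B", 405e9, 15e12),
    ("Llama-3.1 8B", 8e9, 15e12),
]:
    print(
        f"{name}: {tokens_per_param(params, tokens):.0f} tokens/param, "
        f"{overtraining_factor(params, tokens):.1f}x compute-optimal"
    )
```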

This affects data-curation requirements: at 15T+ token budgets, even FineWeb (15T tokens) and RedPajama-v2 (30T tokens) are not inexhaustible. Repeating tokens (epoching) is increasingly common; work from 2023 onward (Muennighoff et al., 2023; subsequent FineWeb ablations) suggests that up to 4-8 epochs of high-quality data outperform 1 epoch of mixed-quality data at the same compute. The practical implication: curate aggressively, then repeat the curated corpus, rather than diluting with low-quality data to reach a single-epoch token budget.
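A quick planning calculation makes the repetition trade-off concrete. The helper below is a hypothetical sketch: `plan_repetition` is not from any library, and the default cap of 4 epochs is an assumption taken from the lower end of the 4-8 range mentioned above.

```python
from typing import Tuple

def plan_repetition(
    token_budget: float,
    curated_tokens: float,
    max_epochs: float = 4.0,
) -> Tuple[float, float, float]:
    """Return (epochs if curated data alone fills the budget,
    epochs actually used under the cap,
    tokens that must come from other sources)."""
    epochs_if_curated_only = token_budget / curated_tokens
    epochs_used = min(epochs_if_curated_only, max_epochs)
    shortfall = max(0.0, token_budget - epochs_used * curated_tokens)
    return epochs_if_curated_only, epochs_used, shortfall

# A 15T-token budget filled from a 1.3T-token curated corpus (FineWeb-Edu scale).
needed, used, shortfall = plan_repetition(15e12, 1.3e12)
print(f"epochs needed: {needed:.1f}, epochs used: {used:.0f}, shortfall: {shortfall / 1e12:.1f}T tokens")
```

When the first number far exceeds the cap, the curated corpus alone cannot carry the run, and the shortfall shows how many tokens the rest of the mixture has to supply.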

Numeric Example
the \$1-2M all-in price of a 70B 1T-token pretrain

What does a "modest" 70B pretrain actually cost in 2026? The Kaplan FLOPs approximation $C \approx 6 P T$ for a dense transformer with $P = 70 \times 10^9$ parameters and $T = 1 \times 10^{12}$ tokens gives $C \approx 4.2 \times 10^{23}$ FLOPs. Run on H100 SXM5 at $\pi_{\text{peak}} = 989$ TFLOPS BF16 and an achievable $\mathrm{MFU} = 0.45$, the total GPU time is $C / (\pi_{\text{peak}} \cdot \mathrm{MFU}) = 4.2 \times 10^{23} / (989 \times 10^{12} \cdot 0.45) \approx 9.44 \times 10^{8}$ GPU-seconds, or $\approx 262{,}000$ GPU-hours. At the 2026 specialized-cloud rate of $2.40 \,\text{USD}/\text{GPU-hour}$ (CoreWeave / Lambda reserved tier) the headline compute bill is $262{,}000 \times 2.40 \approx 629{,}000$ USD.

The headline understates the real cost. Add: (a) data preparation, $\sim \$50$-$150$K for tokenization, dedup, filter, and shuffle of 1T+ tokens on a CPU cluster; (b) ablations and burn-in runs, typically 20-40% of the production-run compute spent on small-scale recipes and 1-hour sanity passes, so $\$130$-$250$K; (c) fault-recovery overhead: even at 90% useful time (10% lost to checkpointing, restarts, and failures), this adds roughly another 10% on the headline, $\sim \$63$K; (d) storage and egress, 6-10 TB of checkpoints + 30-60 TB of dataset at FSx-for-Lustre and S3 rates, $\$30$-$80$K over the run; (e) engineering time, 2-4 senior engineers for 2-3 months at fully-loaded $\$50$K/month, $\$200$-$600$K. Summing the headline with items (a) through (e) gives an all-in landed cost of roughly $\$1.1$-$1.8$M for a "\$629K compute" pretrain in 2026, which is why "we'll train our own 70B" still requires a venture-capital-class budget rather than a research-grant-class one. Doubling to a 2T-token run roughly doubles compute and storage but barely moves engineering, so the elasticity is more favorable at larger scales.
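The headline compute figure can be reproduced directly from $C \approx 6PT$. The function below is a small sketch with illustrative names; the peak throughput, MFU, and hourly rate are the values assumed in this example, not measured quantities.

```python
from typing import Tuple

def headline_compute_cost(
    params: float,
    tokens: float,
    peak_flops: float,
    mfu: float,
    usd_per_gpu_hour: float,
) -> Tuple[float, float, float]:
    """Return (training FLOPs, GPU-hours, compute cost in USD) from C ~ 6 * P * T."""
    flops = 6.0 * params * tokens
    gpu_seconds = flops / (peak_flops * mfu)
    gpu_hours = gpu_seconds / 3600.0
    return flops, gpu_hours, gpu_hours * usd_per_gpu_hour

flops, gpu_hours, usd = headline_compute_cost(
    params=70e9,            # 70B parameters
    tokens=1e12,            # 1T training tokens
    peak_flops=989e12,      # H100 SXM5 BF16 peak used in the text
    mfu=0.45,               # assumed model FLOPs utilization
    usd_per_gpu_hour=2.40,  # assumed reserved-tier rate
)
print(f"{flops:.2e} FLOPs, {gpu_hours:,.0f} GPU-hours, ${usd:,.0f}")
```

Re-running with a different MFU shows how sensitive the bill is to utilization: the cost scales inversely with MFU, so dropping from 0.45 to 0.30 raises the compute bill by half.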

61.3.11 Canary strings, decontamination, and the test-set integrity problem

The 2024-2026 standard practice for detecting contamination is the canary string: a unique random token sequence embedded in a benchmark's files that should never appear in pretraining data. The BIG-Bench canary (a specific UUID-like string) is one example; the lm-evaluation-harness benchmarks increasingly include canaries. The protocol: at evaluation time, the team checks whether the trained model can complete the canary verbatim; if it can, the benchmark is presumed contaminated and the result is reported with a contamination flag.
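Both sides of the canary protocol can be sketched briefly: scanning corpus documents for the string, and testing whether a model completes it. Everything named here is hypothetical: `CANARY` is a placeholder rather than any real benchmark's canary, and `generate` stands in for whatever greedy text-completion call your inference stack exposes.

```python
from typing import Callable, Iterable, List

# Placeholder value for illustration only; not a real benchmark canary.
CANARY = "BENCHMARK-CANARY-GUID 00000000-0000-4000-8000-000000000000"

def docs_containing_canary(docs: Iterable[str], canary: str = CANARY) -> List[int]:
    """Indices of corpus documents that contain the canary verbatim."""
    return [i for i, doc in enumerate(docs) if canary in doc]

def model_completes_canary(
    generate: Callable[[str], str],
    canary: str = CANARY,
    prefix_chars: int = 30,
) -> bool:
    """Prompt with the canary's prefix and check for a verbatim continuation."""
    prefix, suffix = canary[:prefix_chars], canary[prefix_chars:]
    return generate(prefix).startswith(suffix)

# Stand-in "models": one that memorized the canary, one that did not.
def memorizing_model(prompt: str) -> str:
    return CANARY[len(prompt):]

def clean_model(prompt: str) -> str:
    return " is a string I have never seen."

corpus = ["ordinary web text", f"scraped benchmark file: {CANARY}", "more web text"]
print(docs_containing_canary(corpus))
print(model_completes_canary(memorizing_model), model_completes_canary(clean_model))
```

A model that fails to complete the canary is not proven clean, since the benchmark may have leaked through copies that dropped the canary line; the check only catches the case where the marked files themselves were ingested.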

The 2024-2026 contamination-checking tools:

61.3.12 Data cards and provenance documentation

The 2018 "Datasheets for Datasets" paper (Gebru et al.) introduced the data-card concept; the 2024-2026 practice has matured into a near-standard documentation format. The fields that production teams now expect in a data card:

FineWeb, RedPajama-v2, and The Stack v2 ship exemplary data cards; many older corpora (raw CommonCrawl, early LAION) do not, which complicates downstream auditing. Document the provenance of your dataset choices in your training writeup.

61.3.13 Dataset evaluation checklist

When evaluating a corpus for use in training, the questions that surface real differences:

What's Next?

In the next section, Section 61.4: Models, we build on the material covered here.

Further Reading
Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv preprint arXiv:2406.17557. arxiv.org/abs/2406.17557. The FineWeb paper documenting the 15T-token open corpus and the filtering ablations behind FineWeb-Edu.
Gao, L. et al. (2020). "The Pile: An 825GB Dataset of Diverse Text for Language Modeling." arXiv preprint arXiv:2101.00027. arxiv.org/abs/2101.00027. The original Pile paper; the reference for diverse-source open pretraining corpora that shaped 2021-2023 work.
Cui, G. et al. (2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." arXiv preprint arXiv:2310.01377. arxiv.org/abs/2310.01377. The UltraFeedback paper; the canonical reference for open preference-tuning data.
Gadre, S.Y. et al. (2023). "DataComp: In search of the next generation of multimodal datasets." NeurIPS 2023. arxiv.org/abs/2304.14108. The DataComp paper demonstrating dataset curation as a benchmark in its own right; canonical reference for VLM data curation.
MLCommons (2024). "MLPerf Training v4.0 Results." MLCommons Benchmarks. mlcommons.org/benchmarks/training. The MLPerf Training v4.0 results announcement; the reference for cross-vendor training-system comparison.
Gao, L. et al. (2024). "A framework for few-shot language model evaluation." Zenodo. zenodo.org/records/12608602. The reference citation for lm-evaluation-harness, the canonical open evaluation framework.