"Fifteen trillion tokens, deduplicated three ways, mixed by intuition, and still the validation loss insists you should have added more code."
Lexica, Corpus-Curating AI Agent
Training-scale datasets and the benchmarks that measure training performance are a different layer of the stack from the eval benchmarks covered earlier in the book. Datasets in this chapter mean pretraining corpora measured in billions to trillions of tokens (FineWeb, FineWeb-Edu, RedPajama-v2, The Pile, C4, mC4, OSCAR, CulturaX, raw CommonCrawl), instruction-tuning corpora (ShareGPT, UltraFeedback, OpenAssistant, Alpaca, FLAN, OpenHermes), and multimodal corpora (LAION-5B, DataComp, OBELICS). Benchmarks here mean training-system benchmarks (MLPerf Training), evaluation harnesses run at scale (EleutherAI lm-evaluation-harness), and the compute-throughput profilers (MFU calculators, NCCL test suites, FLOPS profilers) without which "is my training run efficient?" is unanswerable. The model-evaluation benchmarks (MMLU, GSM8K, SWE-bench) are covered in their own chapters; this section covers the systems-side and pretraining-side data and benchmarks.
At training scale the dataset is the second-largest engineering artifact after the model itself. A 15-trillion-token deduplicated, filtered, mixed pretraining corpus takes months to assemble and is rarely fully open. The 2023-2026 trend has been toward more openness at the corpus level: FineWeb (2024) and FineWeb-Edu (2024) are 15T-token and 1.3T-token open releases respectively, sufficient for a serious open-weights pretraining run. The instruction-tuning side is fragmented across many medium-sized datasets, each contributing a particular style of data. The benchmark side is where production teams gate releases: MLPerf Training measures the throughput of the training system, and lm-evaluation-harness measures the trained model.
61.3.1 Open pretraining corpora
Open pretraining corpora are the foundation of open-weight model development. The 2024-2026 standards are FineWeb, FineWeb-Edu, RedPajama-v2, and (as the historical reference) The Pile.
- FineWeb and FineWeb-2 (Hugging Face, 2024-2025): FineWeb is the 15-trillion-token English-language web corpus filtered from 96 CommonCrawl snapshots, published with full filtering code and ablation studies. Its objective is to provide an open, fully reproducible alternative to the closed web-scale corpora used by Llama, Mistral, and others, which matters because before FineWeb, open-weight pretraining at frontier scale required teams to assemble their own corpus from raw CommonCrawl. The core technique is heuristic + classifier-based filtering plus exact and fuzzy deduplication; the 2025 FineWeb-2 release extends to 1000+ languages with similar filtering. Pick FineWeb as the default open pretraining corpus when training from scratch and English is the primary language; pick FineWeb-2 for multilingual coverage.
- FineWeb-Edu (Hugging Face, 2024) is the 1.3-trillion-token education-quality subset of FineWeb, filtered with a lightweight classifier trained on Llama-3-70B-Instruct annotations that score documents on educational value. Its objective is to be the highest-quality slice of open web data, which matters because token-for-token, training on FineWeb-Edu has been shown to outperform training on raw FineWeb on knowledge-heavy benchmarks. Pick FineWeb-Edu as the high-quality core of a pretraining mixture; combine with FineWeb for general coverage and with code / math corpora for those domains.
- RedPajama-v2 (Together AI / Ontocord, 2023; v2 with 30T+ tokens) is the successor to RedPajama-v1, Together's open reproduction of the Llama-1 pretraining corpus. The v2 release expands to 30+ trillion tokens across 84 CommonCrawl snapshots and ships rich quality-signal metadata that enables custom filtering. Its objective is to be the open, fully versioned pretraining corpus with extensive metadata for experimentation, which matters when you want to do your own quality filtering rather than accept FineWeb's choices. Pick RedPajama-v2 when you want to experiment with custom filtering / mixing recipes; for off-the-shelf training, FineWeb is more directly usable.
- The Pile (EleutherAI, 2020) is the original ~825GB open pretraining corpus assembled by EleutherAI for the GPT-Neo and Pythia training runs. Although superseded in scale by FineWeb and RedPajama-v2, The Pile remains valuable as the canonical reference corpus for reproducing 2021-2023-era research and as the basis for the Pythia interpretability suite. Pick The Pile for reproductions of older research and for interpretability work that requires the Pile-trained Pythia checkpoints; for new pretraining, the 2024+ corpora dominate.
- C4 (Colossal Clean Crawled Corpus) (Google, 2020; T5 paper) is the ~750GB English web corpus introduced by the T5 paper, derived from one CommonCrawl snapshot with heuristic filtering. C4 is the canonical baseline corpus referenced in many subsequent papers and was used in early Llama. Pick C4 for direct comparisons with T5 / Llama-1 baselines; for new training, larger and better-filtered corpora dominate.
- mC4 (multilingual C4) (Google, 2021; mT5 paper) is the 101-language extension of C4, the canonical reference for multilingual pretraining experiments. For new multilingual training, FineWeb-2 and CulturaX are better-curated; mC4 remains useful for comparison.
- OSCAR (Inria / Sorbonne, 2019; ongoing) is the long-running multilingual CommonCrawl corpus from the OSCAR project, covering ~150 languages and published in versioned releases. Pick OSCAR for multilingual research where the well-documented language-detection and quality classifiers are valuable; for sheer scale, CulturaX (which incorporates OSCAR and mC4) is larger.
- CommonCrawl raw archives (CommonCrawl Foundation, 2008+) are the underlying source for nearly all open pretraining corpora: a crawl of the public web, published by a non-profit in roughly monthly snapshots and totaling petabytes of WARC archives. Pick raw CommonCrawl only when you specifically need to filter from scratch (e.g., for a domain-specific corpus or a custom quality classifier); for production pretraining, the filtered corpora (FineWeb, RedPajama-v2) save months of work.
- The Stack v2 (BigCode / Hugging Face, 2024) is the code-pretraining corpus of 3B+ source files across 600+ programming and markup languages, the open reference for code-LLM pretraining. Pick The Stack v2 when training a code model or adding code to a general pretraining mixture; for general pretraining mixtures, 5 to 20 percent code is the typical recipe.
- CulturaX (University of Oregon / Adobe Research, 2023; hosted on Hugging Face) is a 6.3-trillion-token multilingual corpus aggregating mC4 and OSCAR with extensive cleaning. Pick CulturaX for multilingual pretraining when you want a single off-the-shelf large multilingual corpus; for the newest curation, FineWeb-2 is competitive.
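The fuzzy-deduplication step mentioned for FineWeb-style pipelines is usually some variant of MinHash. The following is a minimal pure-Python sketch of the idea, not the implementation any of the corpora above actually use: the shingle size, the number of hash functions, and the use of salted `blake2b` as the hash family are all illustrative assumptions.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (shingles) in a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates. Production pipelines add locality-sensitive hashing on top so that signatures are compared only within candidate buckets rather than pairwise.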
61.3.2 Instruction-tuning and alignment datasets
After pretraining comes the supervised fine-tuning (SFT) and preference-tuning (DPO / PPO / GRPO) stages. The 2024-2026 datasets that anchor this layer:
- ShareGPT (community scrapes, 2023) is the collection of user-shared ChatGPT conversations that became the basis for Vicuna and many other early open instruction-tuned models. Its objective was to capture real-world ChatGPT interactions for SFT, which matters historically as the first large-scale conversational SFT corpus. Pick ShareGPT-derived datasets when you specifically want ChatGPT-style conversational data; for higher-quality alternatives, newer datasets are better-curated and less terms-of-service-encumbered.
- UltraFeedback (Tsinghua, 2023; binarized variant 2024) is the 64K-prompt preference dataset generated by aggregating multiple model responses and scoring with GPT-4-as-judge. Its objective is to provide a high-quality DPO training corpus that is open and reproducible, which matters because before UltraFeedback, large-scale DPO training relied mostly on proprietary preference data. The binarized variant (Hugging Face H4) became the canonical DPO baseline for Zephyr and many subsequent open alignment efforts. Pick UltraFeedback as the default DPO dataset for general preference tuning; for domain-specific alignment, generate task-specific preference data.
- OpenAssistant Conversations (OASST1 / OASST2) (LAION et al., 2023-2024) is the open conversational SFT corpus produced by a crowdsourced effort, with multi-turn conversations, multiple replies per turn, and human quality ratings. Its objective is to be the canonical open conversational SFT dataset with auditable provenance (every message is attributed to a specific human contributor), which matters when you want a clean SFT dataset without ChatGPT-derived data. Pick OASST when terms-of-service cleanliness matters and conversational SFT is the goal.
- Alpaca (Stanford, 2023) is the 52K instruction-tuning dataset generated by Stanford using GPT-3.5 to self-instruct from a seed of 175 human-written tasks. Historically important as the dataset behind Stanford Alpaca, the first widely-replicated open instruction-tuned model based on Llama. Pick Alpaca for reproductions and tutorials; for production SFT, newer and larger datasets dominate.
- FLAN collection (Google, 2021; FLAN v2 in 2022) is Google's instruction-tuning megamix of 1,800+ tasks formatted as instructions, used to train FLAN-T5 and FLAN-PaLM. Its objective is to be the canonical academic-task instruction-tuning corpus, which matters because instruction-tuning with FLAN reliably improves zero-shot generalization. Pick FLAN when you want broad academic-task coverage in SFT; combine with conversational and open-ended datasets for general assistants.
- OpenHermes-2.5 (Nous Research, 2024) is the community-curated 1M-conversation SFT dataset combining ShareGPT-derived, synthetic, and reasoning-trace data, the basis for the Hermes-2.5 and Hermes-3 model families. Pick OpenHermes as a strong off-the-shelf general SFT mixture; many open-weight assistants fine-tuned on it are competitive.
- No Robots (Hugging Face, 2023) is a 10K-conversation high-quality SFT dataset written by skilled human annotators rather than generated by models. Pick No Robots when you specifically want human-written rather than model-generated SFT data (for example, to avoid model-collapse style failure modes on synthetic-only training).
- SmolTalk (Hugging Face, 2024) is the synthetic conversational SFT dataset used to train the SmolLM2 family, with carefully curated synthetic generation for math, code, instruction-following, and general conversation. Pick SmolTalk as a recent reference for synthetic-first SFT corpora that nonetheless deliver competitive trained models.
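To make the "binarized" preference format concrete, here is a small sketch of how several judge-scored completions for one prompt can be collapsed into a single chosen / rejected pair for DPO. The rule used here (highest score is chosen, one of the remaining completions is sampled as rejected) is an assumption modeled on how the binarized UltraFeedback variant is commonly described; the function name and record layout are hypothetical.

```python
import random


def binarize(prompt: str, completions: list, rng: random.Random) -> dict:
    """Collapse scored completions into one DPO preference pair.

    completions: list of (text, score) tuples, with at least two entries.
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions to form a pair")
    ranked = sorted(completions, key=lambda c: c[1], reverse=True)
    chosen_text, chosen_score = ranked[0]
    # Assumption: sample the rejected answer from the non-best completions.
    rejected_text, rejected_score = rng.choice(ranked[1:])
    return {
        "prompt": prompt,
        "chosen": chosen_text,
        "rejected": rejected_text,
        "score_gap": chosen_score - rejected_score,
    }


pair = binarize(
    "Explain overfitting in one sentence.",
    [("answer A", 8.5), ("answer B", 6.0), ("answer C", 3.5)],
    random.Random(0),
)
```

Keeping the score gap alongside each pair makes it easy to drop near-ties later, which is one common way to reduce label noise in preference data.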
61.3.3 Multimodal corpora
Vision-language and multimodal pretraining has its own canonical datasets, distinct from the text corpora above.
- LAION-5B (LAION, 2022) is the 5.85-billion-image-text-pair web-scraped dataset that became the standard reference for vision-language pretraining (CLIP, Stable Diffusion, many VLMs). Its objective is to be the open, scale-equivalent dataset to the proprietary OpenAI / Google web-scale image-text data, which matters because before LAION, open VLM training was data-limited. Pick LAION-5B for VLM pretraining; expect to apply substantial additional filtering (the raw dataset has known content and quality issues).
- DataComp / DataComp-1B (LAION et al., 2023) is the dataset-curation benchmark plus the curated 1.4B-pair subset that demonstrated dataset curation matters more than dataset scale beyond a threshold. Its objective is to be the canonical "how good can you make image-text data via filtering" benchmark, which matters because the DataComp methodology has been adopted by most subsequent VLM training. Pick DataComp's curated subsets when training VLMs and curation quality dominates scale.
- OBELICS (Hugging Face, 2023) is the 141M-document interleaved image-text web corpus, with 353M images and 115B text tokens preserving the natural interleaving structure of web pages. Its objective is to be the open analog of the Flamingo / IDEFICS-style multimodal interleaved pretraining data, which matters for models that need to handle multiple images per document. Pick OBELICS for interleaved multimodal training; for paired image-text only, LAION-5B / DataComp are simpler.
- ShareGPT4V and ShareGPT4o (multiple authors, 2023-2024) are GPT-4V / GPT-4o-generated image caption corpora, used for VLM SFT to improve description and reasoning capabilities. Pick these when training a VLM that needs strong description quality; expect to combine them with grounded-VQA datasets for question-answering.
61.3.4 Training and systems benchmarks
Where evaluation benchmarks measure trained models, training benchmarks measure the training system. The 2026 standards are MLPerf Training, the lm-evaluation-harness scale subsets, and the compute-side profilers.
- MLPerf Training (MLCommons, 2018; v3.x and v4.x in 2024-2025) is the standardized industry benchmark suite for training systems, with rules-bound submissions from NVIDIA, Google, Intel, Habana, and others. Its objective is to provide an apples-to-apples comparison of training systems on standardized models and datasets (BERT-Large, GPT-3 175B, Stable Diffusion, Llama-2 70B LoRA, Llama-3.1 405B in 2024-25), which matters because vendor-reported throughput numbers are otherwise un-comparable. Pick MLPerf Training results as the reference for cross-vendor system comparisons; for your specific workload, run your own benchmarks (MLPerf models are not always representative of your job).
- EleutherAI lm-evaluation-harness (EleutherAI, 2021) is the canonical open-source evaluation harness for language models, supporting hundreds of benchmarks (MMLU, BIG-Bench, HellaSwag, etc.) via a unified API. Its objective is to be the reproducible reference implementation for model evaluation, which matters because subtle differences in benchmark implementation (chain-of-thought versus not, exact prompt format, log-likelihood versus generation) substantially change scores. Pick lm-evaluation-harness as the default for evaluating language models, especially when comparing across papers; the Open LLM Leaderboard uses it as its scoring engine.
- OpenAI Evals (OpenAI, 2023) is OpenAI's open-source evaluation framework, more flexible than lm-evaluation-harness for custom evals (especially LLM-as-judge style). Pick OpenAI Evals for custom evals that need LLM-as-judge or registry-based authoring; for standard academic benchmarks, lm-evaluation-harness is the canonical implementation.
- NCCL Tests (NVIDIA, 2017+) is the canonical benchmark suite for testing NCCL collective communication performance on a given cluster. Its objective is to measure all-reduce, all-gather, reduce-scatter, broadcast, and other collective bandwidth in isolation, which matters as the first diagnostic when your training run shows lower-than-expected throughput. Pick nccl-tests as the first benchmark on any new training cluster to verify the interconnect; gaps between achieved and theoretical NCCL bandwidth indicate fabric or driver issues that will haunt the training run.
- HELM (Holistic Evaluation of Language Models) (Stanford CRFM, 2022) is Stanford's broader evaluation framework covering many dimensions beyond raw accuracy (calibration, robustness, fairness, bias, toxicity, efficiency). Pick HELM when the evaluation question is broader than "what is the MMLU score?"; for headline benchmarks, lm-evaluation-harness is more common.
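When reading nccl-tests output, the two bandwidth columns are easy to confuse. The sketch below shows the arithmetic as I understand it from the nccl-tests performance notes: algorithm bandwidth is message size over time, and bus bandwidth rescales it by a collective-specific factor so that the number is comparable to link speed regardless of rank count. The function names are hypothetical, and the factor table should be checked against the nccl-tests documentation for your version.

```python
def algbw_gb_per_s(message_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: bytes moved per second from the caller's view."""
    return message_bytes / seconds / 1e9


def busbw_gb_per_s(collective: str, algbw: float, n_ranks: int) -> float:
    """Bus bandwidth: algbw scaled by the collective's traffic factor."""
    n = n_ranks
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
        "broadcast": 1.0,
    }
    return algbw * factors[collective]


# Example: a 1 GB all-reduce that takes 10 ms across 8 ranks.
alg = algbw_gb_per_s(1e9, 0.010)
bus = busbw_gb_per_s("all_reduce", alg, n_ranks=8)
```

Comparing the bus bandwidth against the nominal per-GPU interconnect bandwidth is what reveals the fabric or driver gaps described above.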
61.3.5 Compute profilers and MFU tooling
Profiling tools are the bridge between "training works" and "training is efficient." The 2026 stack:
- NVIDIA Nsight Systems and Nsight Compute (NVIDIA, 2018+) is the canonical NVIDIA profiler pair: Nsight Systems for system-wide timeline profiling (kernel launches, NCCL collectives, host-to-device transfers), Nsight Compute for kernel-level detail (warp occupancy, SM efficiency, memory bandwidth). Pick Nsight Systems as the first profiler for "why is my training slow?" (it shows whether you are compute-bound, communication-bound, or stall-bound); Nsight Compute for specific kernel optimization.
- PyTorch Profiler (Meta, 2020+) is the in-tree PyTorch profiler with TensorBoard and Chrome trace exports. Its objective is to provide PyTorch-aware profiling (per-operator timing, per-call CUDA kernel breakdown) without external tooling, which matters for the common case where Nsight is overkill. Pick PyTorch Profiler for routine performance debugging; drop to Nsight when you need GPU-internal detail.
- MFU calculators and FLOPS estimators (various; Levanter docs, Megatron-LM scripts, Google's MaxText) are the small utility scripts that compute the theoretical FLOPs of your model architecture per step, divide by the step time to get achieved FLOPS, and divide again by the aggregate peak FLOPS of the hardware to report MFU. The objective is to convert raw step times into the cross-comparable MFU metric; the canonical estimate for a dense transformer is roughly 6 * N_params * tokens_per_step FLOPs per training step (about 2 per parameter per token for the forward pass and 4 for the backward pass; the optimizer update and attention-score FLOPs are ignored in this approximation). Pick the formula in the Levanter and Megatron docs as the reference; report MFU in every training writeup.
- EleutherAI Cookbook (EleutherAI, 2024+) is the EleutherAI repository of reference calculations, training recipes, and scale-up guides including FLOPS counters and parallelism-strategy calculators. Pick the Cookbook as the practical reference when planning a training run; the cost and timing calculators alone save substantial back-of-envelope work.
- Triton's `proton` profiler (OpenAI, 2024+) is the Triton-specific profiler for Triton-generated kernels (which include most torch.compile output and FlashAttention). Pick proton when you specifically need to profile Triton kernel internals; for general profiling, Nsight or PyTorch Profiler are broader.
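The MFU calculation described in the list above fits in a few lines. This is a minimal sketch using the 6 * N * tokens approximation for a dense transformer; the function name and argument names are hypothetical, and the peak-FLOPS figure must come from the vendor datasheet for the precision you actually train in.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_step: float,
    step_seconds: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Achieved model FLOPS divided by aggregate hardware peak FLOPS."""
    # Dense-transformer approximation: ~2 FLOPs/param/token forward, ~4 backward.
    flops_per_step = 6.0 * n_params * tokens_per_step
    achieved_flops_per_second = flops_per_step / step_seconds
    return achieved_flops_per_second / (n_gpus * peak_flops_per_gpu)


# Toy numbers chosen for easy arithmetic, not a real configuration.
mfu = model_flops_utilization(
    n_params=1e9,
    tokens_per_step=1e6,
    step_seconds=1.0,
    n_gpus=8,
    peak_flops_per_gpu=1e15,
)
```

Because the approximation ignores attention-score FLOPs, it slightly understates utilization for long-context runs; state which formula you used whenever you report the number.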
Benchmark contamination, the presence of test-set examples in pretraining data, has become hard to ignore in 2024-2026 as benchmarks like MMLU, GSM8K, and HumanEval are widely repeated in web data. The 2024 paper "Investigating Data Contamination for Pretraining Language Models" (Jiang et al.) studied how contamination in the pretraining corpus affects benchmark scores. The practical implications: (a) on public benchmarks, expect rankings to be partially contamination-driven and verify with held-out benchmarks; (b) when building pretraining corpora, run contamination checks against the benchmarks you care about and report the decontamination procedure in your dataset documentation; (c) when releasing benchmarks, design them with contamination resistance in mind (held-out partitions, canary strings, or post-2024 dates that pretraining data cannot have seen).
61.3.6 Dataset mixing and data-pipeline considerations
A frontier-scale pretraining corpus is rarely a single dataset; it is a mixture with carefully chosen proportions of web (FineWeb / FineWeb-Edu), code (The Stack), math (specialized corpora like Proof-Pile-2, OpenWebMath), academic text (arXiv, S2ORC), books (Books3 historically, replacements in newer training), and reference (Wikipedia). The 2024 Llama-3 disclosure indicated roughly 50 percent general knowledge, 25 percent math and reasoning, 17 percent code, and 8 percent multilingual, with deduplication and quality-filter passes. The Falcon team's 2024 report documented similar mixing. The relevant data-pipeline tools:
- Hugging Face datasets: the canonical Python library for loading and streaming massive datasets without materializing them locally. Used by virtually every open-source training stack.
- DataTrove (Hugging Face, 2024): the framework Hugging Face used to build FineWeb, providing scalable filtering, deduplication, and quality-classification pipelines. Pick when you need to build or modify a large pretraining corpus.
- RedPajama-Data pipeline: Together's open implementation of the RedPajama curation pipeline; an alternative to DataTrove with different filtering choices.
- MosaicML Streaming (Databricks, 2022): a streaming-friendly dataset format optimized for large-scale training that loads shards from object storage with deterministic ordering and resumability. Pick when you need streaming over multi-petabyte corpora during long training runs.
- WebDataset (NVIDIA, 2020+): a tar-based streaming dataset format widely used in multimodal pretraining (Stable Diffusion training, LAION work). Pick when your data is naturally shardable into tar files and a POSIX filesystem cannot keep up.
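Mixture proportions like the ones quoted above are typically enforced by sampling the source of each document according to fixed weights. The following is a minimal sketch of that idea over plain Python iterators; the source names and the `mix_sources` helper are hypothetical, and the weights simply reuse the proportions quoted in the text for Llama-3.

```python
import random
from typing import Dict, Iterator, Tuple

# Proportions quoted in the text for the Llama-3 disclosure.
WEIGHTS = {
    "general": 0.50,
    "math_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}


def mix_sources(
    sources: Dict[str, Iterator[str]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, document) pairs, picking the source by weight."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(sources[name])
        except StopIteration:
            return  # stop as soon as any source runs dry
```

The Hugging Face datasets library offers `interleave_datasets` with a `probabilities` argument for the same purpose on streaming datasets; the sketch above only illustrates the sampling logic.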
The Hugging Face datasets library (v3.0+, 2024 to 2026) is the default Python entry point to every open pretraining corpus. Pass streaming=True and you get an IterableDataset backed by HTTP range reads against the Hub, so you can iterate FineWeb's 15 trillion tokens without ever materializing the 30+ TB locally. Compose .shuffle(buffer_size=...), .filter(...), and .map(tokenizer, batched=True) for an end-to-end streaming pretraining loader, then split by rank with .shard(num_shards=world_size, index=rank).
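# pip install -U datasets transformers
import os

from datasets import load_dataset
from transformers import AutoTokenizer

# Rank and world size as exported by torchrun; default to a single process.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Stream the 100B-token sample of FineWeb-Edu without downloading the corpus.
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-100BT",
    split="train",
    streaming=True,
)

# Gated repository: requires accepting the license and a Hugging Face token.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")


def tokenize(batch):
    # Truncate each document to the model's context length.
    return tok(batch["text"], truncation=True, max_length=8192)


ds = (
    ds.shuffle(seed=42, buffer_size=10_000)
    .map(tokenize, batched=True, remove_columns=["text"])
    .shard(num_shards=world_size, index=rank)
)

for example in ds:
    input_ids = example["input_ids"]
    # feed into your trainer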
61.3.7 Mapping the data and benchmark stack
61.3.8 Eval suites for production training decisions
The standard practice in 2026 production training is to maintain three layers of evaluation:
- Layer 1: streaming / loss-based metrics. Validation loss on held-out shards of the pretraining corpus, gradient norm, learning rate, throughput. Logged every step to Weights and Biases or MLflow.
- Layer 2: checkpoint evaluations. Run a fast benchmark suite (subset of lm-eval-harness: HellaSwag, ARC, MMLU-low-shot, PIQA, WinoGrande) at every checkpoint. Designed to catch training divergence and to show capability emergence over training.
- Layer 3: release evaluations. Full lm-eval-harness, HELM, custom internal benchmarks, contamination-checked held-out sets. Run only at release candidates.
The cost gradient is roughly 10:100:1000: cheap streaming, moderate checkpoint, expensive release. Production teams budget compute for each layer separately.
A surprising number of training runs are bugged because the held-out validation shard accidentally overlaps with training data, producing artificially low validation loss that masks overfitting. The defensive practice: deterministically partition the dataset by document hash into train / val / test before any training, and never mix the partitions afterwards; keep the val set small enough to evaluate frequently but never use it to make hyperparameter decisions (use a separate dev set for that); use test only at the end. This is dataset hygiene 101, but at the scale of trillion-token corpora distributed across many shards on multiple object stores, it is also where bugs hide.
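The hash-based partitioning described above can be sketched as follows. The split fractions, the choice of SHA-256, and the `assign_split` name are illustrative assumptions; the essential property is that the split depends only on the document's content, so the assignment is identical on every shard, machine, and rerun.

```python
import hashlib


def assign_split(
    document_text: str,
    val_bp: int = 10,
    dev_bp: int = 10,
    test_bp: int = 10,
) -> str:
    """Map a document to train/val/dev/test using a content hash.

    Fractions are given in basis points (1 bp = 0.01% of documents).
    """
    digest = hashlib.sha256(document_text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    if bucket < val_bp:
        return "val"
    if bucket < val_bp + dev_bp:
        return "dev"
    if bucket < val_bp + dev_bp + test_bp:
        return "test"
    return "train"
```

Hashing the text rather than a URL or shard index means exact duplicates of a document always land in the same split, which removes one common source of train / val leakage. Near-duplicates still need the deduplication pass to run before partitioning.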
61.3.9 Dataset licensing and attribution
Dataset licensing has become much more important in 2024-2026 with the rise of training-data litigation (NYT v OpenAI, Bartz v Anthropic, the various class actions). The practical 2026 landscape:
- Permissively-licensed corpora (FineWeb under ODC-By, RedPajama-v2 under per-source licenses, The Stack v2 under permissive code licenses) are the safest starting point.
- Web-scraped corpora without explicit license (raw CommonCrawl, LAION-5B) are legally murky and often subject to opt-outs (LAION has takedown mechanisms; CommonCrawl respects robots.txt).
- Books and copyrighted content (Books3, the LibGen-derived datasets) are the highest-risk corpora in 2026; major closed labs have moved away from training on them publicly.
- Synthetically-generated data (Alpaca-style self-instruct, ShareGPT-derived) carries risk from the upstream API's terms of service.
Document your dataset provenance and licensing decisions; the legal landscape is evolving and the record-keeping is now part of the engineering work.
61.3.10 Scaling laws and data-budget planning
The training-data side intersects with scaling laws in concrete ways. The Chinchilla scaling law (Hoffmann et al., 2022) found that compute-optimal training balances parameters and tokens at roughly 1:20 (a 70B model is compute-optimal at ~1.4T tokens). Subsequent work has shown that overtraining past Chinchilla-optimal continues to improve performance, with diminishing returns per FLOP. The 2024-2026 frontier models train substantially past Chinchilla-optimal: Llama-3.1 405B at ~15T tokens is ~37 tokens per parameter; Llama-3.1 8B at 15T is ~1875 tokens per parameter. The reasoning is that at deployment, cost scales with parameter count rather than with the number of training tokens (a 405B model is expensive to serve, regardless of how it was trained), so training a smaller model for longer is a cheap insurance policy.
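The tokens-per-parameter figures above are simple ratios, and it can help to keep them in a small helper when planning a data budget. The sketch below assumes the 20 tokens-per-parameter rule of thumb quoted in the text; the function names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # rule of thumb quoted in the text


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Token budget at the ~20 tokens/parameter compute-optimal point."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """How many times past the compute-optimal token budget a run goes."""
    return tokens_per_param(n_params, n_tokens) / CHINCHILLA_TOKENS_PER_PARAM


ratio_405b = tokens_per_param(405e9, 15e12)   # about 37 tokens per parameter
ratio_8b = tokens_per_param(8e9, 15e12)       # about 1875 tokens per parameter
optimal_70b = chinchilla_optimal_tokens(70e9)  # about 1.4e12 tokens
```

By this measure the 8B run in the text is trained roughly 94 times past the compute-optimal budget, which is the sense in which small deployment-oriented models are "overtrained" on purpose.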
This affects data-curation requirements: at 15T+ token budgets, even FineWeb (15T tokens) and RedPajama-v2 (30T tokens) are not inexhaustible. Repeating tokens (epoching) is increasingly common; the available evidence (Muennighoff et al., 2023; subsequent FineWeb ablations) suggests that up to 4-8 epochs of high-quality data outperform 1 epoch of mixed-quality data at the same compute. The practical implication: curate aggressively, then repeat the curated corpus, rather than diluting with low-quality data to reach a single-epoch token budget.
What does a "modest" 70B pretrain actually cost in 2026? The standard approximation $C \approx 6 P T$ (Kaplan et al.) for a dense transformer with $P = 70 \times 10^9$ parameters and $T = 1 \times 10^{12}$ tokens gives $C \approx 4.2 \times 10^{23}$ FLOPs. Run on H100 SXM5 at $\pi_{\text{peak}} = 989$ TFLOPS BF16 and an achievable $\mathrm{MFU} = 0.45$, the total GPU time is $C / (\pi_{\text{peak}} \cdot \mathrm{MFU}) = 4.2 \times 10^{23} / (989 \times 10^{12} \cdot 0.45) \approx 9.44 \times 10^{8}$ GPU-seconds, or $\approx 262{,}000$ GPU-hours. At the 2026 specialized-cloud rate of $2.40 \,\text{USD}/\text{GPU-hour}$ (CoreWeave / Lambda reserved tier) the headline compute bill is $262{,}000 \times 2.40 \approx 629{,}000$ USD.
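The headline arithmetic above is easy to reproduce and to rerun with your own assumptions. The sketch below plugs in the same figures as the text (70B parameters, 1T tokens, 989 TFLOPS peak, 45% MFU, 2.40 USD per GPU-hour); the function name is hypothetical and every input is an assumption to be replaced with your own numbers.

```python
def pretrain_compute_cost(
    n_params: float,
    n_tokens: float,
    peak_flops_per_gpu: float,
    mfu: float,
    usd_per_gpu_hour: float,
) -> dict:
    """Estimate FLOPs, GPU-hours, and compute cost for a dense transformer."""
    total_flops = 6.0 * n_params * n_tokens  # C ~ 6 P T
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    gpu_hours = gpu_seconds / 3600.0
    return {
        "total_flops": total_flops,
        "gpu_hours": gpu_hours,
        "compute_usd": gpu_hours * usd_per_gpu_hour,
    }


estimate = pretrain_compute_cost(
    n_params=70e9,
    n_tokens=1e12,
    peak_flops_per_gpu=989e12,
    mfu=0.45,
    usd_per_gpu_hour=2.40,
)
```

Note how sensitive the result is to MFU: the same run at 0.30 instead of 0.45 costs 50% more, which is why the profiling tools earlier in this section pay for themselves.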
The headline understates the real cost. Add: (a) data preparation, $\sim \$50$-$150$K for tokenization, dedup, filter, and shuffle of 1T+ tokens on a CPU cluster; (b) ablations and burn-in runs, typically 20-40% of the production-run compute spent on small-scale recipes and 1-hour sanity passes, so $\$130$-$250$K; (c) fault-recovery overhead, even at 90% useful-time (10% checkpoint / restart / failure), is roughly another 10% on the headline, $\$63$K; (d) storage and egress, 6-10 TB of checkpoints + 30-60 TB of dataset at FSx-for-Lustre and S3 rates, $\$30$-$80$K over the run; (e) engineering time, 2-4 senior engineers for 2-3 months at fully-loaded $\$50$K/month, $\$200$-$600$K. Summing the low and high ends of these line items with the headline, the all-in landed cost of a "\$629K compute" pretrain is roughly $\$1.1$-$1.8$M in 2026, which is why "we'll train our own 70B" still requires a venture-capital-class budget rather than a research-grant-class one. Doubling to a 2T-token run roughly doubles compute and storage but barely moves engineering, so the elasticity is more favorable at larger scales.
61.3.11 Canary strings, decontamination, and the test-set integrity problem
A standard 2024-2026 practice for detecting contamination is the canary string: a unique random token sequence embedded in a benchmark's data files that should never appear in pretraining data. The BIG-Bench canary (a specific GUID string) is one example; the lm-evaluation-harness benchmarks increasingly include canaries. The protocol: at evaluation time, the team checks whether the trained model can verbatim-complete the canary; if it can, the benchmark is presumed contaminated and the result is reported with a contamination flag.
The 2024-2026 contamination-checking tools:
- Decon and similar contamination checkers: tools that scan a pretraining corpus for benchmark-test-set n-grams and report the contamination rate. The standard practice is to run decontamination as a filter step in dataset preparation.
- n-gram overlap reports in dataset cards: FineWeb, FineWeb-Edu, and RedPajama-v2 publish per-benchmark contamination rates. The 2026 practice is to require this disclosure.
- Held-out post-cutoff benchmarks: benchmarks whose questions were created after the pretraining-data cutoff (e.g., LiveBench, which refreshes its questions on a rolling basis) are contamination-resistant by construction for any model whose cutoff precedes them. The 2025-2026 leaderboards increasingly emphasize these.
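The n-gram scan and the canary check described above can be sketched in a few lines. This is an illustrative sketch rather than the logic of any particular tool: the window size, the word-level tokenization, and the helper names are assumptions (13-word windows are one commonly cited choice), and the canary value shown in the usage is a made-up placeholder, not a real benchmark canary.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_examples: list, n: int = 13) -> set:
    """Union of all n-grams appearing in the benchmark's test examples."""
    index = set()
    for example in benchmark_examples:
        index |= word_ngrams(example, n)
    return index


def is_contaminated(document: str, index: set, n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the benchmark."""
    return not word_ngrams(document, n).isdisjoint(index)


def contains_canary(document: str, canary: str) -> bool:
    """True if the benchmark's canary string appears verbatim."""
    return canary in document
```

As a filter step, documents for which either check returns true are dropped or quarantined, and the fraction removed per benchmark is what a dataset card would report as its contamination rate.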
61.3.12 Data cards and provenance documentation
The 2018 "Datasheets for Datasets" paper (Gebru et al.) introduced the data-card concept; the 2024-2026 practice has matured into a near-standard documentation format. The fields that production teams now expect in a data card:
- Source URLs and crawl date ranges: which CommonCrawl snapshots, which web domains included or excluded.
- Filtering procedure: heuristic filters (e.g., n-gram repetition, sentence-length thresholds) and classifier-based filters (with the classifier model identified).
- Deduplication procedure: exact, fuzzy (MinHash with what parameters), document-level versus paragraph-level.
- Language detection: tagged languages and their thresholds.
- License posture: per-source licensing and the aggregate dataset license.
- Decontamination posture: which benchmarks were used for contamination filtering and the residual contamination rate.
- Known issues: documented quality issues, sampling biases, or content concerns.
- Recommended use cases and out-of-scope uses: e.g., "intended for research; commercial use requires per-source license review."
FineWeb, RedPajama-v2, and The Stack v2 ship exemplary data cards; many older corpora (raw CommonCrawl, early LAION) do not, which complicates downstream auditing. Document the provenance of your dataset choices in your training writeup.
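One lightweight way to enforce the field list above is to treat the data card as a typed record and refuse to publish a dataset while fields are empty. The sketch below is a hypothetical schema that mirrors the fields named in this section; it is not a standard format, and the field names are illustrative.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DataCard:
    """Minimal data-card record mirroring the fields listed in the text."""
    sources: List[str] = field(default_factory=list)
    crawl_date_range: str = ""
    filtering: str = ""
    deduplication: str = ""
    language_detection: str = ""
    license_posture: str = ""
    decontamination: str = ""
    known_issues: List[str] = field(default_factory=list)
    intended_use: str = ""


def missing_fields(card: DataCard) -> List[str]:
    """Names of fields that are still empty and need documentation."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


draft = DataCard(
    sources=["CommonCrawl snapshots (list them explicitly)"],
    filtering="heuristic filters plus a quality classifier",
)
todo = missing_fields(draft)
```

A check like this can run in CI for the dataset repository. If a dataset genuinely has no known issues, record that explicitly (for example, "none known as of release") rather than leaving the list empty.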
61.3.13 Dataset evaluation checklist
When evaluating a corpus for use in training, the questions that surface real differences:
- What is the deduplication procedure? Exact deduplication, MinHash near-dedup, document-level versus paragraph-level?
- What is the quality filtering? Heuristic, classifier-based (which classifier?), or rule-based?
- What is the language detection? Single-language only, multilingual with language tags?
- What is the documented contamination check? Has the dataset been checked against MMLU / GSM8K / HumanEval / etc.?
- What is the licensing posture? Permissively-licensed, derived from licensed sources, or unlicensed web data?
- What is the temporal coverage? Up to when? Post-2024 data has different contamination risk than pre-2024.
- What is the format and on-disk size? Parquet, JSONL, WARC? Total size compressed and uncompressed?
- What is the documentation depth? Datasheet, dataset card, accompanying paper with ablations?
- How does it compose with other datasets? Does it overlap significantly with other corpora you might mix? Need cross-corpus deduplication?
What's Next?
In the next section, Section 61.4: Models, we build on the material covered here.