Section 6.4: Data Curation at Scale

The internet is humanity's attic: priceless treasures buried under mountains of junk. Data curation is the art of sorting through it all without accidentally teaching your model to sell cryptocurrency.
Scale, Dumpster Diving AI Agent

Prerequisites

This section is largely self-contained. Familiarity with tokenization from Chapter 02 helps for understanding token counts. The discussion of perplexity-based filtering assumes basic language model concepts from Section 1.1.

Big Picture

Data is the foundation of every LLM. While architecture and scaling get the headlines, data quality is often the single largest determinant of model quality. A well-curated 1T token dataset will produce a better model than a poorly filtered 10T token dataset. This section covers the full data curation pipeline: where the data comes from, how duplicates are removed, how quality is assessed, how domain proportions are balanced, and how to handle toxic or private content at web scale. As the scaling laws in Section 6.3 showed, data volume and model size must grow together for optimal training.

1. Pre-training Data Sources

Modern LLMs are trained on a mixture of data drawn from several broad categories. The composition of this mixture profoundly influences the model's strengths and weaknesses.

1. Pre-training Data Sources Comparison

Source	Scale	Quality	Key Datasets
Web crawl	Trillions of tokens	Variable (noisy)	Common Crawl, FineWeb, DCLM
Books	Tens of billions	High	Books3, Project Gutenberg
Code	Hundreds of billions	Medium to high	The Stack, GitHub
Scientific papers	Tens of billions	High	Semantic Scholar, arXiv
Wikipedia	~4B tokens (English)	Very high	Wikipedia dumps
Curated web	Hundreds of billions	High (filtered)	RedPajama, Dolma

Common Crawl is by far the largest publicly available source, containing petabytes of raw HTML from billions of web pages collected since 2008. However, raw Common Crawl is overwhelmingly low quality: advertisements, boilerplate navigation text, spam, pornography, and machine-generated content dominate. Turning this raw crawl into useful training data requires a sophisticated curation pipeline.

Deduplication Saving a Training Run from Memorization

Who: A pre-training team at an AI research lab preparing a 500B token dataset for a 3B parameter model.

Situation: The team assembled data from Common Crawl snapshots spanning 2019 to 2024, along with curated book and code corpora.

Problem: Early training runs showed the model memorizing specific passages verbatim. Evaluation revealed that certain popular web pages (Wikipedia mirrors, Stack Overflow answers, news articles syndicated across hundreds of sites) appeared dozens of times in the dataset.

Dilemma: Aggressive deduplication (exact match) would miss near-duplicates with minor formatting differences. MinHash-based fuzzy deduplication was more thorough but computationally expensive, requiring 2,000 CPU-hours for the full corpus.

Decision: They implemented a two-stage deduplication pipeline: exact URL deduplication first (cheap, removed 18% of documents), followed by MinHash near-deduplication with a Jaccard threshold of 0.8 (removed an additional 22%).

How: They used a distributed MinHash implementation with 128 hash functions across a 64-node Spark cluster, processing the entire corpus in under 8 hours.

Result: The deduplicated dataset was 40% smaller but produced a model with 6% lower perplexity and dramatically reduced memorization (extractable memorized sequences dropped by 73%). Training was also faster due to the smaller dataset.

Lesson: Deduplication is not just a data hygiene step; it directly improves model quality by preventing overfitting to repeated content and encouraging the model to learn generalizable patterns instead.

A clone detector scanning documents to find and remove duplicate or near-duplicate text passages — **Figure 6.4.1**: The deduplication stage: finding and eliminating the clones lurking in your training data before they bias your model.

2. The Data Curation Pipeline

Gold panning metaphor for data curation, sifting through raw web data to find high-quality training examples — **Figure 6.4.2**: Data curation is digital gold panning: you sift through mountains of raw web text to find the nuggets worth training on.

Fun Fact

The Common Crawl dataset contains over 250 billion web pages, which after deduplication, quality filtering, and toxicity removal, typically shrinks to about 1 to 5% of its original size. Training data curation is less "finding a needle in a haystack" and more "the haystack is 95% needles you do not want."

Data curation pipeline: progressive filtering from raw web crawl to training corpus

Figure 6.4.3: A typical data curation pipeline showing the progressive filtering from raw web crawl to final training corpus.

3. Text Extraction

Flowchart of a data curation pipeline showing stages from raw crawl data through filtering, deduplication, and quality scoring — **Figure 6.4.4**: The data curation pipeline in action: raw web crawls enter one end and (hopefully) clean, high-quality training data emerges from the other.

Raw HTML must be converted to clean text before any further processing (similar extraction challenges arise when building retrieval pipelines, as discussed in Section 19.2). This is harder than it sounds. Web pages contain navigation menus, sidebars, advertisements, cookie banners, JavaScript artifacts, and boilerplate footers that vastly outnumber the actual content. Tools like trafilatura and resiliparse use structural heuristics to identify the main content block and strip everything else. The FineWeb project demonstrated that switching from the simple jusText extractor to trafilatura produced measurably better downstream performance.

4. Deduplication

Web crawl data is extraordinarily redundant. The same news article, boilerplate legal text, or copied content may appear thousands of times. Duplicate data wastes training compute, biases the model toward overrepresented content, and increases memorization risk. Deduplication operates at three levels of granularity.

Exact Deduplication

The simplest approach: compute a hash (MD5, SHA-256) of each document and discard exact duplicates. This is fast but misses near-duplicates that differ by a few characters (timestamps, bylines, formatting).

Near-Duplicate Detection with MinHash

MinHash with Locality-Sensitive Hashing (LSH) is the standard technique for finding near-duplicate documents at scale. The core idea: represent each document as a set of n-grams, compute a compact signature (MinHash), and use LSH to efficiently find documents with high Jaccard similarity. Two documents are considered near-duplicates if their Jaccard similarity exceeds a threshold (typically 0.8):

J(A, B) = |A \cap B| / |A \cup B|

where $A$ and $B$ are the n-gram sets of two documents. The MinHash technique approximates this similarity in O(k) time using $k$ hash functions, rather than computing the exact set intersection. Code Fragment 6.4.1 below puts this into practice.


# MinHash + LSH deduplication: hash character n-grams into signatures,
# bucket similar documents together, and flag near-duplicate pairs.
import hashlib
from collections import defaultdict

def get_ngrams(text, n=5):
    """Extract character n-grams from text."""
    words = text.lower().split()
    return set(
        " ".join(words[i:i+n])
        for i in range(len(words) - n + 1)
    )

def minhash_signature(ngrams, num_hashes=128):
    """Compute MinHash signature for a set of n-grams."""
    signature = []
    for i in range(num_hashes):
        min_hash = float('inf')
        for ngram in ngrams:
            # Hash with a different seed for each hash function
            h = int(hashlib.sha256(
                f"{i}:{ngram}".encode()
            ).hexdigest(), 16)
            min_hash = min(min_hash, h)
        signature.append(min_hash)
    return signature

def lsh_buckets(signature, bands=16):
    """Split signature into bands for LSH bucketing."""
    rows_per_band = len(signature) // bands
    buckets = []
    for b in range(bands):
        start = b * rows_per_band
        band_hash = hash(tuple(signature[start:start + rows_per_band]))
        buckets.append((b, band_hash))
    return buckets

# Example: find near-duplicates
docs = [
    "The quick brown fox jumps over the lazy dog in the park",
    "The quick brown fox jumps over a lazy dog in the park",  # near-dup
    "Machine learning models require large amounts of data",
]

bucket_index = defaultdict(list)
for doc_id, doc in enumerate(docs):
    ngrams = get_ngrams(doc, n=3)
    sig = minhash_signature(ngrams, num_hashes=64)
    for bucket in lsh_buckets(sig, bands=8):
        bucket_index[bucket].append(doc_id)

# Find candidate pairs that share a bucket
candidates = set()
for docs_in_bucket in bucket_index.values():
    if len(docs_in_bucket) > 1:
        for i in range(len(docs_in_bucket)):
            for j in range(i+1, len(docs_in_bucket)):
                candidates.add((docs_in_bucket[i], docs_in_bucket[j]))
print(f"Near-duplicate candidates: {candidates}")

Near-duplicate candidates: {(0, 1)}

Code Fragment 6.4.1: Hash with a different seed for each hash function.

Substring-Level Deduplication

Document-level deduplication misses repeated paragraphs that appear across otherwise unique documents (e.g., license headers, terms of service). Substring deduplication uses suffix arrays to find repeated sequences of n or more tokens that appear in multiple documents, then removes all but one occurrence. The RefinedWeb and DCLM datasets demonstrated that substring deduplication consistently improves model quality.

5. Quality Filtering

Key Insight

97% of the internet is not worth training on, and quality filtering is the highest-leverage intervention. A typical web crawl starts at 100+ TB of raw HTML. After deduplication, quality filtering, and domain mixing, the final training dataset is often 3 TB or less. The FineWeb project showed that aggressive quality filtering on Common Crawl can match the performance of curated datasets like C4 and The Pile, despite starting from much noisier raw material. DCLM demonstrated that a well-trained quality classifier can improve benchmark performance by several percentage points over heuristic filtering alone.

After deduplication, the remaining text still varies enormously in quality. Quality filtering separates informative, well-written content from spam, gibberish, and low-effort text. Three complementary strategies are commonly used together.

Heuristic Filters

Rule-based filters are fast and interpretable. Common heuristics include removing documents that are too short (under 100 words), have excessive punctuation or capitalization ratios, contain too many URLs or special characters, have an abnormally low alphabetic character ratio, or have too many repeated lines or paragraphs.

Perplexity-Based Filtering

A small language model (often a KenLM n-gram model trained on Wikipedia) is used to score each document's perplexity. Documents with very high perplexity (incoherent text) or very low perplexity (repetitive boilerplate) are discarded. The CCNet pipeline introduced this approach and demonstrated significant quality improvements.

Classifier-Based Filtering

A binary classifier is trained to distinguish "high-quality" text (e.g., Wikipedia, books) from "low-quality" text (random web samples). The FineWeb-Edu dataset used a quality classifier trained on educational content annotations to produce a subset of FineWeb specifically optimized for knowledge-intensive tasks. Chapter 12 explores how synthetic data generation can complement these curation pipelines when natural data is scarce. DCLM used a fastText classifier trained on references from Wikipedia and OpenWebText to score documents on a quality scale. Code Fragment 6.4.2 below puts this into practice.

# Minimal quality filtering pipeline
import re
from collections import Counter

def heuristic_quality_filter(doc: str) -> dict:
    """Apply heuristic quality filters to a document."""
    words = doc.split()
    lines = doc.strip().split("\n")
    chars = len(doc)
    word_count = len(words)

    # Length check
    if word_count < 50:
        return {"pass": False, "reason": "too_short"}

    # Alphabetic character ratio
    alpha_ratio = sum(c.isalpha() for c in doc) / max(chars, 1)
    if alpha_ratio < 0.6:
        return {"pass": False, "reason": "low_alpha"}

    # Repeated line ratio (boilerplate detection)
    line_counts = Counter(lines)
    repeated = sum(c - 1 for c in line_counts.values() if c > 1)
    if repeated / max(len(lines), 1) > 0.3:
        return {"pass": False, "reason": "repetitive"}

    # URL density (spam detection)
    url_count = len(re.findall(r"https?://", doc))
    if url_count / max(word_count, 1) > 0.1:
        return {"pass": False, "reason": "url_heavy"}

    return {"pass": True, "words": word_count, "alpha": round(alpha_ratio, 3)}

# Test on sample documents
samples = [
    ("Good article", "The transformer architecture has revolutionized NLP by enabling parallel attention. " * 8),
    ("Too short", "Click here now"),
    ("Spam", "Visit https://a.com and https://b.com and https://c.com now " * 5),
    ("Repetitive", ("Buy now!\n" * 20) + "Some padding text to reach minimum length for a valid document check."),
]
for label, doc in samples:
    result = heuristic_quality_filter(doc)
    print(f"{label:>15}: {result}")

Good article: {'pass': True, 'words': 72, 'alpha': 0.901} Too short: {'pass': False, 'reason': 'too_short'} Spam: {'pass': False, 'reason': 'url_heavy'} Repetitive: {'pass': False, 'reason': 'repetitive'}

Code Fragment 6.4.2: Minimal quality filtering pipeline.

Note: FineWeb (Penedo et al., 2024)

The FineWeb dataset from Hugging Face represents the state of the art in open data curation. Starting from 96 Common Crawl snapshots (over 100 TB of raw HTML), the team applied URL filtering, language identification, MinHash deduplication, and quality scoring to produce a 15 trillion token English corpus. Its educational subset, FineWeb-Edu, further classifies documents by educational value using a classifier trained on LLM annotations. Models trained on FineWeb-Edu outperform those trained on full FineWeb by 2 to 4 points on knowledge benchmarks.

Practical Example: HuggingFace datasets for Pre-training Data

The HuggingFace datasets library (pip install datasets) provides streaming access to terabyte-scale pre-training corpora without downloading everything to disk. This is essential when working with datasets like FineWeb that exceed local storage capacity.

# pip install datasets
from datasets import load_dataset

# Stream FineWeb-Edu (15T tokens) without downloading the full dataset
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",           # 10B-token sample for experimentation
    split="train",
    streaming=True,               # essential for large corpora
)

# Inspect a few documents and their quality scores
for i, example in enumerate(ds):
    print(f"Score: {example['score']:.1f} | "
          f"Length: {len(example['text']):,} chars | "
          f"URL: {example['url'][:60]}")
    if i >= 4:
        break

# Filter for high-quality educational content on the fly
high_quality = ds.filter(lambda x: x["score"] >= 4.0)

Code Fragment 6.4.8: pip install datasets

6. Data Mixing and Domain Proportions

The final training corpus is typically a weighted mixture of data from different domains. The proportions of this mixture significantly affect the model's capabilities. For instance, increasing the fraction of code data improves reasoning and structured output abilities. Increasing the fraction of scientific text improves factual knowledge.

The optimal mixing proportions are typically found through ablation experiments on smaller models. DoReMi (Xie et al., 2023) proposed an automated approach: train a small proxy model with uniform mixing, then use the distribution of training loss across domains to reweight the mixture, upsampling domains where the model struggles and downsampling domains that are already well-learned. Code Fragment 6.4.3 below puts this into practice.

# Simplified domain mixing with weighted sampling
import numpy as np

# Domain proportions (must sum to 1.0)
domain_weights = {
    "web":         0.55,
    "code":        0.15,
    "books":       0.10,
    "wikipedia":   0.05,
    "scientific":  0.08,
    "math":        0.04,
    "conversation": 0.03,
}

def sample_batch(domain_weights, batch_size=1024):
    """Sample a training batch according to domain proportions."""
    domains = list(domain_weights.keys())
    probs = list(domain_weights.values())
    # Each sample in the batch comes from a domain
    batch_domains = np.random.choice(domains, size=batch_size, p=probs)
    counts = {d: int((batch_domains == d).sum()) for d in domains}
    return counts

batch = sample_batch(domain_weights)
for domain, count in sorted(batch.items(), key=lambda x: -x[1]):
    bar = "#" * (count // 10)
    print(f"  {domain:<15} {count:4d} samples  {bar}")

web 563 samples ######################################################## code 154 samples ############### books 100 samples ########## scientific 82 samples ######## wikipedia 50 samples ##### math 33 samples ### conversation 42 samples ####

Code Fragment 6.4.3: Simplified domain mixing with weighted sampling.

7. Toxicity and PII Removal

Pre-training data must be filtered for toxic content (hate speech, explicit material, harassment) and personally identifiable information (PII) such as phone numbers, email addresses, and social security numbers. The broader ethical considerations of data sourcing and model safety are covered in Chapter 26. Toxicity classifiers like the Jigsaw Perspective API or custom fastText models flag documents above a toxicity threshold. PII removal typically uses regex patterns for structured identifiers combined with named entity recognition for names and addresses.

Warning: Safety vs. Capability

Overly aggressive toxicity filtering can remove legitimate content about sensitive topics (medical discussions, legal cases, historical events) and hurt the model's ability to understand and reason about these subjects. Most pipelines use a threshold rather than a binary filter, removing only the most toxic content while preserving borderline cases.

8. Data Pruning and Influence Functions

Beyond filtering for quality, recent research explores selecting the most informative training examples. Data pruning removes redundant or uninformative samples to train on a smaller, higher-quality subset without sacrificing performance.

Influence functions provide a principled approach: they estimate how much each training example contributes to the model's performance on a validation set. Formally, the influence of training example $z_{i}$ on the loss at validation point $z_{test}$ is:

I(z_{i}, z_{test}) = - \nabla _{ \theta }L(z_{test})^{T} H_{ \theta }^{-1} \nabla _{ \theta }L(z_{i})

where $H_{ \theta }$ is the Hessian of the training loss. Computing exact influence functions is prohibitively expensive for large models (the Hessian has $N^{2}$ entries), so practical approaches use approximations such as LiSSA (Linear time Stochastic Second-Order Algorithm) or track gradient statistics during training as proxies for influence.

Data Quality Over Quantity: The "Textbooks Are All You Need" Paradigm

The scaling laws from Kaplan et al. and Chinchilla established that more data generally leads to better models. But a parallel line of research has demonstrated that data quality can partially substitute for data quantity, producing surprisingly capable models at a fraction of the expected parameter count. The most striking evidence comes from Microsoft's Phi model family.

The original Phi-1 (Gunasekar et al., 2023) was a 1.3B parameter model trained on only 7B tokens of carefully curated "textbook quality" code data, plus 1B tokens of synthetically generated exercises. Despite its tiny scale, Phi-1 achieved 50.6% on HumanEval (a code generation benchmark), matching or exceeding models 10x its size that were trained on orders of magnitude more data. The key insight, articulated in the paper's title "Textbooks Are All You Need," was that a small amount of high-quality, pedagogically structured data could teach reasoning patterns more efficiently than massive volumes of noisy web scrapes.

Phi-1.5 extended the approach to general natural language, using a combination of curated web data and synthetic "textbook" content generated by GPT-3.5. The synthetic data was specifically designed to include step-by-step explanations, worked examples, and structured reasoning, the kind of content found in good textbooks rather than typical web pages. Phi-1.5 (1.3B parameters) achieved performance comparable to models 5 to 10 times its size on reasoning benchmarks.

The later Phi-3 (3.8B parameters) and Phi-4 (14B parameters) continued the philosophy at larger scales, combining heavily curated web data with synthetic data generated through increasingly sophisticated pipelines. Phi-3 Mini, at just 3.8B parameters, matched Mixtral 8x7B (46.7B total parameters) on several reasoning benchmarks. Phi-4 introduced "pivotal token search," a technique that identifies tokens where the model's prediction is most uncertain and uses those to generate targeted training examples that address specific knowledge gaps.

The Phi results have influenced how the broader community thinks about pre-training data. The FineWeb-Edu dataset from Hugging Face applies a similar philosophy at larger scale: it uses a classifier trained on educational content to score web pages, retaining only the most informative subset. Models trained on FineWeb-Edu consistently outperform those trained on the same volume of unfiltered data. The practical takeaway is that investing in data quality (through curation, filtering, or synthetic generation) often yields better returns than simply scaling data volume. This is especially relevant as the supply of high-quality natural text approaches its limits.

9. Synthetic Data for Pre-training

As high-quality natural text becomes scarce, synthetic data generated by LLMs themselves has become an increasingly important part of the pre-training pipeline. The key insight is that a capable model can generate high-quality training data for a less capable model (or even for future training stages of itself).

Microsoft's Phi-3 and Phi-4 models demonstrated this powerfully: they were trained substantially on synthetic "textbook-quality" data generated by larger models. Despite being small (3.8B parameters), Phi-3 rivaled models many times its size on reasoning benchmarks, largely because its training data was exceptionally high quality. The key was generating structured, educational content (explanations, worked examples, reasoning chains) rather than simply paraphrasing web text.

Quality control for synthetic data is critical. Without careful filtering, model-generated text can amplify biases, introduce subtle errors, or create distribution collapse (where the synthetic data converges to a narrow mode of the generating model). Effective strategies include:

Verification: Use a separate model or tool to check factual claims in generated text
Diversity enforcement: Vary prompts, topics, and generation parameters to prevent mode collapse
Human curation: Use human reviewers to audit a sample of synthetic data for quality
Mixing: Blend synthetic data with natural data rather than using it exclusively

Research Frontier: The Data Wall

Epoch AI projects that high-quality text data available on the public internet will be effectively exhausted by 2026 to 2028. Current large models already train on significant fractions of all available web text. This "data wall" is driving three parallel responses: (1) synthetic data generation at scale, (2) multimodal training that incorporates images, video, and audio alongside text, and (3) improved data efficiency through better curation and deduplication. Legal challenges around copyrighted training data add further pressure. How the field navigates this constraint will shape the next generation of LLMs.

Key Insight: Data Curation as a Form of Inductive Bias

Data curation is often treated as an engineering task, but it is actually a form of implicit inductive bias no less important than architectural choices. When you filter out low-quality web pages using a Wikipedia-trained classifier, you are encoding a prior about what "good text" looks like. When you set domain mixing ratios (60% web, 20% code, 10% academic, 10% books), you are defining the model's prior distribution over knowledge types. The Phi series demonstrated that aggressive curation bias toward high-quality educational text can substitute for an order of magnitude in model parameters. This connects to a fundamental result in learning theory: the no-free-lunch theorem guarantees that no learning algorithm outperforms all others across all possible distributions. Every model must make assumptions, and data curation is how those assumptions are encoded at the data level rather than the architecture level. The debate between "more data" and "better data" is ultimately a debate about where to invest in inductive bias.

10. Major Open Datasets

10. Major Open Datasets Comparison

Dataset	Size	Key Feature
The Pile	825 GB	22 diverse subsets, academic focus
RedPajama v2	30T tokens (raw)	Open reproduction of Llama data
FineWeb	15T tokens	Best open Common Crawl processing
FineWeb-Edu	1.3T tokens	Educational quality subset
DCLM	4T tokens	Classifier-curated, strong baselines
Dolma	3T tokens	Open, for OLMo model family
The Stack v2	67.5TB code	Permissively licensed source code

Self-Check

1. Why is deduplication critical for LLM training quality?

Show Answer

Duplicate data wastes training compute by showing the model the same content multiple times without new information. It also biases the model toward memorizing overrepresented text (especially templated content like legal boilerplate), increases the risk of verbatim memorization (a privacy and copyright concern), and distorts the effective domain mixture. Deduplication at both document and substring levels consistently improves downstream benchmark performance.

2. How does MinHash approximate Jaccard similarity?

Show Answer

MinHash exploits the mathematical property that the probability of two sets having the same minimum hash value under a random hash function equals their Jaccard similarity (|A ∩ B| / |A ∪ B|). By applying many independent hash functions and recording the minimum hash for each, you get a compact signature. The fraction of matching entries between two signatures estimates their Jaccard similarity. LSH then groups documents into buckets by banding the signature, so only documents with high similarity end up in the same bucket, avoiding the need for all-pairs comparison.

3. Explain the tradeoff between aggressive quality filtering and data diversity.

Show Answer

Aggressive quality filtering (e.g., keeping only Wikipedia-like text) produces very clean data but reduces diversity. The model may become excellent at formal, encyclopedic prose but struggle with informal text, dialogue, slang, technical jargon, or code. It can also introduce bias by favoring well-represented languages, topics, and perspectives. The best pipelines use tiered filtering: remove clearly low-quality content with heuristics, then apply softer quality scoring rather than hard thresholds, preserving a range of styles and topics.

4. Why are influence functions impractical to compute exactly for modern LLMs?

Show Answer

Influence functions require computing the inverse Hessian matrix of the training loss, which has dimensions N x N where N is the number of model parameters. For a 7B parameter model, this matrix would have 49 x 10¹⁸ entries, requiring exabytes of memory. Even the Hessian-vector products needed for iterative approximations are expensive at this scale. Practical alternatives use gradient similarity proxies, track loss statistics during training, or apply influence function approximations to smaller proxy models and transfer the findings.

Key Takeaways

Data quality dominates data quantity. Aggressive curation of a smaller dataset often outperforms naive scaling of a larger one.
Deduplication at document and substring levels is essential. MinHash with LSH is the standard for finding near-duplicates at scale.
Quality filtering combines heuristic rules, perplexity scoring, and trained classifiers. Classifier-based filtering (FineWeb-Edu, DCLM) provides the largest quality improvements.
Domain mixing proportions directly shape model capabilities. Increasing code and math data improves reasoning; over-representing web text dilutes specialized knowledge.
Toxicity and PII removal require careful calibration to avoid removing legitimate content about sensitive topics.
Data pruning and influence functions represent the frontier of data curation, selecting the most informative examples for training.

Research Frontier

Model-guided data selection. Traditional data curation relies on heuristic filters and human-crafted rules. The frontier is moving toward model-based curation, where a trained model scores and selects its own training data. DCLM (DataComp for Language Models, 2024) showed that training a simple classifier on quality examples can filter web data more effectively than any heuristic pipeline. Influence functions and data attribution methods (TRAK, Datamodels) take this further by identifying which training examples most affect specific model behaviors, enabling targeted data curation for capability development. Synthetic data generation (Section 12.1) is also emerging as a complement to web-scraped data, particularly for reasoning and instruction-following capabilities.

What's Next?

In the next section, Section 6.5: Optimizers and Training Dynamics, we examine the optimizers and training dynamics that make stable large-scale training possible.

References & Further Reading

Open Datasets & Curation Pipelines

Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv preprint arXiv:2406.17557.

Documents Hugging Face's FineWeb pipeline for extracting high-quality text from Common Crawl. Provides detailed ablations of filtering steps (language ID, quality scoring, deduplication) and their impact on downstream model performance.

📊 Dataset

Soldaini, L. et al. (2024). "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." ACL 2024.

Introduces the Dolma corpus with fully documented provenance, filtering decisions, and mixing ratios. A reference implementation for transparent, reproducible data curation at trillion-token scale.

📊 Dataset

Deduplication & Data Quality

Lee, K. et al. (2022). "Deduplicating Training Data Makes Language Models Better." ACL 2022.

Demonstrates that exact and near-duplicate removal from training data reduces memorization, lowers perplexity, and improves generation quality. Introduces scalable MinHash-based deduplication applicable to web-scale corpora.

📄 Paper

Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR 2023.

Measures how much training data language models memorize verbatim, finding that memorization scales with model size and data duplication. Essential reading for understanding privacy risks and the importance of deduplication.

📄 Paper

Data Mixing & Selection Strategies

Longpre, S. et al. (2024). "A Pretrainer's Guide to Training Data." arXiv preprint arXiv:2305.13169.

A comprehensive survey of data curation decisions including source selection, filtering, deduplication, and mixing. Provides practical guidance for practitioners building pre-training pipelines from scratch.

📄 Paper

Xie, S. M. et al. (2024). "DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining." NeurIPS 2024.

Uses a small proxy model to learn optimal domain mixing weights that minimize worst-case loss across domains. Shows that learned data mixtures can match the performance of hand-tuned ratios while being more principled and reproducible.

📄 Paper