Pretraining a language model requires vast quantities of text. The datasets below represent the most widely used and studied sources for LLM pretraining. Each has different strengths, filtering methodologies, and licensing terms.
Common Crawl
| Description | A nonprofit organization that crawls the web monthly and freely distributes the archive. Common Crawl is the upstream source for most pretraining datasets; others (C4, The Pile, FineWeb, RedPajama) are curated subsets or derivatives of it. |
| Size | Petabytes of raw HTML; each monthly crawl captures ~3 billion web pages |
| Format | WARC (Web ARChive) files containing raw HTML, plus extracted text (WET files) |
| Access | Free download from commoncrawl.org (hosted on AWS S3 as a public dataset) |
| License | Open. The crawl data itself is freely available; individual web pages retain their original copyright. |
| Limitations | Raw data contains significant noise: boilerplate HTML, navigation menus, spam, adult content, and duplicated text. Extensive filtering is required before use in training. |
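The filtering that raw Common Crawl requires can be sketched with simple heuristics in the spirit of C4/Gopher-style rules. The function name and thresholds below are illustrative assumptions, not the published values:

```python
def looks_like_content(text: str,
                       min_words: int = 50,
                       min_mean_word_len: float = 3.0,
                       max_symbol_ratio: float = 0.1) -> bool:
    """Illustrative quality filter: reject short, boilerplate, or markup-heavy text."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short: likely a menu, caption, or fragment
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len < min_mean_word_len:
        return False  # many tiny tokens: often navigation boilerplate
    symbols = sum(1 for c in text if c in "#{}|<>")
    if symbols / len(text) > max_symbol_ratio:
        return False  # leftover HTML or markup debris
    return True

nav = "Home | About | Contact | Login | " * 20
prose = ("Common Crawl archives raw web pages every month, and most "
         "pretraining corpora begin by filtering this archive. ") * 10
print(looks_like_content(nav), looks_like_content(prose))  # False True
```

Real pipelines layer many more signals (language identification, perplexity under a reference model, n-gram repetition), but the shape is the same: cheap per-document checks applied at crawl scale.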
The Pile
| Description | A curated 825 GB English text dataset created by EleutherAI, composed of 22 diverse sub-datasets including books, academic papers (PubMed, ArXiv), code (GitHub), legal text (FreeLaw), and web text (Pile-CC). Designed for diversity rather than pure scale. |
| Size | 825 GB (approximately 300 billion tokens) |
| Format | JSONL (one document per line, with text and metadata) |
| Access | Hugging Face Hub (EleutherAI/pile), also mirrored on The Eye |
| License | Mixed. Component datasets have varying licenses; some sub-datasets (notably Books3) have been legally challenged. |
| Limitations | The Books3 subset was removed from public distribution after copyright disputes. Some sub-datasets may contain personal information. Not updated since 2021. |
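The JSONL layout is straightforward to process line by line. A minimal sketch, assuming The Pile's convention of a `text` field plus a `meta` object recording the source sub-dataset:

```python
import io
import json

# Two example lines in JSONL layout: one JSON object per document.
raw = io.StringIO(
    '{"text": "Thermodynamics lecture notes ...", "meta": {"pile_set_name": "ArXiv"}}\n'
    '{"text": "def parse(tokens): ...", "meta": {"pile_set_name": "Github"}}\n'
)

docs = [json.loads(line) for line in raw]
for doc in docs:
    source = doc["meta"]["pile_set_name"]  # which of the 22 sub-datasets
    print(f"{source}: {doc['text'][:30]}")
```

Because each document is a self-contained line, the format streams well and is easy to shard across workers.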
RedPajama v2
| Description | A large-scale, open dataset created by Together AI to support transparent LLM development. Includes over 100 billion text documents with quality signals and pre-computed metadata for filtering. |
| Size | ~30 trillion tokens (raw); quality-filtered subsets are smaller |
| Format | JSONL with quality annotations and metadata |
| Access | Hugging Face Hub (togethercomputer/RedPajama-Data-V2) |
| License | Open. Apache 2.0 for the dataset tooling; underlying content retains its original copyright. |
| Limitations | Quality filtering is left to the user, requiring careful selection of quality thresholds. Predominantly English, with limited multilingual coverage. |
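User-side filtering with pre-computed quality signals amounts to thresholding metadata rather than re-processing text. The signal names and thresholds below are illustrative assumptions, not RedPajama v2's actual schema:

```python
# Hypothetical documents annotated with pre-computed quality signals,
# in the spirit of RedPajama v2's metadata.
docs = [
    {"text": "A long, well-formed article ...",
     "signals": {"word_count": 1200, "perplexity": 180.0, "dup_fraction": 0.02}},
    {"text": "buy cheap buy cheap buy cheap",
     "signals": {"word_count": 6, "perplexity": 2400.0, "dup_fraction": 0.83}},
]

def passes(sig: dict) -> bool:
    return (sig["word_count"] >= 100        # long enough to be a real document
            and sig["perplexity"] < 1000.0  # fluent under a reference LM
            and sig["dup_fraction"] < 0.3)  # not mostly repeated n-grams

kept = [d for d in docs if passes(d["signals"])]
print(len(kept))  # → 1
```

The appeal of this design is that the expensive annotation pass runs once; every team can then explore different thresholds cheaply.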
FineWeb / FineWeb-Edu
| Description | A high-quality web dataset created by Hugging Face, derived from Common Crawl with extensive deduplication and quality filtering. FineWeb-Edu is a subset filtered for educational content using a classifier, yielding exceptionally high-quality training data. |
| Size | FineWeb: 15 trillion tokens; FineWeb-Edu: 1.3 trillion tokens |
| Format | Parquet files with text and metadata |
| Access | Hugging Face Hub (HuggingFaceFW/fineweb, HuggingFaceFW/fineweb-edu) |
| License | Open (ODC-By 1.0) |
| Limitations | Primarily English. The educational classifier may introduce biases in what counts as "educational" content. Still derived from web crawls, so some noise persists. |
DCLM (DataComp for Language Models)
| Description | A benchmark and dataset initiative that treats data curation as a systematic competition. Teams propose filtering strategies for a fixed pool of Common Crawl data, and the resulting models are evaluated to find optimal data recipes. |
| Size | Pool: ~240 trillion tokens from Common Crawl; curated baselines range from 2 to 4 trillion tokens |
| Format | Parquet files with text and metadata |
| Access | Hugging Face Hub (mlfoundations/dclm-pool, mlfoundations/dclm-baseline) |
| License | Open. Component data is released under various open licenses. |
| Limitations | Focused on English text. The competition framework evaluates data quality only through downstream model performance on standard benchmarks, which may not capture all quality dimensions. |
The Trend Toward Data Quality
The trajectory from Common Crawl to FineWeb-Edu reflects a major shift in the field: the realization that data quality matters more than data quantity. Chinchilla scaling laws (Section 6.3) showed that more tokens help, but subsequent work demonstrated that a smaller set of carefully filtered tokens can outperform a larger noisy set. Modern pretraining recipes invest heavily in deduplication, quality classification, and content filtering.
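Of the three steps named above, deduplication is the simplest to illustrate. A minimal sketch of the exact-match case, hashing whitespace- and case-normalized text (production pipelines add near-duplicate methods such as MinHash):

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text, keeping first occurrences."""
    seen = set()
    unique = []
    for text in docs:
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

corpus = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup_exact(corpus))  # → ['The cat sat.', 'A different document.']
```

Hashing keeps memory bounded to one digest per unique document, which is what makes this workable at web scale.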
Datasets in Practice
Load and stream a large pretraining dataset from the Hugging Face Hub.
```python
# pip install datasets
from datasets import load_dataset

# Stream FineWeb-Edu to avoid downloading the full 1.3T tokens
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    split="train",
    streaming=True,
)

sample = next(iter(ds))
print(sample["text"][:300])
print(f"Keys: {list(sample.keys())}")
```
Code Fragment 1: Streaming the FineWeb-Edu dataset with streaming=True to avoid downloading the full 1.3 trillion tokens. The iterator yields one example at a time, making it feasible to inspect massive datasets on any machine.