Pretraining a language model requires vast quantities of text. The datasets below represent the most widely used and studied sources for LLM pretraining. Each has different strengths, filtering methodologies, and licensing terms.
Common Crawl
| Description | A nonprofit organization that crawls the web monthly and freely distributes the archive. Common Crawl is the upstream source for most pretraining datasets; others (C4, The Pile, FineWeb, RedPajama) are curated subsets or derivatives of it. |
| Size | Petabytes of raw HTML; each monthly crawl captures ~3 billion web pages |
| Format | WARC (Web ARChive) files containing raw HTML, plus extracted text (WET files) |
| Access | Free download from commoncrawl.org (hosted on AWS S3 as a public dataset) |
| License | Open. The crawl data itself is freely available; individual web pages retain their original copyright. |
| Limitations | Raw data contains significant noise: boilerplate HTML, navigation menus, spam, adult content, and duplicated text. Extensive filtering is required before use in training. |
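The filtering that raw Common Crawl requires can be sketched with simple heuristics in the spirit of C4/Gopher-style rules. The function name and thresholds below are illustrative assumptions, not the published values:

```python
def looks_like_content(text: str,
                       min_words: int = 50,
                       min_mean_word_len: float = 3.0,
                       max_symbol_ratio: float = 0.1) -> bool:
    """Illustrative quality filter: reject short, boilerplate, or markup-heavy text."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short: likely a menu, caption, or fragment
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len < min_mean_word_len:
        return False  # many tiny tokens: often navigation boilerplate
    symbols = sum(1 for c in text if c in "#{}|<>")
    if symbols / len(text) > max_symbol_ratio:
        return False  # leftover HTML or markup debris
    return True

nav = "Home | About | Contact | Login | " * 20
prose = ("Common Crawl archives raw web pages every month, and most "
         "pretraining corpora begin by filtering this archive. ") * 10
print(looks_like_content(nav), looks_like_content(prose))  # False True
```

Real pipelines layer many more signals (language identification, perplexity under a reference model, n-gram repetition), but the shape is the same: cheap per-document checks applied at crawl scale.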
The Pile
| Description | A curated 825 GB English text dataset created by EleutherAI, composed of 22 diverse sub-datasets including books, academic papers (PubMed, ArXiv), code (GitHub), legal text (FreeLaw), and web text (Pile-CC). Designed for diversity rather than pure scale. |
| Size | 825 GB (approximately 300 billion tokens) |
| Format | JSONL (one document per line, with text and metadata) |
| Access | Hugging Face Hub (EleutherAI/pile), also mirrored on The Eye |
| License | Mixed. Component datasets have varying licenses; some sub-datasets (notably Books3) have been legally challenged. |
| Limitations | The Books3 subset was removed from public distribution after copyright disputes. Some sub-datasets may contain personal information. Not updated since 2021. |
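The JSONL layout is straightforward to process line by line. A minimal sketch, assuming The Pile's convention of a `text` field plus a `meta` object recording the source sub-dataset:

```python
import io
import json

# Two example lines in JSONL layout: one JSON object per document.
raw = io.StringIO(
    '{"text": "Thermodynamics lecture notes ...", "meta": {"pile_set_name": "ArXiv"}}\n'
    '{"text": "def parse(tokens): ...", "meta": {"pile_set_name": "Github"}}\n'
)

docs = [json.loads(line) for line in raw]
for doc in docs:
    source = doc["meta"]["pile_set_name"]  # which of the 22 sub-datasets
    print(f"{source}: {doc['text'][:30]}")
```

Because each document is a self-contained line, the format streams well and is easy to shard across workers.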
RedPajama v2
| Description | A large-scale, open dataset created by Together AI to support transparent LLM development. Includes over 100 billion text documents with quality signals and pre-computed metadata for filtering. |
| Size | ~30 trillion tokens (raw); quality-filtered subsets are smaller |
| Format | JSONL with quality annotations and metadata |
| Access | Hugging Face Hub (togethercomputer/RedPajama-Data-V2) |
| License | Open. Apache 2.0 for the dataset tooling; underlying content retains its original copyright. |
| Limitations | Quality filtering is left to the user, requiring careful selection of quality thresholds. Predominantly English, with limited multilingual coverage. |
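User-side filtering with pre-computed quality signals amounts to thresholding metadata rather than re-processing text. The signal names and thresholds below are illustrative assumptions, not RedPajama v2's actual schema:

```python
# Hypothetical documents annotated with pre-computed quality signals,
# in the spirit of RedPajama v2's metadata.
docs = [
    {"text": "A long, well-formed article ...",
     "signals": {"word_count": 1200, "perplexity": 180.0, "dup_fraction": 0.02}},
    {"text": "buy cheap buy cheap buy cheap",
     "signals": {"word_count": 6, "perplexity": 2400.0, "dup_fraction": 0.83}},
]

def passes(sig: dict) -> bool:
    return (sig["word_count"] >= 100        # long enough to be a real document
            and sig["perplexity"] < 1000.0  # fluent under a reference LM
            and sig["dup_fraction"] < 0.3)  # not mostly repeated n-grams

kept = [d for d in docs if passes(d["signals"])]
print(len(kept))  # → 1
```

The appeal of this design is that the expensive annotation pass runs once; every team can then explore different thresholds cheaply.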
FineWeb / FineWeb-Edu
| Description | A high-quality web dataset created by Hugging Face, derived from Common Crawl with extensive deduplication and quality filtering. FineWeb-Edu is a subset filtered for educational content using a classifier, yielding exceptionally high-quality training data. |
| Size | FineWeb: 15 trillion tokens; FineWeb-Edu: 1.3 trillion tokens |
| Format | Parquet files with text and metadata |
| Access | Hugging Face Hub (HuggingFaceFW/fineweb, HuggingFaceFW/fineweb-edu) |
| License | Open (ODC-By 1.0) |
| Limitations | Primarily English. The educational classifier may introduce biases in what counts as "educational" content. Still derived from web crawls, so some noise persists. |
DCLM (DataComp for Language Models)
| Description | A benchmark and dataset initiative that treats data curation as a systematic competition. Teams propose filtering strategies for a fixed pool of Common Crawl data, and the resulting models are evaluated to find optimal data recipes. |
| Size | Pool: ~240 trillion tokens from Common Crawl; curated baselines range from 2 to 4 trillion tokens |
| Format | Parquet files with text and metadata |
| Access | Hugging Face Hub (mlfoundations/dclm-pool, mlfoundations/dclm-baseline) |
| License | Open. Component data is released under various open licenses. |
| Limitations | Focused on English text. The competition framework evaluates data quality only through downstream model performance on standard benchmarks, which may not capture all quality dimensions. |
The Trend Toward Data Quality
The trajectory from Common Crawl to FineWeb-Edu reflects a major shift in the field: the realization that data quality matters more than data quantity. Chinchilla scaling laws (Section 6.3) showed that more tokens help, but subsequent work demonstrated that a smaller set of carefully filtered tokens can outperform a larger noisy set. Modern pretraining recipes invest heavily in deduplication, quality classification, and content filtering.
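Of the three steps named above, deduplication is the simplest to illustrate. A minimal sketch of the exact-match case, hashing whitespace- and case-normalized text (production pipelines add near-duplicate methods such as MinHash):

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text, keeping first occurrences."""
    seen = set()
    unique = []
    for text in docs:
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

corpus = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup_exact(corpus))  # → ['The cat sat.', 'A different document.']
```

Hashing keeps memory bounded to one digest per unique document, which is what makes this workable at web scale.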
Datasets in Practice
Load and stream a large pretraining dataset from the Hugging Face Hub.
```python
# pip install datasets
from datasets import load_dataset

# Stream FineWeb-Edu to avoid downloading the full 1.3T tokens
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    split="train",
    streaming=True,
)

sample = next(iter(ds))
print(sample["text"][:300])
print(f"Keys: {list(sample.keys())}")
```
Code Fragment 1: Streaming the FineWeb-Edu dataset with streaming=True to avoid downloading the full 1.3 trillion tokens. The iterator yields one example at a time, making it feasible to inspect massive datasets on any machine.