Datasets & Benchmarks

Section 5.4

The dataset catalogue for Part I is short on purpose. To learn a concept, you want a corpus small enough to load in seconds, well-understood enough that your bug is in your code and not in the data, and large enough that a learning curve is visible. The classic teaching datasets (MNIST, CIFAR-10, SQuAD, GLUE, and a small language-modeling corpus) satisfy all three criteria. Resist the urge to start with Common Crawl; you will spend more time fighting the loader than studying the model.

The benchmarks listed below also serve a second purpose: they are the reference points that the papers discussed in Parts II through XII compare against. When a 2026 paper claims state-of-the-art on GLUE, the claim is well defined because the GLUE tasks and splits have been frozen since 2018. Knowing the benchmarks at the level of "what is in the test set" lets you read research with calibrated skepticism.

[Figure: Part I reference datasets by size and task]
Figure 5.4.1: The Part I dataset menagerie plotted by size and task. The teaching corpora (MNIST, CIFAR-10, IMDB, GLUE, SQuAD, WikiText-103, TinyStories) all fit comfortably on a laptop. ImageNet-1k and FineWeb-10BT sit at the boundary where Part II's larger machinery becomes necessary. Every dataset here loads with a single load_dataset() call.

5.4.1 Vision datasets

5.4.2 Text datasets

5.4.3 Comparing the teaching datasets

Table 5.4.1a: Reference datasets for Part I.
| Dataset | Size | Task | Why it is in Part I |
| --- | --- | --- | --- |
| MNIST | 60k images | 10-class image classification | Smallest non-trivial dataset; classic "first run" |
| CIFAR-10 | 50k images | 10-class image classification | First "real" deep-learning benchmark |
| SQuAD 2.0 | ~150k QA pairs | Extractive QA | Reference for span-prediction transformers |
| GLUE | ~1M examples across 9 tasks | NL understanding suite | Standard fine-tuning benchmark |
| WikiText-103 | ~103M tokens | Causal LM | Compact pretraining-like corpus |

5.4.4 Loading them with one line

The datasets library handles all of these:

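from datasets import load_dataset

# SST-2: binary sentiment classification, one of the nine GLUE tasks
glue_sst2 = load_dataset("nyu-mll/glue", "sst2")
# SQuAD 2.0: extractive QA, including unanswerable questions
squad = load_dataset("rajpurkar/squad_v2")
# WikiText-103, raw (untokenized) variant
wikitext = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
Code Fragment 5.4.1b: Loading SST-2, SQuAD 2.0, and WikiText-103 with load_dataset.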

For the vision datasets, torchvision.datasets is the right loader: datasets.MNIST(root, train=True, download=True) handles cache and download in one call. The TensorFlow Datasets catalogue mirrors most of the same corpora and is a useful backup when a Hugging Face mirror is rate-limited.

Warning: GLUE leakage in 2026 pretraining corpora

Modern pretrained models have seen the GLUE test set indirectly through web-scale pretraining. A "new SOTA (state-of-the-art) on GLUE" claim in 2026 should be read as "test-set contamination is the most likely explanation" until proven otherwise. The FineWeb-Edu and Dolma data sheets document benchmark filtering, but no commercial frontier model publishes its pretraining corpus, so contamination has to be assumed by default.
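One rough way to probe for this kind of contamination, when you do have access to a corpus, is to look for word-level n-gram overlap between test examples and corpus documents. The sketch below is a minimal illustration under stated assumptions: the function names, the whitespace tokenization, and the default n are all hypothetical choices, not the method used by any particular data sheet.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    test_examples: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of test examples sharing at least one n-gram with the corpus.

    `n` is a hypothetical default; smaller values flag more false positives,
    larger values miss lightly edited copies.
    """
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)

    examples = list(test_examples)
    if not examples:
        return 0.0
    flagged = sum(1 for ex in examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(examples)


# Toy data: the first test sentence appears verbatim inside the corpus document.
test_set = [
    "the movie was a charming and often affecting journey",
    "a dull and lifeless film",
]
corpus = ["Review dump: the movie was a charming and often affecting journey , said one critic."]
print(contaminated_fraction(test_set, corpus, n=5))
```

A check like this only catches near-verbatim copies; paraphrased or translated test items pass straight through, which is one reason contamination is hard to rule out.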

Tip: split before you shuffle

The fastest way to fool yourself into believing a model is learning is to leak the test set into training through a sloppy split. The Hugging Face datasets library exposes train_test_split with a seed argument; set it so the split is reproducible. For SQuAD, GLUE, and the vision datasets, the official splits are pre-defined and you should never rebuild them.
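The split-first principle can also be sketched without any library. The example below uses a hash-based split, which is a different technique from train_test_split: each example is assigned to a side based only on its ID, so the assignment survives re-ordering and re-runs. The function name, the ID format, and the 20% test fraction are assumptions made for illustration.

```python
import hashlib
import random
from typing import List, Sequence, Tuple


def stable_split(
    example_ids: Sequence[str],
    test_fraction: float = 0.2,
) -> Tuple[List[str], List[str]]:
    """Assign each example to train or test by hashing its ID.

    The assignment depends only on the ID, so re-running, re-ordering, or
    appending data never moves an existing example across the split.
    """
    train: List[str] = []
    test: List[str] = []
    for ex_id in example_ids:
        digest = hashlib.sha256(ex_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        (test if bucket < test_fraction else train).append(ex_id)
    return train, test


ids = [f"example-{i}" for i in range(1000)]  # hypothetical example IDs
train_ids, test_ids = stable_split(ids)

# Split first, then shuffle only the training portion.
random.Random(0).shuffle(train_ids)

# Leak check: no ID may appear on both sides of the split.
assert not set(train_ids) & set(test_ids)
```

The final assertion is the part worth keeping in any pipeline, whichever splitting method you use: it is cheap, and it fails loudly the moment a test example reaches the training set.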

5.4.5 What "good enough" looks like

For each Part I exercise there is a target metric that signals you have a working pipeline: ~99% accuracy on MNIST, ~92% on CIFAR-10 with a small ResNet, ~92% on SST-2 with BERT-base, and ~75 F1 on SQuAD 2.0 with BERT-base. If you are far below those, the bug is rarely in the model architecture and almost always in either the optimizer settings (learning rate, weight decay) or the data pipeline (wrong tokenization, mislabelled split).
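Targets like these are easy to turn into an automated pass / fail check at the end of a training script. The sketch below is one minimal way to do it, using only the MNIST and CIFAR-10 targets from the text; the dictionary keys, the function name, and the tolerance value are hypothetical.

```python
from typing import Dict

# Targets taken from the text above; extend with your own exercises.
TARGETS: Dict[str, float] = {
    "mnist_accuracy": 0.99,
    "cifar10_accuracy": 0.92,
}


def pipeline_ok(name: str, measured: float, tolerance: float = 0.02) -> bool:
    """Return True if `measured` is at most `tolerance` below the target.

    `tolerance` is a hypothetical slack value, not a number from the text.
    """
    return measured >= TARGETS[name] - tolerance


print(pipeline_ok("mnist_accuracy", 0.991))   # at or above target
print(pipeline_ok("cifar10_accuracy", 0.71))  # far below: check LR and data pipeline first
```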

Warning: 99% on MNIST is a smoke test, not a result

Hitting 99% on MNIST tells you the training loop ran. It tells you essentially nothing about how the model would behave on text, on a different image distribution, on adversarial inputs, or at scale. Treat the target metrics in this section as pass/fail signals for your pipeline, not as model-quality claims. Harder LLM suites (BIG-bench Lite, 2022, and BIG-Bench Extra Hard, 2025) are the modern equivalent for 2026 LLM evaluation, but you only need them once you reach Part VIII; the cross-link is Section 10.9.

What's Next?

In the next section, Section 5.5: Models, we build on the material covered here.

Further Reading

Foundational Datasets

Gokaslan, A., & Cohen, V. (2019). "OpenWebText Corpus." skylion007.github.io/OpenWebTextCorpus. Open replication of GPT-2's training corpus; the standard small-LM pretraining dataset.
Gao, L., Biderman, S., Black, S., et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027. The reference open pretraining corpus.
Hugging Face (2024). "datasets Library Documentation." huggingface.co/docs/datasets. Reference for streaming and memory-mapped dataset loading.

Benchmarks

Wang, A., Pruksachatkun, Y., Nangia, N., et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. arXiv:1905.00537. Reference NLP-understanding benchmark.
Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR 2021. arXiv:2009.03300. The standard multi-task LLM benchmark.