This appendix is a reference guide to the datasets used to train large language models and the benchmarks used to evaluate them. It covers major pretraining corpora (The Pile, C4, RedPajama, ROOTS, and others), instruction and alignment datasets (Alpaca, FLAN, ShareGPT, OpenAssistant), evaluation benchmarks (MMLU, HumanEval, BIG-Bench, MT-Bench, HELM), a comparative benchmark summary table, and licensing considerations for datasets used in commercial applications.
Data is the substrate of every LLM capability and limitation. Understanding what a model was trained on explains both what it knows and what it does not. Understanding the benchmarks used to evaluate it reveals what "state-of-the-art" actually measures and where reported numbers should be treated with skepticism. Practitioners who can read a model card's training data and benchmark results critically are better equipped to select models, interpret published claims, and design their own evaluations.
This appendix is relevant for researchers designing experiments, engineers selecting base models for fine-tuning, and anyone who needs to choose evaluation benchmarks or understand what published benchmark numbers mean in practical terms. It is also an essential companion for fine-tuning projects where dataset selection and licensing must be verified before commercial deployment.
Pretraining datasets directly inform the scaling laws and data efficiency concepts in Chapter 6 (Pretraining and Scaling Laws). Fine-tuning datasets are central to Chapter 14 (Fine-Tuning Fundamentals). The benchmarks here are used to interpret results in Chapter 29 (Evaluation). Model cards that cite these benchmarks are covered in Appendix H (Model Cards and Selection Guide).
No specific prerequisites are required to browse this appendix as a reference. Understanding how pretraining works (covered in Chapter 6) will help you interpret the pretraining dataset entries. For instruction-tuning datasets, familiarity with fine-tuning concepts from Chapter 14 provides useful context.
Consult this appendix when selecting a base model and wanting to understand what data it was trained on, when choosing benchmarks for your own evaluation suite, or when verifying dataset licensing before including data in a commercial fine-tuning pipeline. Section J.5 (Dataset Licensing Considerations) is particularly important before any commercial deployment. If you are comparing published model results across papers, Sections J.3 and J.4 provide the context needed to interpret what those benchmark scores actually measure and where comparisons break down due to different evaluation protocols.
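As a concrete illustration of the licensing check motivated above, the sketch below filters candidate fine-tuning datasets against an allowlist of license tags commonly treated as commercial-friendly. The dataset names, license tags, and allowlist contents are illustrative assumptions for this example, not legal guidance — always verify the actual license terms (Section J.5) before deployment.

```python
# Sketch: screen candidate datasets by license tag before adding them
# to a commercial fine-tuning pipeline. Entries and allowlist are
# illustrative assumptions, not legal advice.

# Hypothetical allowlist of license identifiers permitting commercial use.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0", "odc-by"}

# Hypothetical candidate datasets with their declared license tags.
candidates = [
    {"name": "corpus-a", "license": "apache-2.0"},
    {"name": "corpus-b", "license": "cc-by-nc-4.0"},  # non-commercial: excluded
    {"name": "corpus-c", "license": "mit"},
]

def commercially_usable(datasets, allowlist=COMMERCIAL_OK):
    """Return names of datasets whose license tag is on the allowlist."""
    return [d["name"] for d in datasets if d["license"].lower() in allowlist]

print(commercially_usable(candidates))  # ['corpus-a', 'corpus-c']
```

A tag-level check like this catches obvious exclusions (e.g., `cc-by-nc-*` variants), but license metadata on dataset hubs is self-reported, so it should be a first filter rather than a final verdict.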