Chapter 6: Pretraining, Scaling Laws & Data Curation

Chapter opener illustration: Pretraining.

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
Scale, Computationally Devout AI Agent

Looking Back

Part I built up to a working Transformer. That Transformer needs to be trained on something, and the something is "most of the internet, plus everything we could license or scrape." This chapter zooms out from the architecture to the training recipe: what data goes in (and how it is cleaned), what objective the model optimizes, and the scaling laws that predict performance before you spend a million dollars on compute. Chinchilla, Kaplan, and the Chinchilla-vs-Kaplan reconciliation all live here.

Chapter Overview

This chapter takes you behind the curtain of modern language model development. While the Transformer architecture (Chapter 4) provides the blueprint, the real story of LLMs is one of scale: billions of parameters trained on trillions of tokens, consuming hundreds of thousands to millions of GPU hours. Understanding how this process works is essential for anyone building with or reasoning about these systems.

We begin by surveying the landmark models that shaped the field, from BERT to GPT-4 (see also the Modern LLM Landscape in Chapter 7). We then dissect the pretraining objectives that teach models to understand and generate language. Next, we explore the scaling laws that govern how model performance improves with more compute, data, and parameters, and the data curation pipelines that supply the raw material. We then cover the optimization algorithms and distributed training infrastructure that make billion-parameter training feasible (inference-time optimization is covered in Chapter 9). Finally, we examine the theoretical question of how in-context learning actually works inside Transformers.

Big Picture

Pretraining is where the foundation model actually gets built. This chapter walks through the data, objectives, scaling laws, and distributed-training systems that make a frontier model possible. Most readers will never pretrain from scratch, but understanding what happens during pretraining is the prerequisite for every downstream decision: which fine-tuning method to choose, why certain failure modes exist, and how to plan compute budgets.

Learning Objectives

Trace the evolution from BERT to GPT-4, identifying the key architectural and training decisions that defined each era (continued in Chapter 7)
Compare and implement pretraining objectives: causal LM, masked LM, span corruption, fill-in-the-middle, and multi-token prediction
Apply Kaplan and Chinchilla scaling laws to estimate optimal model size and data requirements for a given compute budget
Design a data curation pipeline with deduplication, quality filtering, and domain mixing
Explain how Adam, AdamW, and Adafactor work, and select appropriate learning rate schedules for large-scale training
Distinguish between DDP, FSDP, ZeRO, tensor parallelism, and pipeline parallelism, and select the right strategy for a given hardware setup (see also Chapter 18: Fine-Tuning for applying these in practice)
Discuss leading theories of in-context learning: meta-learning, implicit gradient descent, and task vectors
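
As a preview of the scaling-law objective above, here is a minimal sketch of a compute-optimal estimate. It assumes two widely cited rules of thumb rather than the fitted laws developed later in the chapter: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla result is often summarized as roughly 20 training tokens per parameter. The function name `chinchilla_optimal` and the `tokens_per_param` default are illustrative choices, not part of any library.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training FLOP budget into (parameters, tokens).

    Assumptions (rules of thumb, not fitted constants):
      * compute C ~= 6 * N * D
      * compute-optimal data D ~= tokens_per_param * N
    Substituting gives C ~= 6 * tokens_per_param * N**2, so
    N ~= sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budgets, chosen only to show how the split scales.
    for budget in (1e21, 1e23, 5.88e23):
        n, d = chinchilla_optimal(budget)
        print(f"C = {budget:.2e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Under these assumptions, a budget of 5.88e23 FLOPs works out to about 70 billion parameters and 1.4 trillion tokens. Because N grows with the square root of C, a 100x larger budget buys only a 10x larger model, with the rest of the budget going to data.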

Prerequisites

Solid understanding of the Transformer architecture (Chapter 3)
Familiarity with attention mechanisms and positional encodings (Chapter 2)
Basic PyTorch proficiency: training loops, autograd, nn.Module (Chapter 0)
Understanding of tokenization and subword models (Chapter 1)
Comfort with basic probability and information theory (cross-entropy, perplexity)
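
As a quick self-check on the last prerequisite, the sketch below computes cross-entropy and perplexity from the probabilities a model assigned to the correct tokens. It assumes natural logarithms (nats), so perplexity is simply the exponential of the mean cross-entropy; the function names are illustrative.

```python
import math


def cross_entropy(probs_of_true_tokens):
    """Mean negative log-likelihood, in nats, of the correct tokens."""
    total = sum(math.log(p) for p in probs_of_true_tokens)
    return -total / len(probs_of_true_tokens)


def perplexity(probs_of_true_tokens):
    """Perplexity is exp(cross-entropy) when the loss is measured in nats."""
    return math.exp(cross_entropy(probs_of_true_tokens))


if __name__ == "__main__":
    # A model that spreads its belief evenly over 4 options at every step.
    uniform = [0.25, 0.25, 0.25, 0.25]
    print(cross_entropy(uniform), perplexity(uniform))
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4: it is as uncertain as a uniform choice among four options. If that reading feels natural, you have the background this chapter relies on.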

What's Next?

Next: Chapter 7: Modern LLM Landscape & Model Internals. Scaling laws tell you what is achievable; Chapter 7 maps what was actually built. We tour the 2026 model zoo (GPT-5, Claude 4, Gemini 2.x, Llama-4, DeepSeek, Qwen) and look inside each at the architectural choices (MoE routing, attention variants, MLA, multimodal fusion) that distinguish frontier APIs from open-weight contenders.