Pretraining, Scaling Laws & Data Curation

Chapter opener illustration: Pre-training.

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

Scale, Computationally Devout AI Agent

Looking Back

Part I built up to a working Transformer. That Transformer needs to be trained on something, and the something is "most of the internet, plus everything we could license or scrape." This chapter zooms out from the architecture to the training recipe: what data goes in (and how it is cleaned), what objective the model optimizes, and the scaling laws that predict performance before you spend a million dollars on compute. Chinchilla, Kaplan, and the Chinchilla-vs-Kaplan reconciliation all live here.

Chapter Overview

This chapter takes you behind the curtain of modern language model development. While the Transformer architecture (Chapter 04) provides the blueprint, the real story of LLMs is one of scale: billions of parameters trained on trillions of tokens, consuming anywhere from thousands to millions of GPU hours. Understanding how this process works is essential for anyone building with or reasoning about these systems.

We begin by surveying the landmark models that shaped the field, from BERT to GPT-4 (see also the Modern LLM Landscape in Chapter 07). We then dissect the pretraining objectives that teach models to understand and generate language. Next, we explore the scaling laws that govern how model performance improves with more compute, data, and parameters, and the data curation pipelines that supply the raw material. We cover the optimization algorithms and distributed training infrastructure that make billion-parameter training feasible (with further inference-time optimization covered in Chapter 09). Finally, we examine the fascinating theoretical question of how in-context learning actually works inside transformers.
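As a small preview of the scaling-law reasoning this chapter develops, the sketch below splits a fixed compute budget between parameters and training tokens. It rests on two widely quoted rules of thumb that are assumptions here, not results stated on this page: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style compute-optimal run uses about 20 tokens per parameter. The function name `compute_optimal_split` and both constants are illustrative only; the fitted values discussed later in the chapter may differ.

```python
import math

# Assumption: training cost is approximated as C = 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: Chinchilla-style rule of thumb of ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (N, D) satisfying 6 * N * D = compute_flops and D = 20 * N.

    This is a back-of-the-envelope estimate, not a fitted scaling law.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budgets chosen only to show how N and D grow with compute.
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_split(budget)
        print(f"C = {budget:.0e} FLOPs: N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Under these assumptions both N and D grow with the square root of the compute budget, so a 100x larger budget buys roughly a 10x larger model trained on roughly 10x more tokens. Treat the output as an order-of-magnitude planning aid rather than a prediction of model quality.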

Big Picture

Pretraining is where the foundation model actually gets built. This chapter walks through the data, objectives, scaling laws, and distributed-training systems that make a frontier model possible. Most readers will never pretrain from scratch, but understanding what happens during pretraining is the prerequisite for every downstream decision: which fine-tuning method to choose, why certain failure modes exist, and how to plan compute budgets.

Note: Learning Objectives

Prerequisites

Sections

What's Next?

Next: Chapter 7: Modern LLM Landscape & Model Internals. Scaling laws tell you what is achievable; Chapter 7 maps what was actually built. We tour the 2026 model zoo (GPT-5, Claude 4, Gemini 2.x, Llama-4, DeepSeek, Qwen) and look inside each at the architectural choices (MoE routing, attention variants, MLA, multimodal fusion) that distinguish frontier APIs from open-weight contenders.