Chapter 06: Pre-training, Scaling Laws & Data Curation

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
Scale, Computationally Devout AI Agent

**Figure 6.0.1**: Behind the curtain of modern LLMs lies a story of scale: billions of parameters trained on trillions of tokens, governed by mathematical scaling laws that predict performance before a single GPU is fired up.

Chapter Overview

This chapter takes you behind the curtain of modern language model development. While the Transformer architecture (Chapter 04) provides the blueprint, the real story of LLMs is one of scale: billions of parameters trained on trillions of tokens, consuming anywhere from thousands to millions of GPU hours. Understanding how this process works is essential for anyone building with or reasoning about these systems.

We begin by surveying the landmark models that shaped the field, from BERT to GPT-4 (see also the Modern LLM Landscape in Chapter 07). We then dissect the pre-training objectives that teach models to understand and generate language. Next, we explore the scaling laws that govern how model performance improves with more compute, data, and parameters, and the data curation pipelines that supply the raw material. We cover the optimization algorithms and distributed training infrastructure that make billion-parameter training feasible (with further inference-time optimization covered in Chapter 09). Finally, we examine the fascinating theoretical question of how in-context learning actually works inside transformers.

Prerequisites

Solid understanding of the Transformer architecture (Chapter 04)
Familiarity with attention mechanisms and positional encodings (Chapter 03)
Basic PyTorch proficiency: training loops, autograd, nn.Module (Chapter 00)
Understanding of tokenization and subword models (Chapter 02)
Comfort with basic probability and information theory (cross-entropy, perplexity)
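
As a quick self-check on the last prerequisite, the relationship between cross-entropy and perplexity can be sketched in a few lines. This is a minimal illustration, not code from the chapter: the function names and the example probabilities are hypothetical, and the cross-entropy is measured in nats (natural log).

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens.

    token_probs holds the probability the model assigned to each
    correct next token; the values used below are made up for illustration.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(cross_entropy(token_probs))

# Hypothetical model that gives every correct token probability 0.25,
# i.e. it is as uncertain as a uniform choice among four options.
uniform_four = [0.25, 0.25, 0.25, 0.25]

# Hypothetical model that is more confident on most tokens.
more_confident = [0.9, 0.8, 0.5, 0.7]

print(perplexity(uniform_four))
print(perplexity(more_confident))
```

The first value comes out at about 4, which matches the usual reading of perplexity as an effective number of equally likely choices per token; the second is lower because the model assigns higher probability to the correct tokens.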

Learning Objectives

Trace the evolution from BERT to GPT-4, identifying the key architectural and training decisions that defined each era (continued in Chapter 07)
Compare and implement pre-training objectives: causal LM, masked LM, span corruption, fill-in-the-middle, and multi-token prediction
Apply Kaplan and Chinchilla scaling laws to estimate optimal model size and data requirements for a given compute budget
Design a data curation pipeline with deduplication, quality filtering, and domain mixing
Explain how Adam, AdamW, and Adafactor work, and select appropriate learning rate schedules for large-scale training
Distinguish between DDP, FSDP, ZeRO, tensor parallelism, and pipeline parallelism, and select the right strategy for a given hardware setup (see also Chapter 14: Fine-Tuning for applying these in practice)
Discuss leading theories of in-context learning: meta-learning, implicit gradient descent, and task vectors
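
To make the scaling-law objective concrete, here is a rough sketch of a compute-optimal estimate. It rests on two widely used approximations that are assumptions here, not results stated on this page: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of about 20 training tokens per parameter often drawn from Hoffmann et al. (2022). The function names are hypothetical.

```python
import math

# Assumption: training cost is roughly 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: rule-of-thumb ratio of training tokens to parameters.
TOKENS_PER_PARAM = 20

def training_flops(n_params, n_tokens):
    """Approximate training compute using C = 6 * N * D."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to training a 70B-parameter model on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(budget)
print(f"{budget:.2e} FLOPs -> {n_opt / 1e9:.1f}B params, {d_opt / 1e12:.2f}T tokens")
```

Because the example budget was built from a 70B-parameter, 1.4T-token configuration (the scale reported for Chinchilla), the sketch recovers roughly those same numbers. Real planning would use fitted coefficients rather than a fixed ratio.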

Sections

What's Next?

In the next section, Section 6.1: The Landmark Models, we trace the models that defined each era of language model development, from BERT and GPT to today's frontier systems.

Bibliography & Further Reading

Foundational Papers

Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arxiv.org/abs/2001.08361

Establishes power-law relationships between model size, dataset size, compute, and loss for Transformer language models.

Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. arxiv.org/abs/2203.15556

The Chinchilla paper, showing that many large models of the time were significantly undertrained for their size and recommending that model size and training tokens be scaled in roughly equal proportion as compute grows.

Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. arxiv.org/abs/1810.04805

Introduces masked language modeling as a pretraining objective, enabling bidirectional representation learning.

Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arxiv.org/abs/2005.14165

The GPT-3 paper demonstrating that scaling to 175B parameters enables strong few-shot and in-context learning without fine-tuning.

Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 21(140), 1–67. arxiv.org/abs/1910.10683

Introduces T5 and the text-to-text framework, systematically comparing pretraining objectives, model sizes, and data strategies.

Penedo, G. et al. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." arxiv.org/abs/2306.01116

Describes a large-scale web data curation pipeline covering deduplication, quality filtering, and content extraction for LLM pretraining.

Rajbhandari, S. et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC 2020. arxiv.org/abs/1910.02054

Introduces ZeRO (Zero Redundancy Optimizer), which partitions optimizer states, gradients, and parameters across data-parallel processes, enabling training of very large models.

Essays, Surveys & Additional Papers

Sutton, R. S. (2019). "The Bitter Lesson." incompleteideas.net/IncIdeas/BitterLesson.html

An influential essay arguing that general methods leveraging computation consistently outperform methods exploiting human knowledge.

Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. arxiv.org/abs/1711.05101

Introduces AdamW, the decoupled weight decay variant of Adam that is now the standard optimizer for LLM pretraining.

Zhao, W. X. et al. (2023). "A Survey of Large Language Models." arxiv.org/abs/2303.18223

A comprehensive survey covering pretraining techniques, scaling laws, alignment methods, and the evolution of LLM families.

Tools & Libraries

DeepSpeed. github.com/microsoft/DeepSpeed

Microsoft's distributed training library implementing ZeRO optimization, pipeline parallelism, and mixed-precision training at scale.

Megatron-LM. github.com/NVIDIA/Megatron-LM

NVIDIA's framework for efficient large-scale Transformer training with tensor parallelism, pipeline parallelism, and sequence parallelism.

PyTorch FSDP (Fully Sharded Data Parallel). pytorch.org/docs/stable/fsdp.html

PyTorch's native implementation of fully sharded data parallelism, enabling memory-efficient distributed training without external libraries.

Hugging Face Datasets. github.com/huggingface/datasets

A library for efficient loading, processing, and streaming of large-scale text datasets used in LLM pretraining pipelines.