Part II: Understanding LLMs
Chapter 06: Pretraining & Scaling Laws

Pre-training, Scaling Laws & Data Curation

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

— Rich Sutton, "The Bitter Lesson" (2019)
Figure 6.0.1: Behind the curtain of modern LLMs lies a story of scale: billions of parameters trained on trillions of tokens, governed by mathematical scaling laws that predict performance before a single GPU is fired up.

Chapter Overview

This chapter takes you behind the curtain of modern language model development. While the Transformer architecture (Chapter 04) provides the blueprint, the real story of LLMs is one of scale: billions of parameters trained on trillions of tokens, consuming thousands of GPU hours. Understanding how this process works is essential for anyone building with or reasoning about these systems.

We begin by surveying the landmark models that shaped the field, from BERT to GPT-4 (see also the Modern LLM Landscape in Chapter 07). We then dissect the pre-training objectives that teach models to understand and generate language. Next, we explore the scaling laws that govern how model performance improves with more compute, data, and parameters, and the data curation pipelines that supply the raw material. We cover the optimization algorithms and distributed training infrastructure that make billion-parameter training feasible (with further inference-time optimization covered in Chapter 09). Finally, we examine the fascinating theoretical question of how in-context learning actually works inside transformers.
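As a concrete taste of the scaling laws discussed later in the chapter, the sketch below encodes the parametric loss L(N, D) = E + A/N^α + B/D^β fitted in the Chinchilla paper (Hoffmann et al., 2022), together with the common C ≈ 6ND rule of thumb for training FLOPs, to split a fixed compute budget between parameters and tokens. The coefficient values are the published Chinchilla fits; the function names are illustrative, not from any library.

```python
# Chinchilla-style compute-optimal scaling (illustrative sketch).
# Loss model: L(N, D) = E + A / N**alpha + B / D**beta
# Training compute rule of thumb: C ~= 6 * N * D FLOPs.

E, A, B = 1.69, 406.4, 410.7   # fitted constants from Hoffmann et al. (2022)
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted pretraining loss for N parameters and D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal(c_flops):
    """Split a FLOP budget C between parameters and tokens.

    Under the constraint C = 6*N*D, minimizing L gives power laws
    N* = G * (C/6)**a and D* = (1/G) * (C/6)**b, where
    a = beta/(alpha+beta) and b = alpha/(alpha+beta).
    """
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    # G is the prefactor obtained from the first-order optimality condition.
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    n_opt = G * (c_flops / 6) ** a
    d_opt = (1 / G) * (c_flops / 6) ** b
    return n_opt, d_opt

# Example: allocate a Gopher-scale budget of ~5.76e23 FLOPs.
n, d = compute_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}, predicted loss {loss(n, d):.3f}")
```

By construction the returned pair satisfies 6·N·D ≈ C, and increasing either N or D in isolation always lowers the predicted loss; the interesting question the chapter explores is how to trade them off under a fixed budget.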

Prerequisites

Learning Objectives

Sections

What's Next?

In the next section, Section 6.1: The Landmark Models, we trace the landmark models that defined each era of language model development, from BERT and GPT to today's frontier systems.

Bibliography & Further Reading

Foundational Papers

Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arxiv.org/abs/2001.08361
Establishes power-law relationships between model size, dataset size, compute, and loss for Transformer language models.
Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. arxiv.org/abs/2203.15556
The Chinchilla paper showing that most LLMs were significantly undertrained, recommending equal scaling of model and data size.
Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. arxiv.org/abs/1810.04805
Introduces masked language modeling as a pretraining objective, enabling bidirectional representation learning.
Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arxiv.org/abs/2005.14165
The GPT-3 paper demonstrating that scaling to 175B parameters enables strong few-shot and in-context learning without fine-tuning.
Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 21(140), 1–67. arxiv.org/abs/1910.10683
Introduces T5 and the text-to-text framework, systematically comparing pretraining objectives, model sizes, and data strategies.
Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." NeurIPS 2024. arxiv.org/abs/2406.17557
Describes a large-scale web data curation pipeline covering deduplication, quality filtering, and content extraction for LLM pretraining.
Rajbhandari, S. et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC 2020. arxiv.org/abs/1910.02054
Introduces the ZeRO optimizer that partitions model states across data-parallel processes, enabling training of very large models.

Key Essays & Surveys

Sutton, R. S. (2019). "The Bitter Lesson." incompleteideas.net/IncIdeas/BitterLesson.html
An influential essay arguing that general methods leveraging computation consistently outperform methods exploiting human knowledge.
Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. arxiv.org/abs/1711.05101
Introduces AdamW, the decoupled weight decay variant of Adam that is now the standard optimizer for LLM pretraining.
Zhao, W. X. et al. (2023). "A Survey of Large Language Models." arxiv.org/abs/2303.18223
A comprehensive survey covering pretraining techniques, scaling laws, alignment methods, and the evolution of LLM families.

Tools & Libraries

DeepSpeed. github.com/microsoft/DeepSpeed
Microsoft's distributed training library implementing ZeRO optimization, pipeline parallelism, and mixed-precision training at scale.
Megatron-LM. github.com/NVIDIA/Megatron-LM
NVIDIA's framework for efficient large-scale Transformer training with tensor parallelism, pipeline parallelism, and sequence parallelism.
PyTorch FSDP (Fully Sharded Data Parallel). pytorch.org/docs/stable/fsdp.html
PyTorch's native implementation of fully sharded data parallelism, enabling memory-efficient distributed training without external libraries.
Hugging Face Datasets. github.com/huggingface/datasets
A library for efficient loading, processing, and streaming of large-scale text datasets used in LLM pretraining pipelines.