Distributed Training Systems

Chapter opener illustration: Distributed Training Systems.

"Distributed training is what happens when one GPU is not enough and one strategy is not either."

ScaleScale, Distributed-Disciplined AI Agent
Looking Back

Chapter 58 chose the hardware; this chapter coordinates many copies of it. DDP, FSDP, ZeRO, tensor parallelism, pipeline parallelism, sequence parallelism, and the small but consequential question of which sharding strategy fits your model and your interconnect.

Big Picture

Frontier LLMs are trained on clusters of thousands of accelerators. This chapter covers the three orthogonal axes of parallelism (data, tensor, pipeline) and the algorithms that ride on each: ZeRO and FSDP shard data-parallel state so that a 70B model fits on 80 GB GPUs; Megatron-LM partitions a single matrix multiplication across an NVLink-connected node; 1F1B and zero-bubble pipelines move activations between stages with bounded idle time; 3D parallelism composes all three. Section 59.5 then turns to the operational reality of running a 30-day training job across 1000+ GPUs without losing a week to a single failed checkpoint: sharded async checkpoints, elastic schedulers, MFU dashboards, and the public post-mortems from OPT-175B, BLOOM, and Llama-3 that turned each lesson into industry practice.

Chapter Overview

Distributed training is the engineering substrate for any model larger than a single GPU. This chapter walks the three axes of parallelism (data, model, tensor) and their NCCL primitives, ZeRO and FSDP for memory-efficient data parallelism (with per-rank arithmetic for 70B and 405B models), Megatron-LM tensor parallelism and sequence parallelism, pipeline parallelism (GPipe, 1F1B, interleaved, zero-bubble) and the 3D-parallelism recipe used by every frontier run, and production training infrastructure: sharded async checkpointing, elastic schedulers (TorchElastic, SkyPilot, MosaicML), MFU dashboards, and the post-mortems from OPT-175B, BLOOM, and Llama-3.

Distributed training is the part of LLM engineering where a single decision can cost weeks and a single bug can cost a run. This chapter is the practitioner's syllabus for the choices that matter.

Note: Learning Objectives

Sections in This Chapter

Prerequisites

Exercise 59.0.1: Sanity-Check a Parallelism Plan

You are budgeting a 70B-parameter pretraining run on a 64-H100 cluster (8 nodes, 8 GPUs each). The team proposes: TP=4 inside each node, PP=2 across pairs of nodes, DP=8. (a) How many model replicas does this give you, and what is the per-GPU activation memory budget if the model needs 80 GB at TP=1? (b) Identify two failure modes that this plan does not explicitly defend against. (c) Pick one and sketch the diagnostic you would run before kicking off the full run.

Answer Sketch

(a) The product TP times PP equals 8, so each model replica spans 8 GPUs and there are 64 / 8 = 8 replicas (matching DP=8). With TP=4, activations are roughly 80 / 4 = 20 GB per GPU, leaving headroom on an 80 GB H100. (b) Two failure modes: (i) inter-node pipeline bubbles dominating step time because PP crosses a slower NVLink-to-Infiniband boundary; (ii) NCCL collectives at the DP=8 boundary contending with TP all-gathers for the same wire. (c) Diagnostic for (i): run a 100-step micro-benchmark with the pipeline schedule, log per-stage compute time vs bubble time, and compare against the theoretical pipeline efficiency formula. If the bubble fraction exceeds 15 percent, the pipeline scheduling needs to be revisited (e.g., interleaved 1F1B, larger micro-batches).

What's Next?

This chapter begins with Section 59.1: Distributed Training Fundamentals. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

Parallelism Foundations

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv preprint. arXiv:1909.08053. The tensor-parallel matmul decomposition that defines column-parallel and row-parallel layers used in every frontier-scale training stack.
Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC '20. arXiv:1910.02054. The optimizer/gradient/parameter sharding scheme that ZeRO-3 and PyTorch FSDP implement; foundational reading for Section 59.2.
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., et al. (2019). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS. arXiv:1811.06965. Defined micro-batch pipelining; the conceptual root of GPipe, 1F1B, interleaved, and zero-bubble schedules covered in 59.4.

3D Parallelism and Production Runs

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., et al. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." SC '21. arXiv:2104.04473. The reference 3D-parallelism (TP x PP x DP) study at 1T parameters; the throughput model used in 59.4.
Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." VLDB, 16(12). arXiv:2304.11277. The reference FSDP paper; the production framework behind Llama-3 and OPT training, anchoring 59.5 case studies.

Production Postmortems

Le Scao, T., et al. (BigScience). (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint. arXiv:2211.05100. Includes the openly published BLOOM training logbook; the standard reference for distributed-run incident postmortems.