
"Distributed training is what happens when one GPU is not enough and one strategy is not either."
Scale, Distributed-Disciplined AI Agent
Chapter 58 chose the hardware; this chapter coordinates many copies of it. It covers DDP, FSDP, ZeRO, tensor parallelism, pipeline parallelism, sequence parallelism, and the small but consequential question of which sharding strategy fits your model and your interconnect.
Frontier LLMs are trained on clusters of thousands of accelerators. This chapter covers the three orthogonal axes of parallelism (data, tensor, pipeline) and the algorithms that ride on each: ZeRO and FSDP shard data-parallel state so that a 70B model fits on 80 GB GPUs; Megatron-LM partitions a single matrix multiplication across an NVLink-connected node; 1F1B and zero-bubble pipelines move activations between stages with bounded idle time; 3D parallelism composes all three. Section 59.5 then turns to the operational reality of running a 30-day training job across 1000+ GPUs without losing a week to a single failed checkpoint: sharded async checkpoints, elastic schedulers, MFU dashboards, and the public post-mortems from OPT-175B, BLOOM, and Llama-3 that turned each lesson into industry practice.
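To make the "70B model fits on 80 GB GPUs" claim concrete, here is a minimal sketch of the per-rank model-state arithmetic. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights plus two Adam moments) and ignores activations, buffers, and fragmentation. The function name and the data-parallel degree of 64 are illustrative choices, not values taken from the chapter.

```python
def zero_bytes_per_rank(num_params, dp_degree, stage):
    """Approximate model-state bytes per rank under ZeRO stage 0-3.

    Assumes mixed-precision Adam: 2 B weights + 2 B grads + 12 B optimizer
    state per parameter. Activations are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params   # half-precision weights
    grads = 2 * num_params    # half-precision gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:            # stage 1 shards optimizer state
        optim /= dp_degree
    if stage >= 2:            # stage 2 also shards gradients
        grads /= dp_degree
    if stage >= 3:            # stage 3 also shards parameters
        params /= dp_degree
    return params + grads + optim


if __name__ == "__main__":
    GB = 1e9
    for stage in range(4):
        gb = zero_bytes_per_rank(70e9, dp_degree=64, stage=stage) / GB
        print(f"ZeRO stage {stage}: {gb:.1f} GB of model state per rank")
```

Under these assumptions, only full sharding (stage 3, or the equivalent FSDP full-shard mode) brings the model state of a 70B model below the 80 GB of a single GPU at a data-parallel degree of 64; activation memory still has to fit in what remains.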
Chapter Overview
Distributed training is the engineering substrate for any model larger than a single GPU. This chapter walks through the three axes of parallelism (data, tensor, pipeline) and their NCCL primitives; ZeRO and FSDP for memory-efficient data parallelism (with per-rank arithmetic for 70B and 405B models); Megatron-LM tensor parallelism and sequence parallelism; pipeline parallelism (GPipe, 1F1B, interleaved, zero-bubble) and the 3D-parallelism recipe used by every frontier run; and production training infrastructure: sharded async checkpointing, elastic schedulers (TorchElastic, SkyPilot, MosaicML), MFU dashboards, and the post-mortems from OPT-175B, BLOOM, and Llama-3.
Distributed training is the part of LLM engineering where a single decision can cost weeks and a single bug can cost a run. This chapter is the practitioner's syllabus for the choices that matter.
- Explain the three axes of parallelism (data, tensor, pipeline) and their NCCL primitives.
- Apply ZeRO stages 1, 2, 3 and FSDP wrapping policies to a 70B or 405B model with per-rank memory arithmetic.
- Configure Megatron-LM tensor and sequence parallelism within an NVLink fanout.
- Choose among GPipe, 1F1B, interleaved, and zero-bubble pipeline schedules and combine the result with tensor and data parallelism into a 3D-parallelism recipe.
- Architect sharded async checkpointing, elastic scheduling, and MFU dashboards for a production training run.
- Diagnose run failures using NCCL flame graphs and post-mortems from OPT-175B, BLOOM, and Llama-3.
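As a rough illustration of the number an MFU dashboard tracks, the sketch below uses the widely quoted approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, ignoring attention FLOPs. The throughput figure and the per-GPU peak are hypothetical inputs; check the vendor datasheet for the real peak of your accelerator and precision.

```python
def model_flops_utilization(num_params, tokens_per_second, num_gpus,
                            peak_flops_per_gpu):
    """Estimate MFU as achieved model FLOP/s divided by aggregate peak FLOP/s.

    Uses the 6 * N * tokens approximation for dense transformers
    (attention FLOPs are ignored, so this slightly underestimates).
    """
    achieved = 6 * num_params * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical inputs: 70B params, 64 GPUs, 60k tokens/s cluster-wide,
    # and an assumed dense BF16 peak of 989e12 FLOP/s per GPU.
    mfu = model_flops_utilization(70e9, 60_000, 64, 989e12)
    print(f"MFU ~ {mfu:.1%}")
```

A sudden drop in this ratio on an otherwise healthy run is usually the first sign that communication or a straggler is stalling the step.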
Prerequisites
- Compute planning from Chapter 57
- Pretraining and optimizers from Chapter 6
- PyTorch training internals from Chapter 0
Sections in This Chapter
- 59.1 Distributed Training Fundamentals (Advanced). The three axes of parallelism (data, tensor, pipeline), NCCL collective primitives, the NVLink and InfiniBand interconnect hierarchy, and the BSP synchronization model that underlies every training run.
- 59.2 ZeRO and FSDP: Memory-Efficient Data Parallelism (Advanced). Stages 1, 2, 3 of optimizer / gradient / parameter sharding; the PyTorch FSDP API and wrapping policies; ZeRO++ and Hybrid-Shard; per-rank memory arithmetic for 70B and 405B models.
- 59.3 Megatron-LM and Tensor Parallelism (Advanced). Column-parallel and row-parallel matmul, attention-head sharding, sequence parallelism via reduce-scatter + all-gather, and the NVLink-fanout upper bound on tensor-parallel degree.
- 59.4 Pipeline Parallelism and Hybrid Strategies (Advanced). GPipe, 1F1B, interleaved virtual stages, zero-bubble pipelines, and the 3D-parallelism recipe (TP × PP × DP) used by every frontier-scale run from BLOOM to Llama-3.
- 59.5 Production Training Infrastructure (Advanced). Sharded async checkpointing with optimal cadence, elastic schedulers (TorchElastic, SkyPilot, MosaicML), MFU dashboards and NCCL flame graphs, and real-world post-mortems from OPT-175B, BLOOM, and Llama-3 405B.
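One common way to reason about checkpoint cadence is Young's first-order approximation, which balances time spent writing checkpoints against work lost on failure: the interval is roughly the square root of twice the checkpoint cost times the mean time between failures. The sketch below is an illustration under that assumption; Section 59.5 may derive its cadence differently, and the 60-second cost and 8-hour MTBF are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval, in seconds.

    checkpoint_cost_s: time the job is stalled per checkpoint
    mtbf_s: mean time between failures for the whole job
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s stall per checkpoint, one failure every 8 h.
    interval = young_checkpoint_interval(60, 8 * 3600)
    print(f"checkpoint every ~{interval / 60:.0f} minutes")
```

With these inputs the interval comes out at roughly half an hour. Asynchronous checkpointing shrinks the stall term, which under this model shortens the recommended interval and reduces the work lost per failure.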
You are budgeting a 70B-parameter pretraining run on a 64-H100 cluster (8 nodes, 8 GPUs each). The team proposes: TP=4 inside each node, PP=2 across pairs of nodes, DP=8. (a) How many model replicas does this give you, and what is the per-GPU activation memory budget if the model needs 80 GB at TP=1? (b) Identify two failure modes that this plan does not explicitly defend against. (c) Pick one and sketch the diagnostic you would run before kicking off the full run.
Answer Sketch
(a) TP × PP = 4 × 2 = 8, so each model replica spans 8 GPUs and there are 64 / 8 = 8 replicas (matching DP=8). With TP=4, activations are roughly 80 / 4 = 20 GB per GPU (before accounting for the further split of layers across the PP=2 stages), leaving headroom on an 80 GB H100. (b) Two failure modes: (i) inter-node pipeline bubbles dominating step time because PP crosses the slower NVLink-to-InfiniBand boundary; (ii) NCCL collectives at the DP=8 boundary contending with TP all-gathers for the same wire. (c) Diagnostic for (i): run a 100-step micro-benchmark with the pipeline schedule, log per-stage compute time versus bubble time, and compare against the theoretical pipeline efficiency formula. If the bubble fraction exceeds 15 percent, revisit the pipeline schedule (e.g., interleaved 1F1B or more micro-batches per step).
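The arithmetic in parts (a) and (c) can be checked with a short sketch. It assumes the standard idle-time fraction for a GPipe or non-interleaved 1F1B schedule, (p - 1) / (m + p - 1) for p stages and m micro-batches, with equal per-stage compute and no communication cost. The function names and the micro-batch counts are illustrative.

```python
def num_replicas(world_size, tp, pp):
    """Number of data-parallel replicas given the TP and PP degrees."""
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // gpus_per_replica


def bubble_fraction(pp, micro_batches):
    """Ideal idle fraction of step time for GPipe / non-interleaved 1F1B.

    Assumes equal compute per stage and free communication.
    """
    return (pp - 1) / (micro_batches + pp - 1)


if __name__ == "__main__":
    print("replicas:", num_replicas(64, tp=4, pp=2))
    for m in (4, 8, 16):  # hypothetical micro-batch counts
        frac = bubble_fraction(pp=2, micro_batches=m)
        flag = "revisit schedule" if frac > 0.15 else "ok"
        print(f"m={m:2d}: ideal bubble {frac:.1%} ({flag})")
```

The measured bubble from the micro-benchmark should sit close to this ideal value; a large gap points at the inter-node link or at imbalanced stages rather than at the schedule itself.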
What's Next?
This chapter begins with Section 59.1: Distributed Training Fundamentals. Each section builds on the previous one, so we recommend reading them in order.