Distributed Training Deep Dive
The practitioner reference for choosing and launching a parallelism strategy. Conceptual treatment of why scale matters is in Chapter 6 (Pretraining and Scaling Laws); this section is the engineering counterpart: which torchrun, accelerate, or deepspeed command to run, what changes in the script, and what the memory math says about whether your job will fit.
The five strategies here (DDP Distributed Data Parallel, FSDP Fully Sharded Data Parallel, ZeRO-1/2/3 the Zero Redundancy Optimizer from DeepSpeed, plus pipeline and tensor parallelism) compose into the 3D parallelism used to pretrain frontier models. We walk the ladder from the simplest case (model fits on one GPU, scale by adding ranks) up to the hardest (a 70B+ model sliced along layers and tensor dimensions across hundreds of GPUs). At each rung we cover the memory math, the launcher command, and the failure modes you actually hit. Cross-references to Section 19.11 (Libraries & Frameworks) for run logging and Section 10.7 (Libraries & Frameworks) for what happens after training are flagged inline.
Prerequisites
This section assumes basic familiarity with single-GPU PyTorch training (forward pass, backward pass, optimizer step) from Section 0.3, the Adam optimizer (which stores two running statistics per parameter), mixed-precision training (bf16 = brain-float-16), and the transformer block structure from Chapter 3. NCCL (NVIDIA Collective Communications Library) is the GPU-to-GPU communication backend that the strategies below rely on for fast all-reduce and all-gather operations.
N.1.1 The Memory Math: Why You Need Parallelism at All
Before choosing a strategy you need to know whether your job fits. For an LLM in mixed precision (bf16 weights, fp32 optimizer state), Adam needs roughly 20 bytes per parameter: 2 for the bf16 weight, 4 for the fp32 master, 4 for the gradient, 8 for the two Adam moments. A 7B model therefore needs ~140 GB of static optimizer-state memory before any activation; an 80 GB H100 cannot hold it without sharding. Activations are the second variable: a 7B Llama at seqlen 4096, batch 4 per GPU adds ~30-40 GB on top. Activation checkpointing (recompute on backward) trades ~30% extra FLOPs for a 4-8x activation-memory reduction and is on by default in modern scripts.
The right question is not "do I have enough GPUs" but "what is my static memory floor and what is my activation budget". DDP replicates the static memory on every GPU; FSDP and ZeRO-3 shard it. If your model's static state exceeds one GPU's VRAM, DDP is not an option no matter how many GPUs you have.
N.1.2 Distributed Data Parallel (DDP): The Simplest Case
DDP is the default when the model fits on a single GPU and you want to scale throughput by processing more data per step. Each rank holds a full replica, processes a different mini-batch, and synchronizes gradients via an all-reduce after the backward pass. The PyTorch implementation in torch.distributed uses NCCL on NVIDIA hardware and is the baseline against which every other strategy is measured. The script reads its rank from environment variables and wraps the model in DistributedDataParallel:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM, AutoTokenizer
def main():
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... training loop with DistributedSampler ...
dist.destroy_process_group()
if __name__ == "__main__":
main()
Launch with torchrun --nproc-per-node=8 train.py on a single node, or torchrun --nnodes=4 --nproc-per-node=8 --rdzv-backend=c10d --rdzv-endpoint=$MASTER:29500 train.py for multi-node. The torchrun docs describe elastic restart semantics; SageMaker and Azure ML wrap this with rendezvous services so you do not maintain the cluster yourself.
N.1.3 Fully Sharded Data Parallel (FSDP)
When the model no longer fits on one GPU, the cheapest upgrade is FSDP (FSDP2, the rewritten API, stable in PyTorch 2.4 (2024)). FSDP shards weights, gradients, and optimizer states across ranks; each rank gathers parameters for the current layer, computes, then frees them. Two strategies: FULL_SHARD (lowest memory) and HYBRID_SHARD (full shard within a node, replicate across nodes), the default for bandwidth-limited fabrics:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
import functools, torch
mp = MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
)
wrap_policy = functools.partial(
transformer_auto_wrap_policy,
transformer_layer_cls={LlamaDecoderLayer},
)
model = FSDP(
model,
sharding_strategy=ShardingStrategy.HYBRID_SHARD, # FULL_SHARD for single node
auto_wrap_policy=wrap_policy,
mixed_precision=mp,
device_id=torch.cuda.current_device(),
use_orig_params=True,
)
The auto_wrap_policy matters more than people realise: wrap at the transformer-block level for clean compute-communication overlap. Wrap too coarsely (whole model) and you defeat sharding; wrap too finely (every nn.Linear) and you flood NCCL with tiny collectives. The Hugging Face Accelerate FSDP guide (updated 2025) documents the wrap policies for the major model families.
N.1.4 DeepSpeed ZeRO: Three Stages and Two Offloads
ZeRO (Zero Redundancy Optimizer) from Microsoft predates FSDP and remains the standard when you need offload to CPU or NVMe. ZeRO-1 shards optimizer states; ZeRO-2 also shards gradients; ZeRO-3 also shards parameters (equivalent to FSDP FULL_SHARD). ZeRO-Infinity spills optimizer states and parameters to NVMe, letting a single node train a 100B+ model at the cost of slower steps. The DeepSpeed Ulysses work (March 2024) added sequence parallelism for long-context training, enabling million-token-context fine-tuning on commodity clusters. DeepSpeed remains the project of choice when you need to mix ZeRO with pipeline parallelism in the same run.
N.1.5 Pipeline and Tensor Parallelism: When Sharding is Not Enough
Beyond roughly 30B parameters, sharding alone runs out of room. Pipeline parallelism cuts the model into stages along its depth and streams micro-batches through them; the GPipe scheme and its 1F1B refinement are built into PyTorch as torch.distributed.pipelining (stable in PyTorch 2.5, late 2024). Tensor parallelism splits individual matrices across GPUs so one GeMM is computed by several ranks in parallel; the Megatron-LM patterns (NVIDIA's 2024 Megatron-Core release exposes the kernels as a library) split QKV and MLP projections across the tensor-parallel group. Tensor parallelism is communication-heavy and stays inside a node where NVLink is fast; pipeline parallelism is communication-light and spans nodes.
Frontier training composes all three axes into 3D parallelism (DP across replica groups, PP across nodes, TP within a node), used by NVIDIA, Meta (Llama-3 added a 4D context-parallel variant), and DeepSeek. Megatron-LM and NeMo are the production stacks; Hugging Face's nanotron (2024) is the readable reference implementation.
N.1.6 Decision Table: Which Strategy Fits
The table maps training scenarios to a recommended strategy. The fits-on-one-GPU column assumes 80 GB H100; halve for A100-40GB, quarter for a 24 GB consumer card.
| Scenario | Fits on one GPU? | Recommended Strategy | Launcher | Typical Throughput Cost vs DDP |
|---|---|---|---|---|
| Fine-tuning <= 7B with LoRA on 8x H100 | Yes (LoRA trainables) | DDP | torchrun | Baseline (1.0x) |
| Full fine-tune 7B-13B on 8x H100 | No (Adam state 140 GB) | FSDP FULL_SHARD or ZeRO-2 | torchrun / accelerate | 1.1x to 1.3x |
| Full fine-tune 30B-70B on 32 GPUs | No | FSDP HYBRID_SHARD or ZeRO-3 | accelerate / deepspeed | 1.3x to 1.6x |
| Pretrain 70B on 64-128 GPUs | No | ZeRO-3 + TP=8 or FSDP + TP | deepspeed / Megatron | 1.5x to 2.0x |
| Pretrain 175B+ frontier model | No | 3D parallelism (DP + PP + TP) | Megatron-LM / NeMo | 2.0x to 3.0x |
| Single-node fine-tune, model exceeds VRAM | No | ZeRO-3 + CPU/NVMe offload | deepspeed | 3.0x to 10x (slow but feasible) |
Every step down the table trades throughput for feasibility. The throughput-cost column assumes a well-tuned, modern fabric (H100 with NVLink, InfiniBand between nodes). On Ethernet or under-provisioned NVMe, ZeRO-3 with offload can be 20x slower than DDP, not 3x. Profile a single step before committing to a multi-day run; the PyTorch profiler with NCCL traces shows where the time is going.
N.1.7 Production Realities and 2024-2026 Tooling
Two trends matter today. FSDP has eaten most of DeepSpeed's mind-share for jobs that fit without offload; PyTorch 2.4's FSDP2 closed the configurability gap and the Hugging Face stack defaults to FSDP. DeepSpeed remains essential for NVMe offload, MoE expert parallelism, or composition with pipeline parallelism. Composable launchers (Accelerate, PyTorch Lightning) let you switch between DDP, FSDP, and DeepSpeed via config flags. On hardware, the 2025 GB200 NVL72 treats 72 GPUs as one coherent memory pool, collapsing the tensor-vs-pipeline distinction for sub-500B models. Track MosaicML LLM Foundry and Databricks' training stack as living references.
- Chapter 6 (Pretraining and Scaling Laws) for why model size and compute are coupled.
- Section 19.11 (Libraries & Frameworks) for logging multi-rank runs to W&B / MLflow without per-rank duplication.
- Section 10.7 (Libraries & Frameworks) for what happens to the checkpoint after training: vLLM, TGI, and tensor-parallel serving.
- Section 19.15 (Libraries & Frameworks) for a higher-level orchestration layer that wraps the strategies in this section.
- Start from the memory math: 20 bytes per parameter for Adam in bf16, plus activations. If the static state fits on one GPU, use DDP; if not, you must shard.
- FSDP (HYBRID_SHARD for multi-node, FULL_SHARD for single node) is the modern default for fine-tuning up to ~70B; it ships with PyTorch and is supported by Accelerate and Lightning.
- DeepSpeed ZeRO-3 remains the right tool when you need CPU/NVMe offload, MoE expert parallelism, or to compose with pipeline parallelism in a single run.
- Beyond 30B-70B, add pipeline parallelism (PyTorch
distributed.pipeliningor Megatron); beyond that, add tensor parallelism within each node. Frontier models combine all three into 3D parallelism. - Always profile one step before a multi-day run. Fabric topology and NVMe speed change the throughput numbers more than the choice of library.
What's Next?
In the next section, Section 19.15: Ray Train, Ray Serve, and Ray Data, we build on the material covered here.