Section 19.14: Distributed Training Deep Dive

Distributed Training Deep Dive

Big Picture: What this section is

The practitioner reference for choosing and launching a parallelism strategy. Conceptual treatment of why scale matters is in Chapter 6 (Pretraining and Scaling Laws); this section is the engineering counterpart: which torchrun, accelerate, or deepspeed command to run, what changes in the script, and what the memory math says about whether your job will fit.

The five strategies here (DDP Distributed Data Parallel, FSDP Fully Sharded Data Parallel, ZeRO-1/2/3 the Zero Redundancy Optimizer from DeepSpeed, plus pipeline and tensor parallelism) compose into the 3D parallelism used to pretrain frontier models. We walk the ladder from the simplest case (model fits on one GPU, scale by adding ranks) up to the hardest (a 70B+ model sliced along layers and tensor dimensions across hundreds of GPUs). At each rung we cover the memory math, the launcher command, and the failure modes you actually hit. Cross-references to Section 19.11 (Libraries & Frameworks) for run logging and Section 10.7 (Libraries & Frameworks) for what happens after training are flagged inline.

Prerequisites

This section assumes basic familiarity with single-GPU PyTorch training (forward pass, backward pass, optimizer step) from Section 0.3, the Adam optimizer (which stores two running statistics per parameter), mixed-precision training (bf16 = brain-float-16), and the transformer block structure from Chapter 3. NCCL (NVIDIA Collective Communications Library) is the GPU-to-GPU communication backend that the strategies below rely on for fast all-reduce and all-gather operations.

N.1.1 The Memory Math: Why You Need Parallelism at All

Before choosing a strategy you need to know whether your job fits. For an LLM in mixed precision (bf16 weights, fp32 optimizer state), Adam needs roughly 20 bytes per parameter: 2 for the bf16 weight, 4 for the fp32 master, 4 for the gradient, 8 for the two Adam moments. A 7B model therefore needs ~140 GB of static optimizer-state memory before any activation; an 80 GB H100 cannot hold it without sharding. Activations are the second variable: a 7B Llama at seqlen 4096, batch 4 per GPU adds ~30-40 GB on top. Activation checkpointing (recompute on backward) trades ~30% extra FLOPs for a 4-8x activation-memory reduction and is on by default in modern scripts.

Key Insight

The right question is not "do I have enough GPUs" but "what is my static memory floor and what is my activation budget". DDP replicates the static memory on every GPU; FSDP and ZeRO-3 shard it. If your model's static state exceeds one GPU's VRAM, DDP is not an option no matter how many GPUs you have.

N.1.2 Distributed Data Parallel (DDP): The Simplest Case

DDP is the default when the model fits on a single GPU and you want to scale throughput by processing more data per step. Each rank holds a full replica, processes a different mini-batch, and synchronizes gradients via an all-reduce after the backward pass. The PyTorch implementation in torch.distributed uses NCCL on NVIDIA hardware and is the baseline against which every other strategy is measured. The script reads its rank from environment variables and wraps the model in DistributedDataParallel:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
    ).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    # ... training loop with DistributedSampler ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Code Fragment 19.14.1: DDP is the default when the model fits on a single GPU and you want to scale throughput by processing more data.

Launch with torchrun --nproc-per-node=8 train.py on a single node, or torchrun --nnodes=4 --nproc-per-node=8 --rdzv-backend=c10d --rdzv-endpoint=$MASTER:29500 train.py for multi-node. The torchrun docs describe elastic restart semantics; SageMaker and Azure ML wrap this with rendezvous services so you do not maintain the cluster yourself.

N.1.3 Fully Sharded Data Parallel (FSDP)

When the model no longer fits on one GPU, the cheapest upgrade is FSDP (FSDP2, the rewritten API, stable in PyTorch 2.4 (2024)). FSDP shards weights, gradients, and optimizer states across ranks; each rank gathers parameters for the current layer, computes, then frees them. Two strategies: FULL_SHARD (lowest memory) and HYBRID_SHARD (full shard within a node, replicate across nodes), the default for bandwidth-limited fabrics:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
import functools, torch

mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # FULL_SHARD for single node
    auto_wrap_policy=wrap_policy,
    mixed_precision=mp,
    device_id=torch.cuda.current_device(),
    use_orig_params=True,
)

Code Fragment 19.14.2: When the model no longer fits on one GPU, the cheapest upgrade is FSDP (FSDP2, the rewritten API.

The auto_wrap_policy matters more than people realise: wrap at the transformer-block level for clean compute-communication overlap. Wrap too coarsely (whole model) and you defeat sharding; wrap too finely (every nn.Linear) and you flood NCCL with tiny collectives. The Hugging Face Accelerate FSDP guide (updated 2025) documents the wrap policies for the major model families.

N.1.4 DeepSpeed ZeRO: Three Stages and Two Offloads

ZeRO (Zero Redundancy Optimizer) from Microsoft predates FSDP and remains the standard when you need offload to CPU or NVMe. ZeRO-1 shards optimizer states; ZeRO-2 also shards gradients; ZeRO-3 also shards parameters (equivalent to FSDP FULL_SHARD). ZeRO-Infinity spills optimizer states and parameters to NVMe, letting a single node train a 100B+ model at the cost of slower steps. The DeepSpeed Ulysses work (March 2024) added sequence parallelism for long-context training, enabling million-token-context fine-tuning on commodity clusters. DeepSpeed remains the project of choice when you need to mix ZeRO with pipeline parallelism in the same run.

N.1.5 Pipeline and Tensor Parallelism: When Sharding is Not Enough

Beyond roughly 30B parameters, sharding alone runs out of room. Pipeline parallelism cuts the model into stages along its depth and streams micro-batches through them; the GPipe scheme and its 1F1B refinement are built into PyTorch as torch.distributed.pipelining (stable in PyTorch 2.5, late 2024). Tensor parallelism splits individual matrices across GPUs so one GeMM is computed by several ranks in parallel; the Megatron-LM patterns (NVIDIA's 2024 Megatron-Core release exposes the kernels as a library) split QKV and MLP projections across the tensor-parallel group. Tensor parallelism is communication-heavy and stays inside a node where NVLink is fast; pipeline parallelism is communication-light and spans nodes.

Frontier training composes all three axes into 3D parallelism (DP across replica groups, PP across nodes, TP within a node), used by NVIDIA, Meta (Llama-3 added a 4D context-parallel variant), and DeepSeek. Megatron-LM and NeMo are the production stacks; Hugging Face's nanotron (2024) is the readable reference implementation.

N.1.6 Decision Table: Which Strategy Fits

The table maps training scenarios to a recommended strategy. The fits-on-one-GPU column assumes 80 GB H100; halve for A100-40GB, quarter for a 24 GB consumer card.

**Table 19.14.1:** *The table maps training scenarios to a recommended strategy.*
Scenario	Fits on one GPU?	Recommended Strategy	Launcher	Typical Throughput Cost vs DDP
Fine-tuning <= 7B with LoRA on 8x H100	Yes (LoRA trainables)	DDP	`torchrun`	Baseline (1.0x)
Full fine-tune 7B-13B on 8x H100	No (Adam state 140 GB)	FSDP FULL_SHARD or ZeRO-2	`torchrun` / `accelerate`	1.1x to 1.3x
Full fine-tune 30B-70B on 32 GPUs	No	FSDP HYBRID_SHARD or ZeRO-3	`accelerate` / `deepspeed`	1.3x to 1.6x
Pretrain 70B on 64-128 GPUs	No	ZeRO-3 + TP=8 or FSDP + TP	`deepspeed` / Megatron	1.5x to 2.0x
Pretrain 175B+ frontier model	No	3D parallelism (DP + PP + TP)	Megatron-LM / NeMo	2.0x to 3.0x
Single-node fine-tune, model exceeds VRAM	No	ZeRO-3 + CPU/NVMe offload	`deepspeed`	3.0x to 10x (slow but feasible)

Warning: Throughput vs Feasibility

Every step down the table trades throughput for feasibility. The throughput-cost column assumes a well-tuned, modern fabric (H100 with NVLink, InfiniBand between nodes). On Ethernet or under-provisioned NVMe, ZeRO-3 with offload can be 20x slower than DDP, not 3x. Profile a single step before committing to a multi-day run; the PyTorch profiler with NCCL traces shows where the time is going.

N.1.7 Production Realities and 2024-2026 Tooling

Two trends matter today. FSDP has eaten most of DeepSpeed's mind-share for jobs that fit without offload; PyTorch 2.4's FSDP2 closed the configurability gap and the Hugging Face stack defaults to FSDP. DeepSpeed remains essential for NVMe offload, MoE expert parallelism, or composition with pipeline parallelism. Composable launchers (Accelerate, PyTorch Lightning) let you switch between DDP, FSDP, and DeepSpeed via config flags. On hardware, the 2025 GB200 NVL72 treats 72 GPUs as one coherent memory pool, collapsing the tensor-vs-pipeline distinction for sub-500B models. Track MosaicML LLM Foundry and Databricks' training stack as living references.

What's Next?

In the next section, Section 19.15: Ray Train, Ray Serve, and Ray Data, we build on the material covered here.

Further Reading

Distributed Training

Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC 2020. arXiv:1910.02054. The original ZeRO paper that underlies DeepSpeed and FSDP.

Zhao, Y., et al. (2023). "PyTorch FSDP." VLDB 2023. arXiv:2304.11277. The FSDP paper.

Shoeybi, M., et al. (2019). "Megatron-LM." arXiv:1909.08053. Reference for tensor parallelism.

Huang, Y., Cheng, Y., Bapna, A., et al. (2019). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS 2019. arXiv:1811.06965. Reference for pipeline parallelism.