"Megatron, DeepSpeed, FSDP, and FlashAttention walk into a config file; the model emerges three weeks later at 47 percent MFU and counts that as a win."
Tensor, Parallelism-Stack-Wiring AI Agent
Distributed training and scale-out inference are dominated by a stack of libraries that compose into a working system. At the foundation sit the parallelism implementations: Megatron-LM (tensor and pipeline parallel), DeepSpeed (ZeRO sharded data parallel), PyTorch FSDP (the in-tree successor to DeepSpeed ZeRO), and Colossal-AI (a kitchen-sink hybrid). A layer above sit the high-level training recipes that wrap those primitives: Hugging Face Accelerate, Axolotl, LLaMA-Factory, Levanter, torchtitan, TRL. A third layer covers the optimization kernels that determine whether you achieve 30 percent or 50 percent MFU: FlashAttention 2 / 3, xformers, bitsandbytes, NVIDIA Transformer Engine for FP8, MS-AMP, torch.compile. A fourth layer covers the communication primitives: NCCL, NVSHMEM, MSCCL, MPI, Gloo, UCX. A fifth covers orchestration: Ray Train, Skypilot, Composer, Flyte. And a sixth covers compilers: torch.compile / TorchDynamo / TorchInductor, TensorRT, OpenAI Triton. Most production training stacks use 8 to 15 of these libraries together; the choice of foundation parallelism library typically dictates the rest.
Distributed training in 2026 is mature in the sense that the basic primitives (data parallel, tensor parallel, pipeline parallel, ZeRO / FSDP, expert parallelism for MoE) are widely available and well-understood. It is immature in the sense that putting them together for a specific model on a specific cluster still requires substantial expertise: the difference between a naive multi-GPU run at 25 percent MFU and a tuned run at 50 percent MFU is roughly a doubling of cost, and the gap is hidden behind many small choices (which attention kernel, which dataloader, which gradient accumulation strategy, which microbatch size, which communication overlap pattern).
61.2.1 Foundation distributed training frameworks
The foundation layer implements the parallelism strategies that allow a model larger than a single GPU to train across many GPUs. Megatron-LM, DeepSpeed, FSDP, and Colossal-AI are the main options.
- Megatron-LM (NVIDIA, 2019; Megatron-Core 2023; ongoing) is NVIDIA's reference distributed training framework for very large language models, originating in the 2019 paper that introduced tensor parallelism. Its objective is to deliver the highest possible MFU on multi-thousand-GPU NVIDIA clusters via hand-tuned 3D parallelism (data, tensor, pipeline) plus expert parallelism for MoE, which matters because at frontier scale a few percent of MFU translates to millions of dollars per training run. The core concept is the Megatron-Core composable parallelism library (data, tensor, sequence, pipeline, context, expert parallel) plus the higher-level Megatron-LM training script. Pick Megatron-LM (or Megatron-Core directly) when you are training a 70B+ dense model from scratch on H100 / H200 / B200 capacity, when MFU is the dominant cost driver, and when your team can absorb the C++ / CUDA-heavy operational complexity; avoid for smaller fine-tuning where Accelerate or torchtitan deliver comparable performance with much less setup.
- Microsoft DeepSpeed (Microsoft, 2020; ZeRO-1/2/3 2020-2021; DeepSpeed-Inference 2022; DeepSpeed-Chat / DeepSpeed-MoE / DeepSpeed-Ulysses 2023-2024) is the framework that introduced ZeRO (Zero Redundancy Optimizer) and remains the canonical Microsoft answer to scaling training. Its objective is to make training models that do not fit in a single GPU's memory tractable on commodity GPU clusters by sharding optimizer state, gradients, and parameters across data-parallel ranks (ZeRO-1, -2, -3 respectively), which matters because before ZeRO, fitting a 13B+ model required tensor parallelism that many teams could not implement. The core concept is the ZeRO partitioning protocol that turns memory consumption from O(N) per rank into O(N/P) with P data-parallel ranks. Pick DeepSpeed when you want a battle-tested ZeRO implementation and the higher-level recipes (DeepSpeed-Chat for RLHF, DeepSpeed-MoE for sparse models); for new projects, PyTorch FSDP has narrowed the feature gap and is in-tree.
- PyTorch FSDP (Fully Sharded Data Parallel) and FSDP2 (Meta, 2022; FSDP2 in 2024) is PyTorch's in-tree fully-sharded data parallelism implementation, the direct successor to DeepSpeed ZeRO inside the PyTorch ecosystem. Its objective is to bring ZeRO-3-style optimizer / gradient / parameter sharding into PyTorch core so that scaling does not require a third-party dependency, which matters because in-tree code follows PyTorch's release cadence and integrates with torch.compile and other PyTorch features more cleanly. The 2024 FSDP2 redesign improved memory efficiency and composability with tensor parallel and pipeline parallel. Pick FSDP / FSDP2 for new projects on PyTorch when you want sharded data parallelism without an extra library; expect to pair with tensor parallelism (via PyTorch's TP API or Megatron-Core) for 70B+ models.
- PyTorch Distributed (DDP, FSDP, Tensor Parallel API) (Meta, 2018+) is the umbrella module that contains all the in-tree PyTorch distributed primitives: DistributedDataParallel (the classic data-parallel wrapper), FSDP / FSDP2, the DTensor / DeviceMesh abstractions for tensor parallel, and the c10d collective communication library. Its objective is to provide the in-tree building blocks for any distributed strategy without forcing a specific framework choice. The core concept is the DeviceMesh: a logical grid of devices over which a tensor can be sharded along multiple axes. Pick the in-tree primitives directly when you are building a custom training framework or when you specifically want to avoid Megatron / DeepSpeed; expect to write more code than you would with a higher-level library.
- Hugging Face Accelerate (Hugging Face, 2021) is the high-level launcher and abstraction layer that lets the same PyTorch training script run on a single GPU, multiple GPUs, multiple nodes, with DDP, FSDP, or DeepSpeed as the backend, without code changes. Its objective is to be the thinnest possible abstraction over the various PyTorch distributed backends, which matters because picking between DDP / FSDP / DeepSpeed should be a runtime configuration choice rather than a code rewrite. Pick Accelerate when you want to write portable distributed-training code that escapes the framework lock-in; for production frontier training where every percent of MFU matters, you often drop down to Megatron-Core or torchtitan.
- Colossal-AI (HPC-AI Tech, 2021) is an open-source distributed training framework that integrates many parallelism strategies (ZeRO-style sharding, tensor parallel, pipeline parallel, sequence parallel) into a single unified API. Its objective is to give a single library that can substitute for the combination of Megatron + DeepSpeed for teams who prefer one dependency over two. Pick Colossal-AI when you want one framework rather than composing Megatron and DeepSpeed; for the most cutting-edge features (FP8 training, latest MoE patterns), the underlying NVIDIA / Microsoft / Meta frameworks tend to lead.
61.2.2 High-level training recipes and fine-tuning libraries
Above the foundation distributed-training layer sit libraries that wrap the primitives into ready-to-run recipes for common workloads (pretraining a 7B from scratch, fine-tuning Llama-3 70B with LoRA, RLHF on a small base).
- torchtitan (Meta / PyTorch, 2024) is Meta's reference implementation of large-scale LLM pretraining using only in-tree PyTorch primitives (FSDP2, tensor parallel via DTensor, pipeline parallel, async checkpointing, float8 training via NVIDIA Transformer Engine integration). Its objective is to be the canonical "how Meta does it" reference for production-scale pretraining on PyTorch without third-party frameworks, which matters because it tracks PyTorch core development and demonstrates the canonical patterns. The core concept is a single training loop that composes 4D parallelism (data, tensor, pipeline, context) via DTensor / DeviceMesh. Pick torchtitan as the reference implementation for greenfield PyTorch-native pretraining; the codebase is intentionally readable as a teaching resource.
- Levanter (Stanford CRFM, 2023) is Stanford's open-source JAX-based pretraining framework, designed around named-tensor axis manipulation (via the Haliax library) so that the same training script runs identically on TPU pods and GPU clusters. Its objective is to be the readable reference JAX pretraining framework with TPU-first design, which matters when you are training on TPU pods (no Megatron-style framework dominates that ecosystem). Pick Levanter for JAX-based pretraining, especially on TPU; for PyTorch / CUDA, torchtitan or Megatron-Core dominate.
- Axolotl (Axolotl AI / OpenAccess AI Collective, 2023) is the most popular open-source fine-tuning framework, wrapping Hugging Face Transformers, Accelerate / DeepSpeed / FSDP, PEFT (LoRA / QLoRA), and TRL into a single YAML-configured pipeline. Its objective is to make "fine-tune Llama-3 70B with QLoRA on 8 H100s" a 30-minute setup rather than a week of plumbing, which matters because the fine-tuning long tail is most of the practical training work. Pick Axolotl as the production default for fine-tuning open-weight models; for pretraining-from-scratch, the foundation frameworks are better-suited.
- LLaMA-Factory (Hiyouga, 2023) is a competing open-source fine-tuning framework with a slightly more Asian-research-community user base, similar capabilities to Axolotl (covers SFT, DPO, PPO, ORPO, KTO across LoRA / QLoRA / full fine-tune), and a more polished web UI. Pick LLaMA-Factory when you want a GUI for fine-tuning and the Axolotl YAML idioms do not fit; for the same coverage with a YAML-first workflow, Axolotl is the canonical choice.
- Hugging Face TRL (Hugging Face, 2022) is the canonical library for transformer reinforcement learning: SFT, DPO, PPO, GRPO, ORPO, KTO, and the supporting infrastructure for reward models and preference datasets. Its objective is to provide a unified API for the full post-training stack (SFT through PPO / GRPO), which matters because every project that builds an RLHF / RLAIF pipeline needs these primitives. Pick TRL as the reference RL-for-LLMs library; Axolotl and LLaMA-Factory both delegate to it for the underlying algorithms.
- OpenRLHF (OpenRLHF, 2023) is an alternative RLHF framework built on Ray + vLLM + DeepSpeed, distinguished by its production-grade Ray-based architecture for the generation / training split that PPO and GRPO need. Pick OpenRLHF when you are running large-scale RLHF and TRL's single-process or multi-process model is not fast enough; for typical workflows, TRL is simpler.
61.2.3 Optimization kernels, attention, and mixed precision
The choice of attention kernel and quantization library determines whether your training run achieves 30 percent or 55 percent MFU on the same hardware. The 2026 stack is dominated by FlashAttention, xformers, bitsandbytes, NVIDIA Transformer Engine, and torch.compile.
- FlashAttention 2 and 3 (Tri Dao et al., 2022; FA2 in 2023; FA3 in 2024) is the IO-aware exact-attention kernel that became the default attention implementation in every major LLM training framework. Its objective is to compute softmax attention with O(N^2) FLOPs but O(N) memory by tiling the softmax computation and never materializing the full attention matrix, which matters because the memory bottleneck was the binding constraint for long-context training before FlashAttention. The 2024 FlashAttention 3 release added FP8 and asynchronous warp specialization for H100, pushing kernel utilization above 70 percent of peak. Pick FlashAttention 2 or 3 as the default attention kernel; almost every framework integrates it transparently.
- xformers (Meta, 2022) is Meta's library of efficient transformer building blocks, including memory-efficient attention (predating and partially independent of FlashAttention), sparse attention variants, and a collection of kernels used in Stable Diffusion and Meta's internal training. Its objective is to provide a broader set of attention variants than FlashAttention alone (sparse, blocksparse, ALiBi-aware, etc.), which matters when your model uses non-standard attention patterns. Pick xformers for attention variants beyond standard softmax (sparse, blocksparse); for vanilla causal attention, FlashAttention 3 is the leader.
- bitsandbytes (Tim Dettmers, 2022) is the library that introduced 8-bit and 4-bit quantization for LLMs and the QLoRA training method. Its objective is to make training and fine-tuning quantized models economically feasible (run a 70B QLoRA fine-tune on a single 80GB GPU), which matters because before bitsandbytes, fine-tuning large models required multi-node setups. The core concept is the 4-bit NormalFloat (NF4) quantization plus paged optimizers that swap to CPU when GPU memory is exhausted. Pick bitsandbytes for QLoRA fine-tuning of large open-weight models; for inference quantization, GPTQ / AWQ / EXL2 are more efficient (covered in Chapter 60).
- NVIDIA Transformer Engine (NVIDIA, 2022; H100 FP8 support; 2024 B200 / GB200 expansions) is NVIDIA's FP8 mixed-precision library that integrates with Megatron, PyTorch (via te.Linear and friends), and JAX. Its objective is to deliver the FP8 throughput advantage of H100 / H200 / B200 hardware (2x throughput versus BF16 on H100, more on B200) by managing the per-tensor scaling factors that FP8 training requires, which matters because FP8 training is now the default at frontier scale. Pick Transformer Engine when you are training on H100 / H200 / B200 and the framework supports it; expect 1.5x to 2x speedup with proper integration.
- MS-AMP (Microsoft, 2023) is Microsoft's FP8 mixed-precision training library, an alternative to Transformer Engine with a more aggressive "all-FP8" approach (including FP8 communication for some collectives). Pick MS-AMP when you specifically want Microsoft's FP8 patterns or when you are integrating with Microsoft's training stacks; for the median 2026 production case, Transformer Engine has more momentum.
- torch.compile (TorchDynamo + TorchInductor) (Meta, 2022; PyTorch 2.0 GA) is PyTorch's compiler stack, capturing Python-level eager-mode code into a graph (TorchDynamo) and lowering it through Inductor to optimized kernels (often Triton). Its objective is to deliver compiler-grade speedups for eager-mode PyTorch code without rewriting models, which matters because the alternative (TorchScript / TensorRT export) is brittle and requires substantial model surgery. Pick torch.compile for any new PyTorch training or inference workflow; the 2024-26 maturity is enough for production use.
- NVIDIA Apex (NVIDIA, 2018; mostly superseded by in-tree PyTorch) was the original NVIDIA library for mixed-precision training, fused optimizers, and distributed primitives. Most of its functionality has migrated into PyTorch core. Pick Apex only when a specific Apex kernel (fused Adam, fused LayerNorm) is faster than the PyTorch in-tree equivalent; for new code, prefer in-tree primitives.
The trick behind FlashAttention (Dao et al. 2022, Eq. 7-9) is that the softmax of a sequence can be computed one block at a time without ever materializing the full $N \times N$ score matrix. The standard softmax of $\{s_1, \ldots, s_N\}$ is $\operatorname{softmax}(s_i) = e^{s_i - m} / \ell$ with $m = \max_i s_i$ and $\ell = \sum_i e^{s_i - m}$. For a block decomposition $S = S_1 \mid S_2 \mid \ldots \mid S_K$ the running state after block $i$ is $(m_i, \ell_i, O_i)$, updated by the recurrence:
$$m_i = \max\big(m_{\text{i-1}},\ m^{\text{block}}_{S_i}\big)$$
$$\ell_i = e^{m_{\text{i-1}} - m_i}\,\ell_{i-1} + e^{m^{\text{block}}_{S_i} - m_i}\,\ell^{\text{block}}_{S_i}$$
$$O_i = e^{m_{\text{i-1}} - m_i}\,O_{\text{i-1}} + e^{m^{\text{block}}_{S_i} - m_i}\big(P^{\text{block}}_{S_i} \cdot V_{\text{S_i}}\big)$$
where $m^{\text{block}}_{S_i}$ and $\ell^{\text{block}}_{S_i}$ are the local max and local denominator for block $i$, $P^{\text{block}}_{S_i}$ is the local un-normalized softmax, and $O_i$ is the running output. After processing all $K$ blocks, $O = O_K / \ell_K$ is the exact softmax-weighted output. The blocks $Q_i, K_j, V_j$ live in fast SRAM (a few KB on H100) while $Q, K, V$ stay in HBM, so the IO traffic is $O(N \cdot d)$ rather than $O(N^2)$ and the attention matrix is never materialized: memory drops from $O(N^2)$ to $O(N)$ at the cost of slightly more FLOPs from rescaling. FlashAttention 3 adds FP8 and asynchronous warp specialization to push utilization past 70% of peak on H100. The same recurrence underpins every long-context kernel of 2023-2026 (FlashDecoding, Ring Attention, Tree Attention, the paged-KV-cache implementations).
If you want to spin up a serious training run with the minimum library count, the 2026 "thinnest viable stack" is four pieces: PyTorch FSDP2 (in-tree, no extra dependency, handles sharded data parallel and composes with TP/PP via DTensor); Megatron-LM or Megatron-Core (for the tensor-parallel kernels and the production-grade attention / FFN implementations); Weights & Biases (for run logging and the system-of-record); and Slurm (for gang-scheduling and queue management on the cluster). Optionally add FlashAttention 3 as a one-line drop-in for the attention kernel and NVIDIA Transformer Engine for FP8. This 4-to-6-library stack is what most academic-lab and frontier-lab teams converge on; the longer composition patterns of Section 61.2.7 are richer but the minimum is leaner than it appears at first read.
Show code
# Minimum viable distributed-training launch: FSDP2 + W&B + torchrun.
import torch, wandb
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.device_mesh import init_device_mesh
# 1. Build a device mesh and shard the model with FSDP2.
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
model = build_my_transformer().cuda()
fully_shard(model, mesh=mesh) # in-place sharding
# 2. Log to W&B as the run system-of-record.
if torch.distributed.get_rank() == 0:
wandb.init(project="llm-pretrain", config={"world_size": mesh.size()})
# 3. Launch with: torchrun --nproc_per_node=8 --nnodes=$SLURM_NNODES script.py
61.2.4 Communication libraries
Underneath every distributed training framework sits a collective communication library that implements all-reduce, all-gather, reduce-scatter, and point-to-point sends and receives. NCCL dominates for NVIDIA, with NVSHMEM, MSCCL, MPI, Gloo, and UCX as alternatives for specific patterns.
- NCCL (NVIDIA Collective Communications Library) (NVIDIA, 2015) is the dominant collective communication library for multi-GPU and multi-node training on NVIDIA hardware. Its objective is to deliver near-peak interconnect bandwidth for collectives (all-reduce, all-gather, reduce-scatter, etc.) across NVLink, InfiniBand, and Ethernet, with topology-aware ring and tree algorithms, which matters because the gradient all-reduce is the single largest off-chip data transfer in data-parallel training. Pick NCCL as the default (it is implicit in PyTorch, JAX, etc. on NVIDIA hardware); the alternatives only come up when NCCL specifically has a limitation.
- NVSHMEM (NVIDIA, 2017+) is NVIDIA's implementation of the OpenSHMEM PGAS (partitioned global address space) model for GPU clusters, supporting one-sided communication primitives. Its objective is to enable fine-grained one-sided RDMA-style communication that NCCL's collective model does not express well, which matters for kernels that need to pull a specific remote tensor without a full collective. Pick NVSHMEM for custom kernels that need one-sided communication; for standard collectives, NCCL is simpler.
- MSCCL (Microsoft Collective Communication Library) (Microsoft, 2021+) is Microsoft's library for synthesizing custom collective algorithms tuned for specific cluster topologies, which can significantly outperform NCCL on specific topologies (Microsoft documented up to 2x for certain all-reduce patterns on their Azure GB200 fabrics). Pick MSCCL when you are running on Azure NDv5+/GB200 clusters where Microsoft has tuned algorithms; for the median NVIDIA cluster, NCCL is sufficient.
- Gloo (Meta, 2017) is Meta's CPU-friendly collective communication library, used as the default backend for CPU-only distributed PyTorch and as a fallback when NCCL is unavailable. Pick Gloo when you specifically need CPU collectives (rare in LLM training); on GPU clusters, NCCL is the default.
- MPI (Open MPI / MPICH) (community, 1994+) is the venerable HPC message-passing standard, still the launcher of choice on many Slurm-based HPC clusters (mpirun / mpiexec). Its objective is portable parallel communication; in LLM training it is typically the launcher rather than the data-transport library (NCCL handles the collectives, MPI just spawns the processes). Pick MPI when your cluster's launcher is mpirun; for Kubernetes-native launches, torchrun and torch.distributed.run are the default.
- UCX (Unified Communication X) (Mellanox / NVIDIA, 2015+) is a unified communication framework that NCCL, MPI, and other libraries can use as a transport layer over InfiniBand, RoCE, or shared memory. Pick UCX when your stack specifically requires it (some HPC environments); end users rarely interact with it directly.
61.2.5 Orchestration and ML pipeline frameworks
Above the training-framework layer sit orchestrators that manage multi-stage ML pipelines (data prep, train, eval, deploy), multi-cluster compute (run on this cloud when capacity is cheaper), and multi-experiment scheduling.
- Ray Train (Anyscale, 2022+) is the Ray-based distributed training library, providing a unified API across PyTorch, JAX, and DeepSpeed for launching multi-node training on a Ray cluster. Its objective is to make multi-node training a few-line addition to an existing single-node script, with Ray handling the process placement and fault tolerance, which matters when Ray is already in your stack (for Ray Serve, Ray Data, Ray Tune). Pick Ray Train when you are committed to Ray as the orchestration framework; for pure single-job training without Ray, torchrun is simpler.
- SkyPilot (UC Berkeley, 2022) is the multi-cloud abstraction that lets you launch the same training job on any of AWS, GCP, Azure, OCI, Lambda, RunPod, or local, with automatic capacity hunting and spot-failover. Its objective is to give multi-cloud price-arbitrage and capacity-fallback to teams that do not want vendor lock-in, which matters when GPU availability is the binding constraint and your job can run on whichever cloud has capacity today. Pick SkyPilot when your team uses more than one cloud or when you need to chase capacity; for single-cloud teams, the native job submission is simpler.
- MosaicML Composer (MosaicML / Databricks, 2022) is MosaicML's training framework distinguished by a library of "speedup algorithms" (algorithmic optimizations like progressive resizing, selective backprop) that compose into the training loop. Its objective is to deliver wall-clock training speedups via algorithmic rather than purely systems optimizations, which matters when you have an existing PyTorch training script and want efficiency gains without rewriting the training loop in a more aggressive framework. Pick Composer when you want a higher-level Trainer abstraction with built-in algorithmic optimizations; for raw-systems performance, the foundation frameworks are leaner.
- Flyte (Lyft, 2019; LF AI & Data 2020+) is a Kubernetes-native workflow engine for data and ML pipelines, used by teams that need typed, versioned, parameterized pipelines for ML lifecycle management. Pick Flyte when your pipeline lifecycle is multi-stage and needs strong typing / versioning; for simpler container DAGs, Argo Workflows is leaner.
- Prefect (Prefect, 2018) and Dagster (Elementl, 2018) are Python-first workflow orchestrators, often used to wrap training pipelines plus the surrounding data engineering. Pick Prefect or Dagster when your pipeline crosses the train / data / serve boundary and you want a Python-native orchestrator; for pure training scheduling, Slurm or Kubernetes-native operators are more direct.
61.2.6 Compilers and runtime optimization
The compiler layer transforms model definitions into optimized kernels for specific hardware. The 2026 stack is dominated by torch.compile / TorchInductor, OpenAI Triton, and TensorRT for inference.
- OpenAI Triton (OpenAI, 2021) is a Python DSL for writing GPU kernels at a higher level than CUDA, used both directly (in FlashAttention, in TorchInductor codegen) and as a backend by torch.compile. Its objective is to make custom GPU kernel authoring tractable for ML researchers without requiring CUDA expertise, which matters because custom kernels are how you close the last 10 to 20 percent of MFU gap to peak. The core concept is the @triton.jit decorator and the BLOCK / PROGRAM_ID abstractions. Pick Triton when you need a custom kernel that no library provides; for the median workload, torch.compile generates Triton kernels automatically.
- NVIDIA TensorRT and TensorRT-LLM (NVIDIA, 2017; TensorRT-LLM 2023) is NVIDIA's inference compiler, with TensorRT-LLM specifically targeting transformer inference. Its objective is to deliver maximum inference throughput on NVIDIA hardware via kernel fusion, quantization (INT8 / FP8 / INT4), and runtime-specific optimizations (in-flight batching, paged KV cache), which matters because vLLM-class throughput is often bested by TensorRT-LLM at the cost of more setup work. Pick TensorRT-LLM for production inference where every percent of throughput matters and your team can absorb the integration effort; for development and prototyping, vLLM (Section 31.2) is simpler.
- XLA (Accelerated Linear Algebra) (Google / OpenXLA, 2017) is the JAX / TensorFlow ahead-of-time compiler, also used as a backend for PyTorch (PyTorch/XLA). Its objective is to lower high-level tensor programs to optimized kernels for TPUs and (increasingly) GPUs, which matters because XLA is the foundation of TPU programming. Pick XLA implicitly when you write JAX or use TPU; for CUDA-first PyTorch, torch.compile is the default.
- ONNX Runtime (Microsoft, 2018) is the cross-platform inference runtime built on the ONNX model exchange format. Its objective is to deliver production inference across CPU, GPU, mobile, and edge with one runtime, which matters for cross-platform deployment (covered in Chapter 60). For training, ONNX Runtime is rarely used; for cross-platform inference, it remains important.
61.2.7 Common stack composition patterns
Real production stacks compose several of these libraries together. The 2026 patterns:
- "NVIDIA reference pretraining": Megatron-LM / Megatron-Core + FlashAttention 3 + Transformer Engine FP8 + NCCL + Slurm-on-HyperPod-or-CoreWeave + Weights and Biases. This is what NVIDIA's published reference recipes look like and what most large dense pretraining runs use.
- "PyTorch-native pretraining": torchtitan + FSDP2 + Tensor Parallel (via DTensor) + FlashAttention 3 + Transformer Engine + torch.compile + NCCL + torchrun + W&B. The 2024-26 alternative to the Megatron stack, emerging as Meta's reference and increasingly competitive.
- "JAX / TPU pretraining": Levanter (or MaxText) + Haliax + Pallas / XLA + Pathways / Vertex + W&B. The dominant stack on TPU pods.
- "Open-source fine-tuning": Axolotl + Hugging Face Transformers / Accelerate + PEFT (LoRA / QLoRA via bitsandbytes) + FlashAttention 2 + TRL for DPO / PPO + DeepSpeed or FSDP backend + W&B. The default for fine-tuning open-weight models in 2026.
- "Production RLHF": OpenRLHF + Ray Train + vLLM (for fast rollouts) + DeepSpeed + W&B. The stack for serious RLHF / RLAIF runs.
- "Multi-cloud training": any of the above plus SkyPilot for capacity hunting and Modal for serverless burst capacity.
Model FLOPs Utilization (MFU) is the standard metric: (achieved FLOPs/second) divided by (theoretical peak FLOPs/second of the hardware), reported as a percentage. A well-tuned dense transformer pretraining run achieves 40 to 55 percent MFU on H100; a poorly-tuned run drops to 15 to 25 percent. The gap is closed almost entirely by library choices: FlashAttention versus naive attention is the biggest single lever (2 to 4x speedup), Transformer Engine FP8 versus BF16 is the second (1.5 to 2x on H100, more on B200), torch.compile is the third (10 to 30 percent), the right communication library and overlap pattern is the fourth (5 to 15 percent at multi-node scale). Track MFU on every training run and treat any drop below 40 percent on a frontier GPU as a bug to investigate.
61.2.8 Mapping the library stack
61.2.9 Library evaluation checklist
When picking a library for a new training project, the questions that surface real differences:
- What parallelism strategies does it support natively? DP only, FSDP, TP, PP, EP, CP, SP? At what model size do you outgrow each?
- What is the per-GPU MFU on a reference workload? The honest libraries publish reference numbers (Megatron, torchtitan, Levanter); the rest you have to measure.
- What hardware does it support? CUDA-only? CUDA plus AMD ROCm? TPU? Trainium? Apple Silicon (for development)?
- What is the checkpoint format? Compatible with the rest of your ecosystem (Hugging Face safetensors, DCP, etc.)? Resumable across different parallelism configurations?
- What is the integration story with the surrounding tooling? Does it integrate with W&B / MLflow out of the box? With LoRA / QLoRA? With FlashAttention? With Transformer Engine?
- What is the test and CI surface? Frontier-scale libraries are hard to test; check for nightly regression suites and the time-to-fix on critical issues.
- What is the maintainer model? Vendor-led (NVIDIA, Microsoft, Meta, Hugging Face) versus community-only? Single-maintainer libraries are convenient but fragile.
- What is the Python / PyTorch / CUDA version pinning policy? Many of these libraries lag PyTorch by a few months; check that your target versions are supported.
- What is the documentation depth? Production frameworks (Megatron, DeepSpeed, FSDP, torchtitan) ship reference configurations and design documents; thinly-documented libraries are a maintenance risk.
A 2025 research team was choosing between PyTorch FSDP2 and DeepSpeed ZeRO-3 for a 7B-to-13B-scale pretraining experiment on 32 H100 nodes. The team's evaluation: (a) FSDP2 had narrowed or closed the historical performance gap to DeepSpeed for their model size; (b) FSDP2 composed more cleanly with torch.compile, while DeepSpeed had known sharp edges with the PyTorch compiler stack; (c) FSDP2 was in-tree and tracked PyTorch nightlies, while DeepSpeed required pinning. The team picked FSDP2 and reported a 5 percent MFU win over their DeepSpeed baseline once torch.compile was enabled. The lesson generalizes: for new projects at small-to-mid scale, FSDP2 is the default; for very large dense models and for projects already invested in DeepSpeed-MoE / DeepSpeed-Ulysses, DeepSpeed retains specific advantages.
61.2.10 The FP8 and low-precision training shift
The biggest 2024-2026 shift in training libraries was the move from BF16 to FP8 mixed precision as the default on H100 / H200 / B200 hardware. The hardware support has been in place since H100 (2022), but the library and recipe maturity required for production FP8 training came in stages: NVIDIA Transformer Engine matured through 2023-2024, the FP8 reference recipes in Megatron-Core and torchtitan landed in 2024, and FP8 training is now the default for new frontier pretraining runs in 2026. The practical effect is roughly 1.7x to 2.2x training throughput on H100 versus BF16 with proper integration, and substantially more on B200 where the FP8 hardware is more advanced relative to BF16.
The library choices that matter for FP8 training: Transformer Engine (the reference, integrated into Megatron-Core and torchtitan), MS-AMP (Microsoft's alternative), and the in-tree PyTorch float8 module (the early-2024 experimental support that has been hardening through 2025-2026). For RTX 5090 / B200 / GB200 deployments, even more aggressive precision (FP6, FP4 for inference, MXFP4 for both) is becoming production-relevant; the library support is rapidly evolving and worth tracking on a quarterly cadence.
61.2.11 Stack composition by training stage
The training stage determines which subset of the library stack actually matters. The 2026 layered pattern:
- Pretraining (cold-start, 100B to 10T tokens): requires the foundation distributed-training framework (Megatron, torchtitan, Levanter), the full optimization-kernel stack (FlashAttention 3, Transformer Engine, torch.compile), the highest-bandwidth communication library (NCCL with tuned algorithms), and parallel storage (Lustre / Weka). The high-level recipe libraries (Axolotl, TRL) are absent; the work is too low-level for them. The training observability (W&B, MLflow) is critical because loss curves are the main diagnostic.
- Continued pretraining (domain adaptation, 10B to 200B tokens): similar stack to cold-start pretraining but with smaller scale, often a single training-pod (32 to 256 GPUs). Megatron or torchtitan still appropriate; the high-level recipe libraries become viable if your data is already prepared.
- Supervised fine-tuning (SFT, 10M to 10B tokens): high-level recipe libraries (Axolotl, LLaMA-Factory) are the default. The foundation frameworks become optional (DeepSpeed ZeRO-3 or FSDP backend through Accelerate is enough). FlashAttention and bitsandbytes (for QLoRA) are still relevant. Single-node 8-GPU is the common scale; multi-node only for full fine-tunes of 70B+ models.
- Preference tuning (DPO / IPO / KTO / ORPO, 100K to 10M examples): TRL is the reference library, wrapped by Axolotl or LLaMA-Factory. Single-node 8-GPU is typical. Throughput is rarely the bottleneck; data quality dominates.
- Reinforcement learning (PPO / GRPO / RLOO, ongoing rollouts): the most complex stack because rollouts (fast generation) and updates (gradient steps) have different optimal libraries. OpenRLHF, TRL with vLLM rollout backend, or DeepSpeed-Chat are the canonical setups. Ray Train often coordinates the rollout workers and the trainer.
- Distillation (variable, often a long SFT or RL run): the data-generation phase uses the inference stack (vLLM, SGLang), the training phase uses an SFT or preference-tuning stack. The distillation-specific concern is throughput of the teacher model; a serving-grade inference engine is essential.
61.2.12 JAX, Pytorch / XLA, and the parallel ecosystems
The PyTorch-dominant stack described above is the 2026 mainstream, but a parallel JAX ecosystem exists, particularly strong on TPUs and in Google-adjacent research. The JAX-side libraries:
- JAX (Google, 2018): the core JAX library providing functional / pure tensor programming with automatic differentiation, jit compilation, and parallelism transforms (pmap, pjit, shard_map).
- Flax and Haiku: the neural network libraries on top of JAX; Flax is the broader Google standard, Haiku is the DeepMind reference.
- Optax: the canonical JAX optimizer library, providing the Adam / AdamW / Lion / etc. optimizers as JAX-compatible transforms.
- MaxText (Google, 2024): Google's reference JAX-based LLM training framework, designed for TPU pods. Like torchtitan on the PyTorch side, MaxText is the canonical reference implementation.
- Haliax (Stanford CRFM): the named-tensor library underlying Levanter, providing axis-name-based tensor operations that compose cleanly with JAX's parallelism transforms.
The PyTorch / XLA bridge (PyTorch code compiled to XLA for TPU) exists but the experience is generally not as smooth as native JAX. Pick PyTorch when your team is CUDA-first and the broader ecosystem advantage matters; pick JAX when TPU is the primary hardware target or when functional programming idioms fit your team better.
What's Next?
In the next section, Section 61.3: Datasets and Benchmarks, we build on the material covered here.