Section 61.2

Libraries and Frameworks

"Megatron, DeepSpeed, FSDP, and FlashAttention walk into a config file; the model emerges three weeks later at 47 percent MFU and counts that as a win."

TensorTensor, Parallelism-Stack-Wiring AI Agent
Big Picture

Distributed training and scale-out inference are dominated by a stack of libraries that compose into a working system. At the foundation sit the parallelism implementations: Megatron-LM (tensor and pipeline parallel), DeepSpeed (ZeRO sharded data parallel), PyTorch FSDP (the in-tree successor to DeepSpeed ZeRO), and Colossal-AI (a kitchen-sink hybrid). A layer above sit the high-level training recipes that wrap those primitives: Hugging Face Accelerate, Axolotl, LLaMA-Factory, Levanter, torchtitan, TRL. A third layer covers the optimization kernels that determine whether you achieve 30 percent or 50 percent MFU: FlashAttention 2 / 3, xformers, bitsandbytes, NVIDIA Transformer Engine for FP8, MS-AMP, torch.compile. A fourth layer covers the communication primitives: NCCL, NVSHMEM, MSCCL, MPI, Gloo, UCX. A fifth covers orchestration: Ray Train, Skypilot, Composer, Flyte. And a sixth covers compilers: torch.compile / TorchDynamo / TorchInductor, TensorRT, OpenAI Triton. Most production training stacks use 8 to 15 of these libraries together; the choice of foundation parallelism library typically dictates the rest.

Distributed training in 2026 is mature in the sense that the basic primitives (data parallel, tensor parallel, pipeline parallel, ZeRO / FSDP, expert parallelism for MoE) are widely available and well-understood. It is immature in the sense that putting them together for a specific model on a specific cluster still requires substantial expertise: the difference between a naive multi-GPU run at 25 percent MFU and a tuned run at 50 percent MFU is roughly a doubling of cost, and the gap is hidden behind many small choices (which attention kernel, which dataloader, which gradient accumulation strategy, which microbatch size, which communication overlap pattern).

61.2.1 Foundation distributed training frameworks

The foundation layer implements the parallelism strategies that allow a model larger than a single GPU to train across many GPUs. Megatron-LM, DeepSpeed, FSDP, and Colossal-AI are the main options.

61.2.2 High-level training recipes and fine-tuning libraries

Above the foundation distributed-training layer sit libraries that wrap the primitives into ready-to-run recipes for common workloads (pretraining a 7B from scratch, fine-tuning Llama-3 70B with LoRA, RLHF on a small base).

61.2.3 Optimization kernels, attention, and mixed precision

The choice of attention kernel and quantization library determines whether your training run achieves 30 percent or 55 percent MFU on the same hardware. The 2026 stack is dominated by FlashAttention, xformers, bitsandbytes, NVIDIA Transformer Engine, and torch.compile.

Algorithm 61.2.1: Algorithm: the FlashAttention online-softmax recurrence

The trick behind FlashAttention (Dao et al. 2022, Eq. 7-9) is that the softmax of a sequence can be computed one block at a time without ever materializing the full $N \times N$ score matrix. The standard softmax of $\{s_1, \ldots, s_N\}$ is $\operatorname{softmax}(s_i) = e^{s_i - m} / \ell$ with $m = \max_i s_i$ and $\ell = \sum_i e^{s_i - m}$. For a block decomposition $S = S_1 \mid S_2 \mid \ldots \mid S_K$ the running state after block $i$ is $(m_i, \ell_i, O_i)$, updated by the recurrence:

$$m_i = \max\big(m_{\text{i-1}},\ m^{\text{block}}_{S_i}\big)$$

$$\ell_i = e^{m_{\text{i-1}} - m_i}\,\ell_{i-1} + e^{m^{\text{block}}_{S_i} - m_i}\,\ell^{\text{block}}_{S_i}$$

$$O_i = e^{m_{\text{i-1}} - m_i}\,O_{\text{i-1}} + e^{m^{\text{block}}_{S_i} - m_i}\big(P^{\text{block}}_{S_i} \cdot V_{\text{S_i}}\big)$$

where $m^{\text{block}}_{S_i}$ and $\ell^{\text{block}}_{S_i}$ are the local max and local denominator for block $i$, $P^{\text{block}}_{S_i}$ is the local un-normalized softmax, and $O_i$ is the running output. After processing all $K$ blocks, $O = O_K / \ell_K$ is the exact softmax-weighted output. The blocks $Q_i, K_j, V_j$ live in fast SRAM (a few KB on H100) while $Q, K, V$ stay in HBM, so the IO traffic is $O(N \cdot d)$ rather than $O(N^2)$ and the attention matrix is never materialized: memory drops from $O(N^2)$ to $O(N)$ at the cost of slightly more FLOPs from rescaling. FlashAttention 3 adds FP8 and asynchronous warp specialization to push utilization past 70% of peak on H100. The same recurrence underpins every long-context kernel of 2023-2026 (FlashDecoding, Ring Attention, Tree Attention, the paged-KV-cache implementations).

Library Shortcut: the thinnest viable training stack in 2026

If you want to spin up a serious training run with the minimum library count, the 2026 "thinnest viable stack" is four pieces: PyTorch FSDP2 (in-tree, no extra dependency, handles sharded data parallel and composes with TP/PP via DTensor); Megatron-LM or Megatron-Core (for the tensor-parallel kernels and the production-grade attention / FFN implementations); Weights & Biases (for run logging and the system-of-record); and Slurm (for gang-scheduling and queue management on the cluster). Optionally add FlashAttention 3 as a one-line drop-in for the attention kernel and NVIDIA Transformer Engine for FP8. This 4-to-6-library stack is what most academic-lab and frontier-lab teams converge on; the longer composition patterns of Section 61.2.7 are richer but the minimum is leaner than it appears at first read.

Show code
# Minimum viable distributed-training launch: FSDP2 + W&B + torchrun.
import torch, wandb
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.device_mesh import init_device_mesh

# 1. Build a device mesh and shard the model with FSDP2.
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
model = build_my_transformer().cuda()
fully_shard(model, mesh=mesh)  # in-place sharding

# 2. Log to W&B as the run system-of-record.
if torch.distributed.get_rank() == 0:
    wandb.init(project="llm-pretrain", config={"world_size": mesh.size()})

# 3. Launch with: torchrun --nproc_per_node=8 --nnodes=$SLURM_NNODES script.py
Code Fragment 61.2.1: Minimum viable distributed-training launch: FSDP2 + W&B + torchrun.

61.2.4 Communication libraries

Underneath every distributed training framework sits a collective communication library that implements all-reduce, all-gather, reduce-scatter, and point-to-point sends and receives. NCCL dominates for NVIDIA, with NVSHMEM, MSCCL, MPI, Gloo, and UCX as alternatives for specific patterns.

61.2.5 Orchestration and ML pipeline frameworks

Above the training-framework layer sit orchestrators that manage multi-stage ML pipelines (data prep, train, eval, deploy), multi-cluster compute (run on this cloud when capacity is cheaper), and multi-experiment scheduling.

61.2.6 Compilers and runtime optimization

The compiler layer transforms model definitions into optimized kernels for specific hardware. The 2026 stack is dominated by torch.compile / TorchInductor, OpenAI Triton, and TensorRT for inference.

61.2.7 Common stack composition patterns

Real production stacks compose several of these libraries together. The 2026 patterns:

Key Insight
MFU is the right single number; the libraries are how you get there

Model FLOPs Utilization (MFU) is the standard metric: (achieved FLOPs/second) divided by (theoretical peak FLOPs/second of the hardware), reported as a percentage. A well-tuned dense transformer pretraining run achieves 40 to 55 percent MFU on H100; a poorly-tuned run drops to 15 to 25 percent. The gap is closed almost entirely by library choices: FlashAttention versus naive attention is the biggest single lever (2 to 4x speedup), Transformer Engine FP8 versus BF16 is the second (1.5 to 2x on H100, more on B200), torch.compile is the third (10 to 30 percent), the right communication library and overlap pattern is the fourth (5 to 15 percent at multi-node scale). Track MFU on every training run and treat any drop below 40 percent on a frontier GPU as a bug to investigate.

61.2.8 Mapping the library stack

LLM training library stack as of 2026 layered from high-level recipes through foundation distributed-training frameworks down to attention, fused-op, and communication primitives
Figure 61.2.1a: The 2026 LLM training library stack: core frameworks (PyTorch, JAX), distributed engines (FSDP, DeepSpeed, Megatron), and orchestration layers that push MFU toward its theoretical ceiling.

61.2.9 Library evaluation checklist

When picking a library for a new training project, the questions that surface real differences:

Real-World Scenario: Choosing FSDP versus DeepSpeed in 2026

A 2025 research team was choosing between PyTorch FSDP2 and DeepSpeed ZeRO-3 for a 7B-to-13B-scale pretraining experiment on 32 H100 nodes. The team's evaluation: (a) FSDP2 had narrowed or closed the historical performance gap to DeepSpeed for their model size; (b) FSDP2 composed more cleanly with torch.compile, while DeepSpeed had known sharp edges with the PyTorch compiler stack; (c) FSDP2 was in-tree and tracked PyTorch nightlies, while DeepSpeed required pinning. The team picked FSDP2 and reported a 5 percent MFU win over their DeepSpeed baseline once torch.compile was enabled. The lesson generalizes: for new projects at small-to-mid scale, FSDP2 is the default; for very large dense models and for projects already invested in DeepSpeed-MoE / DeepSpeed-Ulysses, DeepSpeed retains specific advantages.

61.2.10 The FP8 and low-precision training shift

The biggest 2024-2026 shift in training libraries was the move from BF16 to FP8 mixed precision as the default on H100 / H200 / B200 hardware. The hardware support has been in place since H100 (2022), but the library and recipe maturity required for production FP8 training came in stages: NVIDIA Transformer Engine matured through 2023-2024, the FP8 reference recipes in Megatron-Core and torchtitan landed in 2024, and FP8 training is now the default for new frontier pretraining runs in 2026. The practical effect is roughly 1.7x to 2.2x training throughput on H100 versus BF16 with proper integration, and substantially more on B200 where the FP8 hardware is more advanced relative to BF16.

The library choices that matter for FP8 training: Transformer Engine (the reference, integrated into Megatron-Core and torchtitan), MS-AMP (Microsoft's alternative), and the in-tree PyTorch float8 module (the early-2024 experimental support that has been hardening through 2025-2026). For RTX 5090 / B200 / GB200 deployments, even more aggressive precision (FP6, FP4 for inference, MXFP4 for both) is becoming production-relevant; the library support is rapidly evolving and worth tracking on a quarterly cadence.

61.2.11 Stack composition by training stage

The training stage determines which subset of the library stack actually matters. The 2026 layered pattern:

61.2.12 JAX, Pytorch / XLA, and the parallel ecosystems

The PyTorch-dominant stack described above is the 2026 mainstream, but a parallel JAX ecosystem exists, particularly strong on TPUs and in Google-adjacent research. The JAX-side libraries:

The PyTorch / XLA bridge (PyTorch code compiled to XLA for TPU) exists but the experience is generally not as smooth as native JAX. Pick PyTorch when your team is CUDA-first and the broader ecosystem advantage matters; pick JAX when TPU is the primary hardware target or when functional programming idioms fit your team better.

What's Next?

In the next section, Section 61.3: Datasets and Benchmarks, we build on the material covered here.

Further Reading
Shoeybi, M. et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv preprint arXiv:1909.08053. arxiv.org/abs/1909.08053. The original Megatron paper introducing tensor parallelism; foundational reading for distributed LLM training.
Rajbhandari, S. et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20. arxiv.org/abs/1910.02054. The DeepSpeed ZeRO paper; the canonical reference for ZeRO-1/2/3 partitioning that defines fully-sharded data parallel today.
Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arxiv.org/abs/2205.14135. Original Flash Attention paper; the canonical reference for IO-aware attention kernels that anchor the modern training library stack.
Dao, T. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." arXiv preprint arXiv:2407.08608. arxiv.org/abs/2407.08608. The 2024 Flash Attention 3 paper extending to FP8 and H100-specific asynchrony; the current reference for state-of-the-art attention.
Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. arxiv.org/abs/2305.14314. The QLoRA paper introducing 4-bit NormalFloat and paged optimizers, the foundation of bitsandbytes-based fine-tuning.
Meta PyTorch (2024). "torchtitan: A native PyTorch library for large model training." arXiv preprint arXiv:2410.06511. arxiv.org/abs/2410.06511. The reference paper for torchtitan, the PyTorch-native pretraining framework demonstrating 4D parallelism with in-tree primitives.