Chapter 61: Scale Tools of the Trade

Chapter opener illustration: Scale Tools of the Trade.

"Choose tools whose bill of materials you can explain to your CFO."
Scale, Stack-Building AI Agent

Looking Back

Chapters 57 through 60 built the systems story. This chapter inventories the tools that operate them: NeMo, Megatron, DeepSpeed, FSDP, JAX, vLLM, TensorRT-LLM, llama.cpp, and the orchestration glue (Slurm, Kubernetes) that keeps a large cluster productive.

Big Picture

The systems-at-scale ecosystem spans orchestration platforms (Slurm, Ray, Kubeflow), distributed training frameworks (Megatron, DeepSpeed, FSDP, Colossal-AI), profiling tools (Nsight, PyTorch Profiler), and the cluster-management layer that ties them together. This chapter is a practical reference to that ecosystem.

Chapter Overview

Part XII covered compute planning, frontier hardware, distributed training, and edge deployment. This chapter consolidates the scale toolchain:

Hyperscaler clouds (AWS SageMaker HyperPod, GCP Vertex / TPU, Azure ML, OCI)
Specialized GPU clouds (CoreWeave, Lambda, RunPod, Modal, Together, Fly.io, Cloudflare AI)
HPC schedulers (Slurm, Volcano)
Distributed-training libraries (Megatron-LM, DeepSpeed, PyTorch FSDP2, Colossal-AI, Accelerate)
High-level recipes (torchtitan, Levanter, Axolotl, LLaMA-Factory, TRL)
Pretraining corpora (FineWeb, RedPajama-v2, The Pile, C4, The Stack v2)
Open-weight bases and MoE checkpoints
Systems venues (SC, ISC, MLSys, OSDI, SOSP, NSDI) that anchor the literature

This chapter is an index of what to use once your model no longer fits on a single GPU, as the landscape stands in 2026.
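To make "no longer fits on a single GPU" concrete, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam, which needs roughly 16 bytes of model state per parameter (2 for half-precision weights, 2 for half-precision gradients, 12 for the FP32 master weights and two optimizer moments). It ignores activations, buffers, and fragmentation, and assumes the state is sharded evenly across devices, so the result is a lower bound rather than a sizing recommendation. The function names and the 80 GiB device capacity are illustrative assumptions, not part of any library.

```python
import math

# Assumption: mixed-precision Adam, about 16 bytes of model state per parameter.
# 2 (half-precision weights) + 2 (half-precision grads) + 12 (FP32 master copy, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_gib(num_params: float) -> float:
    """Model-state memory in GiB (weights, gradients, optimizer state; no activations)."""
    return num_params * BYTES_PER_PARAM / 2**30


def min_gpus_for_state(num_params: float, gpu_mem_gib: float = 80.0) -> int:
    """Lower bound on device count, assuming the state is sharded evenly (hypothetical helper)."""
    return math.ceil(training_state_gib(num_params) / gpu_mem_gib)


if __name__ == "__main__":
    for n in (1e9, 7e9, 70e9):
        print(f"{n / 1e9:.0f}B params: {training_state_gib(n):.1f} GiB state, "
              f"at least {min_gpus_for_state(n)} x 80 GiB devices")
```

Under these assumptions a 7B-parameter model already carries about 104 GiB of training state, more than one 80 GiB device holds, which is the point at which the sharding libraries listed above become necessary rather than optional.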

Learning Objectives

Choose between hyperscaler clouds and specialized GPU clouds for a given workload and budget.
Configure HPC schedulers (Slurm, Volcano) for a multi-node training run.
Compare distributed-training libraries (Megatron-LM, DeepSpeed, PyTorch FSDP2, Colossal-AI) for a target model.
Apply high-level recipes (torchtitan, Levanter, Axolotl, LLaMA-Factory, TRL) to a production training pipeline.
Select pretraining and alignment corpora across FineWeb, RedPajama-v2, The Stack v2, UltraFeedback, and OASST.
Track the systems venues and engineering blogs that maintain the scale canon.

Prerequisites

Distributed training from Chapter 59
Inference optimization from Chapter 9
Comfort with at least one container or workload-manager system

What's Next?

This chapter begins with Section 61.1: Platforms. Each section builds on the previous one, so we recommend reading them in order.