Scale Tools of the Trade

Chapter opener illustration: Scale Tools of the Trade.

"Choose tools whose bill of materials you can explain to your CFO."

Scale, Stack-Building AI Agent

Looking Back

Chapters 57 through 60 built the systems story. This chapter inventories the tools that operate them: NeMo, Megatron, DeepSpeed, FSDP, JAX, vLLM, TensorRT-LLM, llama.cpp, and the orchestration glue (Slurm, Kubernetes) that keeps a large cluster productive.

Big Picture

The systems-at-scale ecosystem spans orchestration platforms (Slurm, Ray, Kubeflow), distributed training frameworks (Megatron, DeepSpeed, FSDP, Colossal-AI), profiling tools (Nsight, PyTorch Profiler), and the cluster-management layer that ties them together. This chapter is the practical reference for that ecosystem.

Chapter Overview

Part XII covered compute planning, frontier hardware, distributed training, and edge deployment. This chapter consolidates the scale toolchain: hyperscaler clouds (AWS SageMaker HyperPod, GCP Vertex / TPU, Azure ML, OCI), specialized GPU clouds (CoreWeave, Lambda, RunPod, Modal, Together, Fly.io, Cloudflare AI), HPC schedulers (Slurm, Volcano), the distributed-training libraries (Megatron-LM, DeepSpeed, PyTorch FSDP2, Colossal-AI, Accelerate), the high-level recipes (torchtitan, Levanter, Axolotl, LLaMA-Factory, TRL), the pretraining corpora (FineWeb, RedPajama-v2, The Pile, C4, The Stack v2), the open-weight bases and MoE checkpoints, and the systems venues (SC, ISC, MLSys, OSDI, SOSP, NSDI) that anchor the literature.

Read this chapter as an index of what to reach for once your model no longer fits on a single GPU. It reflects the tooling landscape as of 2026.
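To make "no longer fits on a single GPU" concrete, here is a minimal back-of-the-envelope sketch. It assumes mixed-precision training with Adam, where model state costs roughly 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision master weights and the two optimizer moments). The function names and the 80 GiB device size are illustrative assumptions, and activations, buffers, and fragmentation are deliberately ignored, so real requirements are higher.

```python
GIB = 2**30  # bytes in one gibibyte

# Assumed cost of model state for mixed-precision Adam training:
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights, momentum, variance).
BYTES_PER_PARAM = 16


def model_state_gib(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimate memory for weights, gradients, and optimizer state in GiB."""
    return num_params * bytes_per_param / GIB


def fits_on_one_gpu(num_params, gpu_mem_gib, bytes_per_param=BYTES_PER_PARAM):
    """Return True if model state alone fits in the given GPU memory."""
    return model_state_gib(num_params, bytes_per_param) <= gpu_mem_gib


if __name__ == "__main__":
    gpu_mem_gib = 80  # hypothetical single-device capacity
    for num_params in (1e9, 7e9, 70e9):
        need = model_state_gib(num_params)
        verdict = "fits" if fits_on_one_gpu(num_params, gpu_mem_gib) else "does not fit"
        print(f"{num_params / 1e9:.0f}B params: ~{need:.0f} GiB of model state, {verdict}")
```

Under these assumptions a model in the single-digit billions of parameters already exceeds one 80 GiB device during training, which is the point at which the sharding and parallelism libraries inventoried in this chapter become necessary.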

Note: Learning Objectives

Sections in This Chapter

Prerequisites

What's Next?

This chapter begins with Section 61.1: Platforms. Each section builds on the previous one, so we recommend reading them in order.