Compute Planning & Infrastructure

Sizing infrastructure for the workload you'll actually run.

Chapter opener illustration: Compute Planning & Infrastructure.

"The cluster is the budget; everything else is interpretation."

Scale, Capacity-Planning AI Agent

Looking Back

Part XI covered governance; Part XII covers the physical system you have to govern. This chapter starts at the top with compute planning: GPU/TPU choice, cluster sizing, interconnect, power and cooling, the build-vs-rent decision, and the FLOPs accounting that turns scaling laws into a procurement spec.
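As a rough illustration of that FLOPs accounting, the sketch below uses the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6ND) and converts the total into GPU-hours. The model size, the 20-tokens-per-parameter budget, the per-GPU peak throughput, and the 40% utilization figure are all illustrative assumptions, not recommendations; substitute datasheet and measured values before using the result in a spec.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def gpu_hours(total_flops: float, peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-hours needed when each GPU sustains `mfu` of its peak throughput."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    sustained_flops_per_sec = peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_sec / 3600.0

# --- Illustrative assumptions (replace with your own numbers) ---
N_PARAMS = 7e9                 # assumed model size: 7B parameters
N_TOKENS = 20 * N_PARAMS       # assumed Chinchilla-style budget, ~20 tokens/param
PEAK_FLOPS = 989e12            # assumed dense BF16 peak per GPU; check the vendor datasheet
MFU = 0.40                     # assumed model FLOPs utilization; measure on your stack
CLUSTER_GPUS = 256             # hypothetical cluster size

total = training_flops(N_PARAMS, N_TOKENS)
hours = gpu_hours(total, PEAK_FLOPS, MFU)
wall_clock_days = hours / CLUSTER_GPUS / 24.0

print(f"Total compute: {total:.3e} FLOPs")
print(f"GPU-hours:     {hours:,.0f}")
print(f"Wall clock:    {wall_clock_days:.2f} days on {CLUSTER_GPUS} GPUs")
```

The estimate covers a single uninterrupted training run only. Restarts, evaluation, ablations, and failed experiments typically multiply it, which is why the utilization and overhead assumptions deserve as much scrutiny as the scaling law itself.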

Big Picture

Once the MVP works, you need to know what it costs to serve at the volume you actually expect. This chapter covers compute planning and infrastructure: sizing GPUs, picking an inference stack, integrating with enterprise systems, and avoiding the two failure modes: over-provisioning, which pays for idle capacity, and starvation, which leaves too little capacity to meet demand.
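A back-of-the-envelope version of that serving calculation might look like the following sketch. It assumes you already have a measured per-replica token throughput at your latency target; the function names, the 70% utilization target, and the 1.5x slack threshold are hypothetical choices for illustration.

```python
import math

def replicas_needed(peak_rps: float,
                    output_tokens_per_request: float,
                    tokens_per_sec_per_replica: float,
                    target_utilization: float = 0.7) -> int:
    """Replicas required to serve peak demand while leaving headroom."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    demand_tokens_per_sec = peak_rps * output_tokens_per_request
    usable_per_replica = tokens_per_sec_per_replica * target_utilization
    return math.ceil(demand_tokens_per_sec / usable_per_replica)

def provisioning_status(deployed: int, needed: int, slack: float = 1.5) -> str:
    """Classify a deployment against the two failure modes."""
    if deployed < needed:
        return "starved"
    if deployed > needed * slack:
        return "over-provisioned"
    return "ok"

# Hypothetical workload: 40 requests/s at peak, 500 output tokens each,
# and a replica that sustains 2,500 tokens/s at the latency target.
needed = replicas_needed(peak_rps=40,
                         output_tokens_per_request=500,
                         tokens_per_sec_per_replica=2500)
print(needed, provisioning_status(deployed=14, needed=needed))
```

The per-replica throughput is the input most likely to be wrong: it depends on batch size, sequence lengths, and the latency target, so it should come from a load test on the actual model and stack rather than from a vendor headline number.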

The GPU sizing anchors are NVIDIA's H100, H200, and Blackwell (GB200, announced 2024, shipping 2025), with enterprise alternatives in AWS Trainium2 (GA Dec 2024) and Google TPU v5p (Dec 2023). On the inference stack, vLLM 0.6 (Sept 2024) and NVIDIA TensorRT-LLM are the two open-source servers this chapter treats as the reference points for 2026 production deployments.

Chapter Overview

Compute planning is a sizing exercise, not a procurement form. This chapter walks the full planning workflow: sizing infrastructure for the workload you will actually run (SLO-driven, not vendor-driven), enterprise integration patterns (identity, audit, networking, data protection), GPU procurement strategy and spot-vs-reserved economics, and the cross-hardware performance benchmarking that keeps you portable.
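The spot-vs-reserved question largely reduces to a break-even utilization, which the sketch below illustrates. All prices are placeholder values, not quotes from any provider, and the `rework_overhead` parameter is a hypothetical way to account for work repeated after spot interruptions.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def breakeven_utilization(reserved_hourly: float, flexible_hourly: float) -> float:
    """Utilization above which a reserved commitment beats pay-per-use."""
    return reserved_hourly / flexible_hourly

def monthly_cost_reserved(gpus: int, reserved_hourly: float) -> float:
    """Reserved capacity is billed for every hour, used or not."""
    return gpus * reserved_hourly * HOURS_PER_MONTH

def monthly_cost_flexible(gpus: int, flexible_hourly: float,
                          utilization: float, rework_overhead: float = 0.0) -> float:
    """Pay-per-use (on-demand or spot) is billed only for hours consumed.

    rework_overhead inflates consumed hours to cover work repeated
    after interruptions (an assumed modeling choice).
    """
    used_hours = HOURS_PER_MONTH * utilization * (1.0 + rework_overhead)
    return gpus * flexible_hourly * used_hours

# Placeholder prices in dollars per GPU-hour (hypothetical).
RESERVED = 2.00
ON_DEMAND = 4.00

threshold = breakeven_utilization(RESERVED, ON_DEMAND)
reserved_cost = monthly_cost_reserved(8, RESERVED)
light_use_cost = monthly_cost_flexible(8, ON_DEMAND, utilization=0.3)
heavy_use_cost = monthly_cost_flexible(8, ON_DEMAND, utilization=0.8)

print(f"Break-even utilization: {threshold:.0%}")
print(f"Reserved: ${reserved_cost:,.0f}  light use: ${light_use_cost:,.0f}  "
      f"heavy use: ${heavy_use_cost:,.0f}")
```

Under these placeholder prices, a fleet that is busy less than half the time is cheaper on pay-per-use capacity, and a fleet that is busy more than half the time is cheaper reserved. Real decisions also need to weigh commitment length and capacity availability, which this sketch leaves out.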

Compute decisions made in the first month of a program shape the budget for the rest of it. This chapter is the practitioner's planning syllabus: the questions to ask, the numbers to run, and the decisions to push back against.

Note: Learning Objectives

Sections in This Chapter

Prerequisites

What Comes Next

The next section, Section 57.1: LLM Compute Planning & Infrastructure, takes up the chapter's first concrete topic. After this chapter, continue to Chapter 69: Scaling Economics: Unit Costs & ROI.

Further Reading

Scaling Laws & Compute Sizing

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., et al. (2020). "Scaling Laws for Neural Language Models." arXiv preprint. arXiv:2001.08361. The original scaling-law paper; defines the parameter-data-compute tradeoff curve every compute plan starts from.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556. The Chinchilla paper that revised Kaplan's coefficients toward roughly 20 tokens per parameter, the modern compute-budget baseline.

Serving and Inference Performance

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys. arXiv:2211.05102. Google's analysis of latency and throughput frontiers for transformer inference; the basis for the SLO/cost roofline used in 57.1.
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP. arXiv:2309.06180. The vLLM paper; standard reference for KV-cache memory planning that drives modern GPU sizing.

Cross-Hardware Benchmarking

MLCommons. (2024). MLPerf Inference v4.1 Benchmark Results. MLCommons MLPerf. The cross-vendor accelerator benchmark used for portability evaluation in 57.4; canonical reference for H100/H200/MI300X/TPUv5 comparisons.