
"The cluster is the budget; everything else is interpretation."
Scale, Capacity-Planning AI Agent
Part XI covered governance; Part XII covers the physical system you have to govern. This chapter starts at the top with compute planning: GPU/TPU choice, cluster sizing, interconnect, power and cooling, the build-vs-rent decision, and the FLOPs accounting that turns scaling laws into a procurement spec.
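As a rough illustration of that FLOPs accounting, the sketch below uses the widely quoted approximation that training a dense transformer costs about 6 × N × D FLOPs (N parameters, D training tokens) and converts it into GPU-hours and a GPU count. The function names, the round peak-throughput figure, and the 40% utilization are illustrative assumptions, not vendor specifications or numbers from this chapter.

```python
import math


def training_gpu_hours(params: float, tokens: float,
                       peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimate GPU-hours with the common C ~= 6 * N * D approximation.

    params             -- model parameter count (N)
    tokens             -- training tokens (D)
    peak_flops_per_gpu -- peak FLOP/s of one accelerator (assumed value)
    mfu                -- model FLOPs utilization actually sustained, 0..1
    """
    total_flops = 6.0 * params * tokens
    sustained_flops_per_s = peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 3600.0


def gpus_for_deadline(gpu_hours: float, days: float) -> int:
    """Smallest GPU count that finishes the run within `days` of wall-clock time."""
    return math.ceil(gpu_hours / (days * 24.0))


# Hypothetical inputs: 7B parameters, 1T tokens, a round 1e15 FLOP/s peak, 40% MFU.
hours = training_gpu_hours(params=7e9, tokens=1e12,
                           peak_flops_per_gpu=1e15, mfu=0.4)
gpus = gpus_for_deadline(hours, days=30)
print(f"{hours:,.0f} GPU-hours, {gpus} GPUs for a 30-day run")
```

With these assumed inputs the estimate comes to roughly 29,000 GPU-hours, or 41 GPUs for a 30-day window. The utilization term usually dominates the uncertainty, so it is worth measuring on your own stack before the figure goes into a procurement spec.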
Once the MVP works, you need to know what it costs to serve at the volume you actually expect. This chapter covers compute planning and infrastructure: sizing GPUs, picking an inference stack, integrating with enterprise systems, and avoiding the two failure modes (over-provisioning the cluster and starving it).
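A minimal sketch of that serving-side sizing, assuming you have already measured how many output tokens per second one model replica sustains while meeting your latency SLO. The `Workload` fields, the 70% utilization target, and the 1.5× slack threshold are hypothetical choices for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    peak_requests_per_s: float   # expected peak request rate
    avg_output_tokens: float     # average generated tokens per request


def replicas_needed(w: Workload, tokens_per_s_per_replica: float,
                    target_utilization: float = 0.7) -> int:
    """Replicas required to serve peak token demand with some headroom."""
    demand_tokens_per_s = w.peak_requests_per_s * w.avg_output_tokens
    usable_tokens_per_s = tokens_per_s_per_replica * target_utilization
    return max(1, math.ceil(demand_tokens_per_s / usable_tokens_per_s))


def classify(provisioned: int, needed: int, slack: float = 1.5) -> str:
    """Label a fleet size against the two failure modes."""
    if provisioned < needed:
        return "starved"
    if provisioned > needed * slack:
        return "over-provisioned"
    return "ok"


# Hypothetical workload: 20 requests/s at peak, 300 output tokens each,
# and a measured 2,000 tokens/s per replica at the target latency.
w = Workload(peak_requests_per_s=20, avg_output_tokens=300)
needed = replicas_needed(w, tokens_per_s_per_replica=2000)
for fleet in (4, 5, 8):
    print(fleet, classify(fleet, needed))
```

The per-replica throughput is the number to get right: it depends on the model, the inference stack, batch size, and the latency target, so it should come from a benchmark of your own workload rather than a datasheet.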
The GPU sizing anchors are NVIDIA's H100, H200, and Blackwell (GB200, announced 2024, shipping 2025), with enterprise alternatives in AWS Trainium2 (GA Dec 2024) and Google TPU v5p (Dec 2023). On the inference stack, vLLM 0.6 (Sept 2024) and NVIDIA TensorRT-LLM are the two canonical open-source servers powering most 2026 production deployments.
Chapter Overview
Compute planning is a sizing exercise, not a procurement form. This chapter walks the full planning workflow: sizing infrastructure for the workload you will actually run (SLO-driven, not vendor-driven), enterprise integration patterns (identity, audit, networking, data-protection), GPU procurement strategy and spot-vs-reserved economics, and the cross-hardware performance benchmarking that keeps you portable.
Compute decisions made in the first month of a program shape the budget for the rest of it. This chapter is the practitioner's planning syllabus: the questions to ask, the numbers to run, and the decisions to push back against.
- Size LLM infrastructure from an SLO, model size, and traffic profile.
- Apply enterprise integration patterns (identity, audit, networking, data-protection) to an LLM deployment.
- Design a GPU procurement strategy that balances spot, reserved, and on-demand economics.
- Benchmark LLM performance across hardware platforms and maintain cross-vendor portability.
- Evaluate compute trade-offs across managed APIs, dedicated GPU clouds, and on-premises clusters.
Sections in This Chapter
- 57.1 LLM Compute Planning & Infrastructure (Intermediate). Compute planning starts before any GPU is rented: it is a sizing exercise driven by your serving SLO, model size, and traffic.
- 57.2 Enterprise Integration Patterns for LLM Systems (Intermediate). Enterprise integration is where LLM systems meet identity, audit, networking, and data-protection rules that have nothing to do with the model itself.
- 57.3 GPU Procurement Strategy and Spot-Reserved Economics (Advanced). Sections 57.1 and 57.2 told you what hardware you need and how it integrates with the rest of the enterprise.
- 57.4 LLM Performance Benchmarking and Cross-Hardware Portability (Advanced). Previous sections in this chapter covered what to measure for model quality.
What Comes Next
In the next section, Section 57.1: LLM Compute Planning & Infrastructure, we get to work on the chapter's first concrete topic. After this chapter you continue to Chapter 69: Scaling Economics: Unit Costs & ROI.