GPU Procurement Strategy and Spot-Reserved Economics

Section 57.3

"Cost optimization is the disciplined refusal to spend money on capacity you do not need yet."

Werner Vogels, Amazon CTO, AWS re:Invent keynote, 2022

Big Picture

Sections 57.1 and 57.2 told you what hardware you need and how it integrates with the rest of the enterprise. This section is the procurement layer that sits underneath: where do you actually rent those GPUs, what does the price-per-flop landscape look like across the specialist providers and the hyperscalers, and how do you stitch together spot, reserved, and on-demand capacity into a portfolio that does not blow up the budget. The right answer in 2026 is almost never "all on-demand at one provider"; the right answer is a tiered procurement strategy that matches workload shape to contract shape. For LLM teams, the workload-to-contract mapping is sharper than in classical ML: fault-tolerant batch jobs (embedding refresh, evaluation suites, nightly fine-tuning) belong on spot, and real-time inference SLAs belong on reserved or on-demand. Getting this mapping wrong inflates what is typically the single largest line item in an LLM company's monthly burn.

Prerequisites

This section assumes familiarity with compute sizing from Section 57.1 and with enterprise integration patterns from Section 57.2. Familiarity with model-rotation strategy from Section 44.6 helps when reasoning about multi-provider tradeoffs.

The 2024-2025 GPU shortage taught one durable lesson: capacity at the right price is a procurement problem first and an engineering problem second. The team that locked in Lambda Labs reservations in early 2024 paid roughly half what their competitors paid on-demand for the same H100 hours that summer. The team that built a workload that could run on spot capacity from RunPod or vast.ai cut their training bill by another 60-70%. This section walks through the procurement landscape, the spot-instance economics that make it interesting, and the reserved-capacity playbook that protects against the next scarcity cycle.

57.3.1 The four procurement tiers

Fun Fact

Spot GPU prices on the major hyperscalers can drop by 70% during a public holiday and spike by 200% when a new frontier model launches and every researcher wants to run benchmarks. Teams that run their training jobs on spot capacity learn to read the AI news cycle the way commodity traders read weather reports.

The 2026 GPU market sorts into four procurement tiers, each with different price-per-hour and different operational guarantees. The tiers compound: most production deployments use three of the four in parallel, with workload-aware routing between them.

Key Insight
Specialist clouds priced 50-70% below hyperscalers for the same silicon

The pricing gap between Lambda Labs / CoreWeave / Modal and AWS / GCP / Azure for an identical H100 SXM5 is structural, not transient. Hyperscalers price for the bundle (network, IAM, S3 integration, support); specialists price for the GPU itself and assume you bring your own integration. If your workload is "PyTorch in a container talking to S3", the specialist is the right venue and saves 50-70%. If your workload is "deeply integrated with twelve other AWS services", the bundle premium is worth paying. Artificial Analysis tracks the spread monthly.

H100 SXM5 hourly prices across the four procurement tiers
Figure 57.3.1: The four GPU procurement tiers for H100 SXM5 silicon in mid-2026, with the structural ~7x price compression between AWS p5 on-demand at the high end ($8-$12) and the RunPod / vast.ai spot market at the low end ($1.20-$1.80). The specialist clouds (Lambda Labs, CoreWeave) sit 50-70% below the hyperscalers for identical hardware because they price the GPU and not the bundle. The recommended portfolio that the section describes pins 60-80% of the workload at the reserved tier and routes only the fault-tolerant top 5-10% to spot. The dashed midpoints on each bar are the typical contracted prices the worked example references.
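
The spreads quoted in the caption can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the price ranges stated in this section and assumes the arithmetic midpoint of each range is a fair stand-in for the typical price, which may not match the dashed midpoints in the figure. The names `TIER_PRICES`, `saving`, and `price_ratio` are hypothetical.

```python
# H100 SXM5 price ranges in $/GPU-hour, as quoted in this section (mid-2026).
TIER_PRICES = {
    "hyperscaler_on_demand": (8.00, 12.00),  # AWS p5 on-demand
    "specialist_on_demand": (3.00, 4.00),    # CoreWeave on-demand
    "spot": (1.20, 1.80),                    # RunPod / vast.ai spot
}


def midpoint(price_range):
    """Arithmetic midpoint of a (low, high) range (assumed typical price)."""
    low, high = price_range
    return (low + high) / 2


def saving(baseline_tier, cheaper_tier):
    """Fractional saving of cheaper_tier relative to baseline_tier, at midpoints."""
    return 1 - midpoint(TIER_PRICES[cheaper_tier]) / midpoint(TIER_PRICES[baseline_tier])


def price_ratio(high_tier, low_tier):
    """How many times more expensive high_tier is than low_tier, at midpoints."""
    return midpoint(TIER_PRICES[high_tier]) / midpoint(TIER_PRICES[low_tier])


specialist_saving = saving("hyperscaler_on_demand", "specialist_on_demand")
spot_ratio = price_ratio("hyperscaler_on_demand", "spot")
print(f"Specialist vs hyperscaler on-demand: {specialist_saving:.0%} cheaper")
print(f"Hyperscaler on-demand vs spot: {spot_ratio:.1f}x")
```

At midpoints the specialist saving comes out at about 65%, inside the 50-70% band, and the hyperscaler-to-spot ratio at about 6.7x, consistent with the "~7x" in the caption.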

57.3.2 Spot-instance economics for LLM workloads

Spot capacity (sometimes "preemptible" or "interruptible") is GPU time that the provider can reclaim at short notice (2-5 minute warning, sometimes immediate). The discount versus on-demand is steep, typically 60-80%. The tradeoff is eviction risk: any job that cannot resume cleanly from a checkpoint will lose work when the spot pool tightens.
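
To see how eviction risk eats into the headline discount, a rough first-order model helps. The sketch below assumes each preemption wastes, on average, half a checkpoint interval of recomputed work plus a fixed restart overhead. The model, the function name `effective_spot_rate`, and every number in the example are illustrative assumptions, not measurements from any provider.

```python
def effective_spot_rate(spot_rate, preemptions_per_hour,
                        checkpoint_interval_h, restart_overhead_h):
    """Approximate spot $/hr per hour of useful work.

    First-order model (an assumption): each preemption wastes half a
    checkpoint interval of recomputed work plus a fixed restart overhead.
    """
    wasted_h_per_useful_h = preemptions_per_hour * (
        checkpoint_interval_h / 2 + restart_overhead_h
    )
    return spot_rate * (1 + wasted_h_per_useful_h)


# Hypothetical job: $1.50/hr spot, one preemption every 10 hours,
# checkpoints every 30 minutes, 6 minutes to restart and reload.
rate = effective_spot_rate(1.50, 0.1, 0.5, 0.1)
on_demand_rate = 3.50  # hypothetical specialist on-demand price
print(f"Effective spot rate: ${rate:.4f}/hr")
print(f"Saving vs on-demand: {1 - rate / on_demand_rate:.0%}")
```

Under these assumed numbers the overhead is only a few percent, so the discount survives almost intact. Raising the checkpoint interval or the preemption rate shows how quickly that changes for jobs that checkpoint rarely.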

LLM workloads split sharply on spot suitability. The fault-tolerant ones (batch inference, embedding generation, evaluation, hyperparameter sweeps) lose almost nothing to preemption if you wire up checkpoint-resume correctly, so they capture nearly the full spot discount. The fault-intolerant ones (real-time inference SLAs, single-run training that cannot be restarted) need on-demand or reserved.
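
A minimal checkpoint-resume loop for a fault-tolerant batch job might look like the sketch below. It assumes the provider signals preemption by sending SIGTERM to the process (some use a metadata endpoint or give no warning at all), and it checkpoints after every item for simplicity. All function names are hypothetical.

```python
import json
import os
import signal


def install_sigterm_flag():
    """Return a callable that reports whether SIGTERM has been received.

    Assumption: preemption arrives as SIGTERM; adapt for other providers.
    """
    state = {"stop": False}

    def handler(signum, frame):
        state["stop"] = True

    signal.signal(signal.SIGTERM, handler)
    return lambda: state["stop"]


def save_checkpoint(path, next_index):
    """Write progress atomically so a mid-write eviction cannot corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the index of the first unprocessed item (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def run_batch(items, checkpoint_path, process, stop_requested=lambda: False):
    """Process items in order, checkpointing after each one.

    Returns the index to resume from; equals len(items) when finished.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if stop_requested():
            return i
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return len(items)
```

In a real job the call would look something like `run_batch(docs, "embed.ckpt.json", embed_one, install_sigterm_flag())`, where `docs` and `embed_one` are placeholders for your own data and work function. The checkpoint file should live on storage that outlives the instance. Rerunning the same command on a fresh spot node picks up where the evicted one stopped.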

57.3.3 The reserved-capacity playbook

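Reserved-capacity contracts are the right tool for the roughly 70% of your workload that has a predictable floor. The 2024-2025 GPU shortage produced a generation of teams that learned the playbook the hard way: commit to a baseline that you are confident you will use, then layer spot and on-demand on top for the variable portion. The structure that works in 2026 has three layers: a reserved floor on 30-90-day terms, on-demand capacity for predictable bursts, and spot for fault-tolerant batch work. The scenario below shows it in practice.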

Real-World Scenario: A 2025-launched RAG-heavy startup

A retrieval-heavy B2B startup operating in mid-2025 served roughly 12M user queries/month, each costing ~$0.0008 in inference. Their procurement portfolio: 8 reserved H100s on CoreWeave for the steady-state baseline (~$2.40/hr each on a 90-day commit, total ~$14K/month), bursting to 4 additional on-demand H100s during US-business-hours peaks (~$3.50/hr, used 6 hours/day on average), and routing nightly embedding-refresh jobs (~80M docs/month) to RunPod spot at $1.30/hr. Total monthly GPU cost: ~$22K against ~$95K MRR, so compute consumed roughly 23% of revenue and left a healthy gross margin. The equivalent setup on AWS p5 on-demand would have cost ~$70K/month, flipping the unit economics. In savings, the procurement strategy was worth roughly one senior engineer's salary.
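
The scenario's arithmetic can be reproduced as a quick sanity check. The sketch below assumes 730 hours per month and bursts on all 30 days of the month. The scenario does not state the spot hours, so they are backed out from the ~$22K total rather than taken from the text.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12
DAYS_PER_MONTH = 30    # assumption: bursts happen every day of the month

# Reserved floor: 8 H100s at $2.40/hr, running around the clock.
reserved_monthly = 8 * 2.40 * HOURS_PER_MONTH

# On-demand burst: 4 H100s at $3.50/hr, 6 hours per day.
on_demand_monthly = 4 * 3.50 * 6 * DAYS_PER_MONTH

# Spot: implied by the stated ~$22K total (not given directly in the scenario).
total_monthly = 22_000
spot_monthly = total_monthly - reserved_monthly - on_demand_monthly
implied_spot_gpu_hours = spot_monthly / 1.30

# Compute as a share of monthly recurring revenue.
mrr = 95_000
compute_share = total_monthly / mrr
aws_share = 70_000 / mrr  # the all-AWS on-demand alternative

print(f"Reserved:  ${reserved_monthly:,.0f}/month")
print(f"On-demand: ${on_demand_monthly:,.0f}/month")
print(f"Spot (implied): ${spot_monthly:,.0f}/month, "
      f"about {implied_spot_gpu_hours:,.0f} GPU-hours")
print(f"Compute share of MRR: {compute_share:.0%} (vs {aws_share:.0%} on AWS on-demand)")
```

Under these assumptions the reserved floor accounts for about $14K, the burst tier for about $2.5K, and the remainder falls to spot. Compute comes to about 23% of revenue, against roughly three quarters of revenue on the all-on-demand hyperscaler alternative.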

Warning
Reserved-capacity contracts can outlive your forecast

The most expensive procurement mistake in the 2024-2025 cycle was the team that signed a one-year H100 reservation in mid-2024 at "discounted" rates, then saw on-demand pricing drop 35% by Q1 2025 as supply caught up. They were locked in above market for nine months. The fix: limit your reservation horizon to 90 days unless you are confident pricing is going up, not down. Specialist providers are increasingly willing to write 30-day-renewable terms; use them. Multi-year hyperscaler contracts only make sense if you have firm multi-year revenue visibility, which almost no early-stage AI product does.
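
The cost of being locked in above market is easy to estimate. In the sketch below, only the 35% market drop and the nine-month lock-in come from the text; the $4.00 starting price, the 20% reservation discount, and the 8-GPU fleet are hypothetical values chosen for illustration, as is the function name.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def lock_in_overpayment(on_demand_at_signing, reservation_discount,
                        market_drop, gpus, months_above_market):
    """Dollars paid above market while a reservation is under water.

    Rates are $/GPU-hour; the discount and the drop are fractions of the
    on-demand price at signing. Returns 0 if the reservation stays at or
    below the new market rate.
    """
    reserved_rate = on_demand_at_signing * (1 - reservation_discount)
    market_rate = on_demand_at_signing * (1 - market_drop)
    premium = max(0.0, reserved_rate - market_rate)
    return premium * gpus * HOURS_PER_MONTH * months_above_market


# Hypothetical: $4.00 on-demand at signing, 20% reservation discount,
# then a 35% market drop with nine months still to run on 8 GPUs.
overpaid = lock_in_overpayment(4.00, 0.20, 0.35, gpus=8, months_above_market=9)
print(f"Paid above market: ${overpaid:,.0f}")
```

The same function shows the break-even condition: a reservation only stays below market if its discount is at least as large as the subsequent price drop, which is why shorter terms are the safer default when supply is expanding.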

57.3.4 Comparing the procurement venues

Table 57.3.1a: GPU procurement venues, mid-2026 (H100 SXM5 reference).

| Venue | Tier | $/hr (H100) | Term | Best for |
|---|---|---|---|---|
| AWS p5.48xlarge (on-demand) | Hyperscaler | $8-$12 | Hourly | Tight cloud integration |
| AWS p5 (1-yr reserved) | Hyperscaler reserved | $4-$5 | 1 year | Steady-state w/ AWS-dependent stack |
| CoreWeave on-demand | Specialist | $3-$4 | Hourly | Training + production inference |
| CoreWeave reserved (30-day) | Specialist reserved | $2.20-$2.80 | 30 days | Predictable baseline capacity |
| Lambda Labs On-Demand Cloud | Specialist | $2.49 | Hourly | Pure GPU training / batch |
| Lambda Reserved Cloud (3 mo) | Specialist reserved | $1.85-$2.20 | 3 months | Quarterly training cycles |
| Modal (serverless) | Specialist serverless | $4.56 (active only) | Per-second | Bursty inference, zero idle |
| Together AI (batch) | Specialist | $2-$3 equiv. | Per-token | Open-weight serving + batch |
| RunPod Secure Cloud | Specialist | $2.79 | Hourly | Containerized batch + inference |
| RunPod Community Spot | Spot | $1.20-$1.80 | Hourly (preemptible) | Fault-tolerant batch |
| vast.ai marketplace | Spot | $1.50-$2.50 | Hourly (variable) | Lowest-cost batch with checkpointing |
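
One way to turn the venue list into a routing rule is sketched below. It applies the workload-to-contract mapping from this section (fault-tolerant work to spot, the steady-state floor to reserved, everything else to on-demand) and then picks the cheapest venue of that contract type by range midpoint. Choosing on price alone is a deliberate simplification that ignores integration, capacity, and support. The `Workload` fields and function names are hypothetical, and only a subset of the venues is included.

```python
from dataclasses import dataclass

# Subset of the venues above: (venue, contract type, low $/hr, high $/hr).
VENUES = [
    ("AWS p5.48xlarge (on-demand)", "on_demand", 8.00, 12.00),
    ("CoreWeave on-demand", "on_demand", 3.00, 4.00),
    ("Lambda Labs On-Demand Cloud", "on_demand", 2.49, 2.49),
    ("CoreWeave reserved (30-day)", "reserved", 2.20, 2.80),
    ("Lambda Reserved Cloud (3 mo)", "reserved", 1.85, 2.20),
    ("RunPod Community Spot", "spot", 1.20, 1.80),
    ("vast.ai marketplace", "spot", 1.50, 2.50),
]


@dataclass
class Workload:
    name: str
    fault_tolerant: bool  # can resume from a checkpoint after preemption
    steady_state: bool    # part of the predictable baseline


def contract_for(workload):
    """Map workload shape to contract shape."""
    if workload.fault_tolerant:
        return "spot"
    if workload.steady_state:
        return "reserved"
    return "on_demand"


def cheapest_venue(contract):
    """Cheapest venue of the given contract type, by price-range midpoint."""
    candidates = [v for v in VENUES if v[1] == contract]
    return min(candidates, key=lambda v: (v[2] + v[3]) / 2)[0]


def route(workload):
    """Return (contract type, venue name) for a workload."""
    contract = contract_for(workload)
    return contract, cheapest_venue(contract)


for w in [
    Workload("nightly embedding refresh", fault_tolerant=True, steady_state=False),
    Workload("baseline inference", fault_tolerant=False, steady_state=True),
    Workload("business-hours burst", fault_tolerant=False, steady_state=False),
]:
    print(w.name, "->", route(w))
```

In practice the price-only choice would be one input among several; a team tied to AWS services might accept the hyperscaler premium for the baseline tier, as discussed earlier in the section.
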
Tip
Maintain a multi-vendor account portfolio before you need it

Sign up for accounts at CoreWeave, Lambda Labs, Modal, RunPod, and the spot markets now, even if you only actively use one or two. The onboarding (KYC, billing setup, network configuration) takes days to weeks; the moment you actually need to fail over because your primary provider is out of H100s is the worst possible time to start that paperwork. Treat multi-vendor presence as cheap insurance, not as active multi-cloud spend. Most accounts cost nothing while idle.

57.3.5 Procurement as a continuous practice

Procurement is not a one-time exercise; pricing and capacity shift quarterly as new GPU generations ship, hyperscalers re-negotiate contracts with NVIDIA, and specialist providers expand their fleets. A production team should re-bid their GPU baseline every quarter: collect quotes from the top three or four candidates, model the workload against current pricing, and re-allocate if the spread has moved more than 15%. The mechanical cost of the re-bid is a half-day of engineering work; the savings on a six-figure monthly compute bill compound.
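
The quarterly re-bid reduces to a small comparison. The sketch below interprets "the spread has moved more than 15%" as "the best quote is more than 15% below the rate you currently pay", which is one reasonable reading rather than the only one. Vendor names and prices are placeholders.

```python
def rebid_decision(current_rate, quotes, threshold=0.15):
    """Compare the current $/GPU-hour against fresh quotes.

    quotes maps vendor name to quoted $/GPU-hour. Returns
    (best_vendor, fractional_saving, should_reallocate).
    """
    best_vendor = min(quotes, key=quotes.get)
    saving = (current_rate - quotes[best_vendor]) / current_rate
    return best_vendor, saving, saving > threshold


# Hypothetical quarter: currently paying $2.40/hr for the reserved baseline.
quotes = {"vendor_a": 2.30, "vendor_b": 1.95, "vendor_c": 2.60}
vendor, saving, move = rebid_decision(2.40, quotes)
print(f"Best quote: {vendor}, saving {saving:.1%}, re-allocate: {move}")
```

With these placeholder quotes the best offer is about 19% cheaper, so the rule says to re-allocate. A fuller version would also subtract one-off migration costs before comparing against the threshold.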

Section 57.4 closes the chapter with the breakeven analysis between API and self-hosting. Section 68.1 picks up the ROI side of the same coin: once you know what compute costs, how do you measure what the LLM dollar buys in business value.

Key Insight

The right GPU procurement strategy in 2026 is a tiered portfolio: reserve your steady-state floor at a specialist (CoreWeave, Lambda Labs) on 30-90-day terms, bridge predictable bursts with on-demand at the same specialist or a hyperscaler, and route fault-tolerant batch work to spot markets (RunPod, vast.ai). The pricing gap between hyperscalers and specialists is structural and worth 50-70% on the GPU line alone; the gap between on-demand and spot is another 60-80% on fault-tolerant workloads. Re-bid quarterly; keep multi-vendor accounts warm; never sign a multi-year hyperscaler reservation without firm multi-year revenue visibility.

What's Next

Procurement decides what you pay; performance benchmarking decides whether the cheaper GPU actually saves money once latency and throughput are accounted for. Continue to Section 57.4: LLM Performance Benchmarking and Cross-Hardware Portability.

Further Reading

GPU Economics

Cottier, B., Rahman, R., Fattorini, L., et al. (2024). "The rising costs of training frontier AI models." Epoch AI. arXiv:2405.21015. Empirical study of frontier-model training costs; the right reference for benchmarking GPU procurement decisions.
Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350. Compute-cost methodology that translates GPU-hour pricing into total project budget.

Spot and Reserved Markets

AWS (2024). "Amazon EC2 Spot Best Practices." docs.aws.amazon.com/AWSEC2/spot-best-practices. The canonical reference for spot-instance procurement; the math applies to all spot markets.
SF Compute (2024). "Auction-Based Compute Procurement." sfcompute.com. 2024 reference for the multi-GPU spot-auction model that has reshaped academic training economics.

Industry Reports

Stanford HAI (2024). "AI Index Report 2024." aiindex.stanford.edu/report. Annual benchmark on training-compute costs by model size; the standard procurement-planning reference.