Section 61.1

Platforms

"Hour 47 of an 8,192-GPU run is when you learn whether you bought a training platform or rented a very expensive lesson."

ScaleScale, Cluster-Babysitting AI Agent
Big Picture

A "platform" for LLM training and scale-out inference is the layer that allocates accelerators, schedules jobs across nodes, mounts the multi-petabyte dataset, and gives you observability when an 8,192-GPU run crashes at hour 47. The 2026 landscape splits into four groups. First, hyperscaler clouds (AWS, GCP, Azure, OCI) wrap HPC-style infrastructure inside their managed-ML services and bill per GPU-hour. Second, specialized GPU clouds (CoreWeave, Lambda, RunPod, vast.ai, Together, Modal, Fly.io, Cloudflare AI) strip away the hyperscaler abstractions to give you raw GPUs, with InfiniBand on the training-focused offerings, at materially lower cost. Third, HPC schedulers (Slurm, Torque, LSF, plus Kubernetes-based Volcano, KubeRay, Argo, Kubeflow) run on top of any of the above and provide the queue, gang-scheduling, and fault-tolerance semantics frontier training needs. Fourth, the orthogonal stack: parallel storage (Lustre, GPFS, Weka, BeeGFS, FSx) and training observability (Weights and Biases, MLflow, Comet, Aim), without which none of the compute platforms is usable. Frontier labs (Anthropic, OpenAI, DeepMind, xAI, Meta) build proprietary stacks on top of these primitives; the rest of us assemble.

The platform choice for an LLM training run is more consequential than the equivalent choice for inference, because the failure modes are different. An inference platform that drops 0.1% of requests is annoying. A training platform that drops 0.1% of nodes during a 60-day pretraining run, without good checkpoint and restart semantics, can burn millions of dollars in lost work. The dimensions that matter at scale are: interconnect (NVLink within a node, plus InfiniBand HDR / NDR or 400G+ RoCE Ethernet between nodes), gang-scheduling guarantees (all 1,024 GPUs start at once or not at all), checkpoint bandwidth (writing a 1TB checkpoint in seconds rather than hours), and the operational practices around hardware failures (GPUs at this scale fail daily; the platform either handles it or your team stays up all night).
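
The failure arithmetic behind that last point can be sketched in a few lines. This is a minimal sketch assuming node failures are independent and arrive at a constant rate (an exponential model). The 25,000-hour per-node MTBF is a hypothetical value chosen so that a 1,024-node cluster sees roughly one failure a day; it is not a measured figure, and the function names are illustrative.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """Mean time between failures for the whole cluster.

    With independent, constant-rate failures the per-node rates add,
    so the cluster MTBF is the node MTBF divided by the node count.
    """
    return node_mtbf_hours / n_nodes


def expected_failures(node_mtbf_hours: float, n_nodes: int, run_hours: float) -> float:
    """Expected number of node failures over the run."""
    return run_hours * n_nodes / node_mtbf_hours


def prob_any_failure(node_mtbf_hours: float, n_nodes: int, run_hours: float) -> float:
    """Probability that at least one node fails during the run."""
    return 1.0 - math.exp(-expected_failures(node_mtbf_hours, n_nodes, run_hours))


if __name__ == "__main__":
    n_nodes = 1024          # 8,192 GPUs at 8 GPUs per node
    node_mtbf = 25_000.0    # hypothetical per-node MTBF, in hours
    run_hours = 60 * 24     # the 60-day pretraining run from the text

    print(f"cluster MTBF: {cluster_mtbf_hours(node_mtbf, n_nodes):.1f} h")
    print(f"expected failures: {expected_failures(node_mtbf, n_nodes, run_hours):.1f}")
    print(f"P(at least one failure): {prob_any_failure(node_mtbf, n_nodes, run_hours):.6f}")
```

Under these assumptions a failure somewhere in the cluster is a near-certainty over a multi-week run, which is why checkpoint and restart semantics matter more than any single node's reliability.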

61.1.1 Hyperscaler cloud platforms

The four hyperscaler clouds offer first-class managed training platforms aimed at frontier-scale jobs. They are the default for enterprises with existing cloud commitments, regulated workloads, and teams that want a one-vendor billing relationship. They are typically 30 to 60 percent more expensive per GPU-hour than specialized GPU clouds but bundle networking, storage, and observability into the price.

61.1.2 Specialized GPU clouds

Two cartoon GPU clusters side-by-side: on the left, an Ethernet cluster wired with thin droopy cables whose sad GPUs reach only 30% MFU, and on the right, an InfiniBand cluster wired with thick glowing cables whose happy GPUs hit 60% MFU.
Figure 61.1.1: The interconnect, not the GPU, often sets the throughput ceiling on 1024+ GPU jobs.

Specialized GPU clouds are the 2023-2026 alternative to the hyperscalers, offering raw GPUs at materially better cost-per-hour by stripping the managed-ML abstractions and operating with thinner margins. They are the default for serious training shops outside the hyperscaler-incumbent enterprises and for any team where the compute bill is a primary cost driver.

Key Insight
The InfiniBand premium is the line that separates "training cluster" from "GPU pool"

The difference between a usable training cluster for a 70B+ model and a "pile of GPUs" is the inter-node interconnect. InfiniBand HDR (200 Gbps) or NDR (400 Gbps) per port, with non-blocking fat-tree topology, is required to get reasonable model-parallel scaling efficiency once you cross a single node. Many cheap GPU offers use 100 Gbps Ethernet or even slower fabrics, which works fine for single-node fine-tuning but collapses to 30 to 50 percent training efficiency at multi-node scale. When comparing GPU clouds, ask explicitly: "what is the inter-node interconnect, what is the topology (fat-tree, rail-optimized, etc.), and what is the bisection bandwidth?" The price gap between InfiniBand-equipped capacity and commodity-Ethernet capacity often reverses the headline-cost comparison once you measure model FLOPs utilization (MFU).
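
To see how MFU can reverse a headline-price comparison, here is a minimal sketch. It uses the common estimate of about 6 FLOPs per parameter per token for training (which ignores attention FLOPs). The peak figure of 989 TFLOP/s is an assumed dense BF16 peak for an H100 SXM, and both prices are hypothetical, chosen only to illustrate the effect.

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over peak FLOP/s.

    Uses the ~6 * N FLOPs-per-token training estimate (forward + backward).
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)


def price_per_useful_gpu_hour(price_per_gpu_hour: float, mfu_fraction: float) -> float:
    """Price normalized by utilization: what an hour of peak-rate work costs."""
    return price_per_gpu_hour / mfu_fraction


if __name__ == "__main__":
    # Assumed dense BF16 peak for one H100 SXM, in FLOP/s.
    h100_peak = 989e12
    print(f"MFU: {mfu(70e9, 300_000, 256, h100_peak):.3f}")

    # Hypothetical offers: cheap Ethernet pool vs pricier InfiniBand cluster.
    ethernet = price_per_useful_gpu_hour(1.80, 0.30)
    infiniband = price_per_useful_gpu_hour(2.60, 0.50)
    print(f"Ethernet:   ${ethernet:.2f} per useful GPU-hour")
    print(f"InfiniBand: ${infiniband:.2f} per useful GPU-hour")
```

With these illustrative numbers the offer that is cheaper per GPU-hour ends up more expensive per unit of useful work, which is the reversal the paragraph above describes.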

Key Insight
Bisection bandwidth and the oversubscription tax

"Fat-tree" and "rail-optimized" are not marketing phrases; they have a definite mathematical meaning. The bisection bandwidth of a network is the worst-case aggregate bandwidth across any partition of nodes into two equal halves: $B_{\text{bisect}} = \min_{\text{cut}} \sum_{\text{links crossing cut}} B_{\text{link}}$. A non-oversubscribed fat-tree with $N$ endpoints and per-link bandwidth $B_{\text{link}}$ delivers $B_{\text{bisect}} = (N/2) \cdot B_{\text{link}}$ exactly, which means a cluster-wide all-reduce can complete at full per-link bandwidth regardless of which ranks pair up. A common cost-saving compromise is the 2:1 oversubscribed spine, where the uplink from each leaf to the spine carries only half the cross-sectional traffic the leaf could in principle send: $B_{\text{bisect}} = (N/4) \cdot B_{\text{link}}$. The implication for training: an all-reduce that crosses the oversubscribed spine runs at half its effective bandwidth, so its communication time roughly doubles, and so does the step time of any job that is communication-bound. The 4:1 oversubscription common in older Ethernet AI clusters cuts the bandwidth to a quarter. Always ask "what is the worst-case bisection bandwidth in bytes/second" rather than "what is the link bandwidth in Gbps"; the two can differ by a factor of 4 to 8 on the same headline number.

61.1.3 Frontier-lab training infrastructure

Anthropic, OpenAI, Google DeepMind, Meta, and xAI run proprietary training stacks. Most details are non-public, but the broad architectural patterns are visible in vendor announcements, hardware procurement disclosures, and SEC filings.

61.1.4 HPC schedulers and orchestrators

Beneath the cloud or on-prem cluster, a scheduler decides which job gets which GPUs and how. The choice of scheduler shapes the developer workflow as much as the cloud choice does.
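
The gang-scheduling guarantee that training jobs rely on (all GPUs start at once or not at all) can be illustrated with a toy allocator. This is a sketch of the all-or-nothing semantics only; it is not how Slurm, Volcano, or any other real scheduler is implemented, and `gang_allocate` and the node names are hypothetical.

```python
from typing import Dict, List, Optional


def gang_allocate(free_gpus: Dict[str, int], nodes_needed: int,
                  gpus_per_node: int = 8) -> Optional[List[str]]:
    """Allocate whole nodes for a job, all-or-nothing.

    Returns the chosen node names, or None if the full gang does not fit.
    On failure the pool is left untouched: no partial allocation.
    """
    # Only nodes with a full set of free GPUs are eligible.
    candidates = sorted(
        name for name, free in free_gpus.items() if free >= gpus_per_node
    )
    if len(candidates) < nodes_needed:
        return None

    chosen = candidates[:nodes_needed]
    for name in chosen:
        free_gpus[name] -= gpus_per_node
    return chosen


if __name__ == "__main__":
    pool = {"node0": 8, "node1": 8, "node2": 4}
    print(gang_allocate(pool, nodes_needed=3))  # does not fit, pool unchanged
    print(gang_allocate(pool, nodes_needed=2))  # fits on node0 and node1
    print(pool)
```

A scheduler without this property could start a job on two of three nodes and leave it blocked at its first collective while holding GPUs that other jobs could use.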

61.1.5 Parallel and high-throughput storage

Storage architecture is often the second-biggest performance variable after the GPU interconnect. A training job that stalls waiting on data is wasting hundreds of dollars per minute; a checkpoint write that takes 30 seconds rather than 30 minutes is the difference between checkpointing every hour and checkpointing once a day.
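
The link between checkpoint write time and checkpoint frequency can be made concrete with Young's approximation for the checkpoint interval, $\tau \approx \sqrt{2 \cdot C \cdot \text{MTBF}}$, where $C$ is the time to write one checkpoint. This is a sketch under stated assumptions: the approximation holds when $C$ is much smaller than the MTBF, and the 24-hour cluster MTBF and 50 GB/s write bandwidth are hypothetical values.

```python
import math


def checkpoint_write_seconds(size_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Time to write one checkpoint at a sustained aggregate bandwidth."""
    return size_bytes / bandwidth_bytes_per_sec


def young_interval_seconds(write_seconds: float, mtbf_seconds: float) -> float:
    """Young's approximation for the checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * write_seconds * mtbf_seconds)


def overhead_fraction(write_seconds: float, interval_seconds: float,
                      mtbf_seconds: float) -> float:
    """Approximate time lost: checkpoint writes plus expected rework after a failure."""
    return write_seconds / interval_seconds + interval_seconds / (2.0 * mtbf_seconds)


if __name__ == "__main__":
    mtbf = 24 * 3600.0  # hypothetical cluster MTBF: one failure per day
    print(f"1 TB at 50 GB/s: {checkpoint_write_seconds(1e12, 50e9):.0f} s")
    for write_s in (30.0, 1800.0):  # 30 seconds vs 30 minutes
        tau = young_interval_seconds(write_s, mtbf)
        lost = overhead_fraction(write_s, tau, mtbf)
        print(f"write={write_s:>6.0f} s  interval={tau / 3600:.2f} h  overhead={lost:.1%}")
```

Under these assumptions the slow write pushes the interval from well under an hour to several hours and raises the lost-time fraction by roughly a factor of eight.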

61.1.6 Training observability platforms

A pretraining run is a long-running experiment whose health is measured by loss curves, gradient norms, throughput, and hundreds of derived metrics. The observability platform is the system-of-record for what happened during training and is consulted years later when someone asks "what hyperparameters did Llama-3-405B use?"

61.1.7 Mapping the platform stack

LLM training platform stack as of 2026 grouped into hyperscaler clouds, specialized GPU clouds, orchestration platforms, and experiment-tracking layers, with representative vendors in each tier
Figure 61.1.2: The 2026 LLM training platform stack, from hyperscaler clouds at the bottom up through schedulers, experiment trackers, and observability tools that together turn raw GPUs into shipping training runs.

61.1.8 Selection criteria and tradeoffs

Choosing a platform for training-scale work reduces to a small set of orthogonal axes, each of which independently rules out or recommends specific options:

Key Insight
The cluster you train on is rarely the cluster you serve on

A common 2026 production pattern is to train on a platform built for long jobs (a specialized GPU cloud such as CoreWeave with InfiniBand, or AWS SageMaker HyperPod with reserved capacity) where high-bandwidth interconnect and large committed blocks make multi-week pretraining economical, then serve the trained model on a hyperscaler, edge GPU platform, or hosted inference service (Cloudflare Workers AI, Fly.io GPU, Together AI dedicated endpoints) where geographic distribution and per-request billing match inference traffic. This split decouples the procurement and operational characteristics of training (long jobs, periodic) from those of serving (short requests, continuous). Plan the inference cloud separately from the training cloud rather than assuming one provider for both.

Real-World Scenario
A 70B model fine-tune split across two platforms

A mid-2025 fine-tuning project for a 70B base model used a two-platform pattern. The team used 32 H100 nodes (256 GPUs) on CoreWeave for the fine-tune itself, sized at roughly 5 days of continuous training. Checkpoints streamed to S3 (via a Lustre intermediate cache on FSx for Lustre). After training, the model weights were transferred to Together AI Dedicated Endpoints for serving, where the team controlled per-region autoscaling and per-request billing. Total training cost was approximately $80k (CoreWeave H100 at $2.40/GPU-hour with the team's committed-capacity discount); the team estimated the equivalent on AWS on-demand would have been $130k. The serving-side platform was chosen for autoscaling rather than for cost, since the request rate was bursty. The architectural lesson: separating training and serving platforms is not duplicate effort; it is a price-performance optimization that pays back quickly.

61.1.9 Pricing shapes and the 2026 cost curve

GPU-hour prices have been falling steadily since the 2023 H100 peak. As of May 2026, indicative on-demand prices for the H100 80GB SXM5 are roughly $2.49 to $4.40 per GPU-hour on hyperscalers and $1.79 to $2.79 on specialized clouds; reserved 1-year commits drop another 30 to 45 percent. H200 is roughly 30 percent more expensive than H100; B200 / GB200 is roughly 60 to 90 percent more expensive but delivers 2 to 3x the throughput on most LLM workloads. The economically interesting consequence is that "rent for a long pretrain" remains feasible without owning hardware: a 70B model pretraining run on 256 H100s for 30 days is 184,320 GPU-hours, or roughly $0.33M to $0.51M at the on-demand specialized-cloud prices above (about $0.46M to $0.81M at hyperscaler on-demand rates), comfortably inside the budget of a venture-funded company.
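
The arithmetic behind these figures is simple enough to keep in a script. The sketch below plugs the quoted per-GPU-hour ranges into the 256-GPU, 30-day example and compares B200 to H100 on cost per unit of work. It counts GPU-hours only (no storage, networking, or egress charges), and the function names are illustrative.

```python
def run_cost(n_gpus: int, days: float, price_per_gpu_hour: float,
             discount: float = 0.0) -> float:
    """GPU-hour cost of a run; discount is a fraction, e.g. 0.30 for 30 percent."""
    gpu_hours = n_gpus * days * 24
    return gpu_hours * price_per_gpu_hour * (1.0 - discount)


def relative_cost_per_unit_work(price_multiplier: float,
                                throughput_multiplier: float) -> float:
    """Cost per unit of work relative to a baseline GPU (below 1.0 means cheaper)."""
    return price_multiplier / throughput_multiplier


if __name__ == "__main__":
    # On-demand H100 ranges quoted in the text, in dollars per GPU-hour.
    for label, low, high in (("specialized", 1.79, 2.79), ("hyperscaler", 2.49, 4.40)):
        print(f"{label}: ${run_cost(256, 30, low):,.0f} to ${run_cost(256, 30, high):,.0f}")

    # Reserved 1-year commit at the low end of the quoted discount range.
    print(f"specialized, 30% commit: ${run_cost(256, 30, 1.79, 0.30):,.0f}")

    # B200 vs H100: 1.6x to 1.9x the price for 2x to 3x the throughput.
    best = relative_cost_per_unit_work(1.6, 3.0)
    worst = relative_cost_per_unit_work(1.9, 2.0)
    print(f"B200 cost per unit of work vs H100: {best:.2f}x to {worst:.2f}x")
```

On the quoted ranges, B200 comes out cheaper than H100 per unit of work even in the least favorable combination, provided the workload actually reaches the higher throughput.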

The four common pricing shapes in 2026:

61.1.10 Platform evaluation checklist

When evaluating a platform for an LLM training workload, the questions that surface real differences:

What's Next?

In the next section, Section 61.2: Libraries and Frameworks, we build on the material covered here.

Further Reading
AWS (2024). "Introducing Amazon SageMaker HyperPod." AWS News Blog. aws.amazon.com/blogs/aws/sagemaker-hyperpod. Launch reference for HyperPod and the canonical writeup of node-replacement, persistent-cluster training on AWS.
Meta Engineering (2024). "Building Meta's GenAI Infrastructure." Engineering at Meta, March 2024. engineering.fb.com/2024/03/12/building-metas-genai-infrastructure. Meta's disclosure of the two 24K-GPU H100 clusters and the RoCE-over-Ethernet versus InfiniBand design choice.
Barham, P. et al. (2022). "Pathways: Asynchronous Distributed Dataflow for ML." MLSys 2022. arxiv.org/abs/2203.12533. The Google Pathways paper, foundational reading on multi-pod TPU training and the abstractions behind Gemini-scale training.
Yoo, A.B., Jette, M.A., Grondona, M. (2003). "SLURM: Simple Linux Utility for Resource Management." JSSPP 2003. link.springer.com/SLURM-2003. The original Slurm paper; still the right starting reference for the gang-scheduling model that defines training-cluster job submission.
Biewald, L. (2020). "Experiment Tracking with Weights and Biases." Weights and Biases Docs. docs.wandb.ai/guides/track. Reference for the experiment-tracking primitives that became the de facto standard for LLM training logging.