Platforms

Section 10.6

"Every platform promises to make serving a 70B model easy. The one that wins is the one that admits it never gets easier, only different."

Deploy, Platform-Weary AI Agent
Big Picture

Part II's platform question shifts from "where do I run a 100-million-parameter model" to "where do I run a 70-billion-parameter LLM and still pay rent". This section catalogs the inference platforms that have consolidated around the open-weights LLM stack in 2026: vLLM (the open-source PagedAttention server), TGI (Hugging Face's Text Generation Inference), TensorRT-LLM (NVIDIA's compiled-graph runtime), and the managed services Together, Anyscale, and Modal. It then tells you which platform fits which workload shape, from local-laptop experimentation to multi-region agentic RAG production.

Prerequisites

This section assumes you understand inference-time compute costs from Section 9.1 and the KV cache mechanics from Section 9.4. Quantization basics from Section 10.1 will help you compare platforms on like-for-like throughput. The open-versus-closed model split (covered later in the book) provides additional context.

Part II's platform question shifts from "where do I run a 100-million-parameter model" to "where do I run a 70-billion-parameter model when I do not own an H100". The answer is a layered one: Hugging Face Hub hosts the weights, hardware (your laptop, your consumer GPU, your cloud rental) hosts the inference, and a runtime layer (vLLM, llama.cpp, transformers) bridges the two. This section maps the platforms; the runtime layer is Section 16.2's job in Part III.

Three-layer LLM inference stack: hub plus hardware plus runtime
Figure 10.6.1: The three-layer 2026 LLM inference stack. Layer 1 (hub) is dominated by Hugging Face Hub, with ModelScope (Qwen3 in China), the GGUF registry for llama.cpp, and Meta's Llama portal for gated releases. Layer 2 (hardware) maps model sizes to devices: 4-bit 13B on a laptop CPU via bitnet.cpp; 32B on a 64-GB M-series via MLX-LM; 4-bit 70B on a 24-GB consumer GPU (RTX 3090/4090/5090), which needs partial CPU offload or a second card because the 4-bit weights alone are about 35 GB; fp16 70B on A100 or H100 80-GB cards (about $1.99/hr on vast.ai spot), where the roughly 140 GB of fp16 weights need two cards, or 8-bit quantization to fit on one; MoE or larger models on multi-GPU clusters, with DeepSeek-V3 expert-parallel across 4x consumer cards as a popular 2025 pattern. Layer 3 (runtime) is dominated by vLLM, with SGLang ascendant for structured output, TensorRT-LLM for NVIDIA, TGI for Hugging Face shops, and llama.cpp for CPU and Apple Silicon.

10.6.1 The Hub layer

Fun Fact

The phrase "renting an H100" hides a small economic miracle. The same hour of H100 time that costs $1.99 on vast.ai spot can cost over $12 on the on-demand tier of a hyperscaler, with the underlying silicon being identical down to the serial number. The price gap is mostly insurance against your job being killed mid-batch.

Hugging Face Hub is the canonical model registry: every open-weight model in this chapter is downloadable through it. The Hub also hosts model cards, evaluation results, and discussions, which together cover most of the metadata you need to decide whether a model is worth your storage budget. HF Spaces hosts free Gradio and Streamlit demos for many models, which is useful for "try before you download".

Competing hubs exist but matter less in practice: ModelScope (Alibaba's hub and the primary host for Qwen3 in China), the GGUF registry on Hugging Face (GGUF is the quantized model format used by Ollama and llama.cpp), and Meta's Llama portal for license-gated official releases.

10.6.2 The hardware tier you actually need
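
The hardware tiers in Figure 10.6.1 can be sanity-checked with a weights-only memory estimate: parameters times bits per weight, divided by eight. The sketch below is a rough rule of thumb, not a sizing tool. The helper names and the 20% headroom for KV cache and activations are assumptions made for illustration; real overhead depends on context length, batch size, and runtime.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         device_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if weights plus a headroom allowance fit in device memory.

    overhead_fraction is a hypothetical allowance for KV cache and
    activations; it is not a measured value.
    """
    needed = weight_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= device_gb


if __name__ == "__main__":
    for params, bits, device in [(13, 4, 16), (70, 4, 24), (70, 8, 80), (70, 16, 80)]:
        gb = weight_gb(params, bits)
        verdict = "fits" if fits(params, bits, device) else "does not fit"
        print(f"{params}B @ {bits}-bit: {gb:.1f} GB of weights, {verdict} in {device} GB")
```

When the estimate exceeds a device's memory, the tier only works with CPU offload, a lower bit-width, or more than one card.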

10.6.3 Cloud GPU rentals

The cloud-rental market has consolidated around three tiers. The first is the "managed GPU notebook" (Colab Pro, Lightning Studios): hourly billing, no SSH, easy to spin up. The second is the "rented bare-metal GPU" (vast.ai, with its "verified hosts" filter to avoid the worst-quality nodes; RunPod; Lambda Labs): per-second billing, SSH access, and a much lower price per GPU-hour. The third is "serverless GPU" (Modal, Replicate): pay per request, no provisioning. Outside these tiers, the alternative high-throughput inference providers worth knowing alongside Groq are TensorWave (AMD MI300X) and SambaNova (RDU dataflow chips), both of which serve Llama and DeepSeek models at competitive tokens per second. For Part II experimentation, vast.ai and RunPod are the right defaults; Modal and Replicate make more sense for production in Part III.

SGLang (sgl-project/sglang, 2024-25) emerged in 2025 as the third real inference runtime alongside vLLM and TGI. It often beats vLLM on structured-output and constrained-decoding workloads thanks to its RadixAttention prefix-cache design. Try it whenever your workload involves heavy JSON / regex constraints.
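
To make the prefix-cache idea concrete, here is a toy sketch of why a shared prefix saves work. It is not SGLang's implementation: it uses a plain token-level trie rather than a compressed radix tree, stores no KV tensors, and never evicts anything. The class and function names are hypothetical. It only counts how many prompt tokens a request could reuse from earlier requests.

```python
class PrefixCache:
    """Toy token-level prefix tree.

    A real runtime would attach KV-cache blocks to the nodes; this
    sketch only tracks which token prefixes have been seen.
    """

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of tokens."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record tokens so later requests can reuse the prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (reused, computed) token counts for one request."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = PrefixCache()
    schema_prompt = [1, 2, 3, 4]  # stand-in for a long shared JSON-schema prompt
    print(serve(cache, schema_prompt + [10, 11]))
    print(serve(cache, schema_prompt + [20]))
```

Structured-output workloads tend to resend the same long schema or instruction prefix on every request, which is the case where this kind of reuse pays off.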

10.6.4 Comparing the rental platforms

Table 10.6.1a: GPU platforms for Part II experimentation.

| Platform | Type | Approx. H100 $/hr | Best for |
|---|---|---|---|
| vast.ai | Spot bare-metal | $1.80-$3 | Cheapest H100 / A100 by far |
| RunPod | Bare-metal + serverless | $2.50-$4 | Reliable spot, S3 network volumes |
| Lambda Labs | Reserved GPUs | $2.50-$5 | Multi-node training, reservation |
| Modal | Serverless | $3.95 (billed per second) | Batch jobs with no provisioning |
| AWS p5 | Hyperscaler reserved | $5-$8 | Enterprise, audited workloads |

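
The hourly ranges in Table 10.6.1a turn into a budget with simple arithmetic. The sketch below multiplies GPU-hours by the low and high ends of each range and adds an optional egress term. The function name is hypothetical, and the egress rate is a placeholder you must fill in from the provider's current pricing page; the table does not list egress prices.

```python
# (low, high) approximate H100 $/hr, copied from Table 10.6.1a.
H100_RATES = {
    "vast.ai": (1.80, 3.00),
    "RunPod": (2.50, 4.00),
    "Lambda Labs": (2.50, 5.00),
    "Modal": (3.95, 3.95),
    "AWS p5": (5.00, 8.00),
}


def job_cost(platform, gpu_hours, egress_gb=0.0, egress_usd_per_gb=0.0):
    """Return (low, high) estimated cost in USD for one job.

    egress_usd_per_gb is a hypothetical input, not a quoted price.
    """
    low, high = H100_RATES[platform]
    egress = egress_gb * egress_usd_per_gb
    return (round(low * gpu_hours + egress, 2),
            round(high * gpu_hours + egress, 2))


if __name__ == "__main__":
    for name in H100_RATES:
        lo, hi = job_cost(name, gpu_hours=10)
        print(f"{name:12s} 10 GPU-hours: ${lo:.2f} to ${hi:.2f}")
```
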
Warning: spot instances can vanish mid-run

vast.ai and RunPod spot instances are cheap because they can be reclaimed at any moment. For Part II inference work this is fine (restart the script). For multi-hour training, checkpoint to a network volume every few minutes or you will lose work to an interruption. The gpu2runpod and gpu2vast scaffolds in this book's scripts/ directory automate the resume-on-interrupt pattern.
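
The resume-on-interrupt pattern can be sketched in a few lines. This is a generic illustration, not the code of the gpu2runpod or gpu2vast scaffolds: the function names, the JSON state, and the `stop_after` switch used to simulate a reclaimed instance are all assumptions. The two ideas it shows are atomic checkpoint writes (write to a temporary file, then rename) and always starting from the last saved step.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, num_steps, save_every=10, stop_after=None):
    """Run from the last checkpoint; stop_after simulates a reclaimed instance."""
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated interruption: unsaved progress is lost
    save_checkpoint(path, state)
    return state
```

On a real spot instance, `path` would point at the network volume so the checkpoint survives the node being reclaimed.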

Warning: Don't trust a vast.ai instance until nvidia-smi agrees with the listing

vast.ai listings sometimes advertise GPUs that the container cannot actually see, for example because of an incompatible CUDA driver or misconfigured passthrough. Always run nvidia-smi and a one-batch torch forward pass before launching a long job.

Egress charges also vary by provider and can flip the cost comparison once you start moving checkpoints, so check each provider's current data-transfer pricing and budget egress as a line item.
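
A pre-flight check along these lines can be scripted. The sketch below shells out to `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and compares the result with what the listing promised. The function names and the matching rule (substring match on the GPU name plus a minimum memory in MiB) are assumptions made for illustration, and the script does not replace the one-batch forward pass.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def visible_gpus():
    """Return the GPUs nvidia-smi can see, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


def matches_listing(gpus, expected_name, expected_count, min_mem_mib):
    """True if enough visible GPUs match the advertised name and memory."""
    ok = [g for g in gpus
          if expected_name.lower() in g[0].lower() and g[1] >= min_mem_mib]
    return len(ok) >= expected_count


if __name__ == "__main__":
    found = visible_gpus()
    print(found)
    print("listing OK" if matches_listing(found, "H100", 1, 80000) else "MISMATCH")
```
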

Tip: Prefer a small model on your own hardware over a big one in the cloud

For learning, "I can run this model whenever I want" beats "I can run a bigger model when I am willing to pay". A 4-bit Qwen3-8B on your laptop will teach you more about LLM behavior than three rented H100 sessions, simply because you will run more experiments. Reserve cloud GPUs for the few experiments where model size is the variable you are studying.

10.6.5 Default recommendation

For Part II: start with your own GPU (6 GB or more), use Hugging Face Hub for downloads, and rent from vast.ai or RunPod when you need an A100 or H100. Pick the runtime by target: llama.cpp for CPU and Apple Silicon support, vLLM for GPU throughput. Move to managed platforms (Modal, Replicate, Bedrock) only when your work transitions from "I am studying this" to "I am shipping this".

Key Takeaways

What's Next?

In the next section, Section 10.7: Interpretability Tools & Transformers Deep Dive, we build on the material covered here.

Further Reading

GPU Rental Markets

Pope, R., Douglas, S., Chowdhery, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. arXiv:2211.05102. Reference for GPU sizing across the inference Pareto frontier.
Hugging Face (2024). "Hugging Face Hub Documentation." huggingface.co/docs/hub. The canonical model registry; the primary hub layer of this section.

Hardware References

NVIDIA (2024). "Hopper Architecture Whitepaper (H100/H200)." resources.nvidia.com/en-us-tensor-core. The reference spec for the H100/H200 tier.
Apple (2024). "Apple Silicon Performance Documentation." developer.apple.com/documentation/metal. Reference for unified-memory Apple GPUs used for local LLM inference.