Part II: Understanding LLMs

Chapter 9: Inference Optimization & Efficient Serving

"The fastest code is the code that never runs. The fastest inference is the inference that predicts the answer before it finishes thinking."

Quant Quant, Preemptively Fast AI Agent
Inference Optimization and Efficient Serving chapter illustration
Figure 9.0.1: Squeezing every last drop of speed from a giant model is part science, part art. Quantization, caching, and batching are the tools of the trade.

Chapter Overview

Training a large language model is only half the challenge (see Chapter 06: Pretraining & Scaling Laws). The other half is making inference fast enough and affordable enough to serve real users. A 70-billion-parameter model consumes roughly 140 GB of GPU memory even at 16-bit half precision (two bytes per parameter), and twice that at full 32-bit precision. It generates tokens one at a time via autoregressive decoding, and must maintain an ever-growing cache of key/value tensors for each active request. Without optimization, serving LLMs at scale is prohibitively expensive.
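As a back-of-the-envelope sketch, the memory arithmetic above is just bytes-per-parameter times parameter count, plus the KV cache, which grows with sequence length and batch size. The helper functions and the example model configuration below (80 layers, 8 KV heads, head dimension 128, in the spirit of a 70B grouped-query-attention model) are illustrative assumptions, not figures from this chapter:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for dense model weights at the given precision, in GB."""
    return n_params * bits / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bits: int = 16) -> float:
    """KV cache size: two tensors (K and V) per layer, per token, per head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits / 8 / 1e9

# 70B parameters at 16-bit vs. 32-bit precision
print(weight_memory_gb(70e9, 16))  # 140.0 GB
print(weight_memory_gb(70e9, 32))  # 280.0 GB

# Illustrative 70B-class config: one 4096-token request at FP16
print(kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                  seq_len=4096, batch=1))  # ~1.34 GB
```

Note how the KV cache, while small per request here, scales linearly with both batch size and context length, which is why the memory-optimization techniques in this chapter matter at high concurrency.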

This chapter covers the four pillars of inference optimization. First, quantization reduces the precision of model weights (and sometimes activations) so that models fit on fewer GPUs and run faster. Second, KV cache and memory optimization techniques such as PagedAttention, grouped-query attention, and prefix caching eliminate memory waste and boost throughput. Third, speculative decoding breaks the sequential token-generation bottleneck by drafting multiple tokens at once and verifying them in parallel. Finally, serving infrastructure frameworks like vLLM, SGLang, TGI, and TensorRT-LLM tie everything together into production-ready systems that handle thousands of concurrent requests.

By the end of this chapter, you will understand the math behind each technique, know when to apply each one, and have hands-on experience quantizing models, profiling memory, implementing speculative decoding, and deploying high-throughput inference servers.
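As a preview of the speculative-decoding material, the toy sketch below illustrates the accept/verify loop in its simplest greedy form: a cheap draft model proposes `k` tokens, the target model checks them, and the target always contributes one token of its own (a correction or an extension). The functions here are hypothetical stand-ins for real models, and the verification loop stands in for what would be a single batched forward pass:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding.

    draft_next / target_next: callables mapping a token list to the
    next greedy token (stand-ins for cheap and expensive models).
    Returns the tokens accepted this round (between 1 and k+1 tokens).
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target verifies the proposals; in a real system all k
    #    positions are scored in one parallel forward pass.
    ctx = list(prefix)
    accepted = []
    for t in proposed:
        if target_next(ctx) == t:   # draft matched the target's choice
            accepted.append(t)
            ctx.append(t)
        else:
            break                   # first mismatch ends acceptance

    # 3. Target always emits one token itself, so progress is guaranteed.
    accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees with the target, one expensive verification step yields up to `k + 1` tokens instead of one, which is the source of the speedup; a mismatching draft still costs only the wasted draft tokens.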

Big Picture

Even the most capable model is useless if it is too slow or too expensive to serve. This chapter covers quantization, KV-cache optimization, speculative decoding, and other techniques that determine whether your LLM application can meet real-world latency and cost requirements in production (Part VIII).

Prerequisites

Learning Objectives

Sections

What's Next?

In the next chapter, Chapter 18: Interpretability, we look inside the black box with probing, attention analysis, and mechanistic interpretability techniques.