Section 9.5: Serving Stack & vLLM Deep Dive

Serving an LLM in production is 10% machine learning and 90% convincing the infrastructure that thousands of users can share one very expensive GPU without anyone noticing the wait.
Quant, Queue Wrangling AI Agent

Big Picture

From model weights to production endpoint. A trained model is just a collection of tensors on disk. Turning it into a responsive, scalable API requires specialized serving infrastructure that handles continuous batching, KV cache management, request scheduling, model parallelism, and hardware-specific kernel optimization. For production deployment patterns and safety considerations, see Chapter 47. Building on the quantization (Section 9.1), KV cache (Section 9.3), and speculative decoding (Section 9.4) techniques covered earlier in this chapter, this section surveys the major serving frameworks available today, explains their architectural differences, and provides practical guidance for choosing the right tool for your deployment scenario. We conclude with a benchmarking methodology so you can make data-driven decisions for your own workloads.

Prerequisites

This section ties together all previous Chapter 9: Inference Optimization & Efficient Serving concepts: quantization (Section 9.1), KV cache management (Section 9.3), and speculative decoding (Section 9.4). Serving frameworks combine these techniques. Understanding of continuous batching requires familiarity with autoregressive generation from Section 4.1.

9.5.1 The Serving Stack

Key Insight

Why: PagedAttention is virtual memory for the KV cache

PagedAttention works because the KV cache has the same access pattern as virtual memory: many concurrent requests of unknown final length, each growing one page at a time, all needing to share a fixed physical pool. Kwon et al. (2023) explicitly modeled this on the OS paging algorithms from the 1960s. The 2-4x throughput gain comes from eliminating two waste sources: internal fragmentation (preallocating worst-case KV lengths per request) and external fragmentation (free slots stranded between active requests). Stating that PagedAttention is "virtual memory for the KV cache" (not just an LLM-specific trick) lets readers transfer all their OS-paging intuitions (copy-on-write, swapping, page-sharing for prefix caching) directly.

A restaurant kitchen metaphor for the LLM serving stack with load balancers as hosts, inference engines as chefs, and APIs as waiters — **Figure 9.5.1**: The serving stack is a restaurant kitchen: the load balancer seats guests, the inference engine cooks, and the API layer serves the dishes.

An LLM serving system sits between the raw model weights and the HTTP API that clients consume. It manages several critical responsibilities that are absent from a naive model.generate() call:

Request scheduling: Ordering incoming requests, managing priority queues, and enforcing rate limits.
Continuous batching: Dynamically adding and removing sequences from the active batch at each iteration (as covered in Section 9.3).
KV cache management: Allocating, sharing, and evicting KV cache blocks using PagedAttention or similar techniques.
Model parallelism: Distributing model weights across multiple GPUs using tensor parallelism (splitting layers) or pipeline parallelism (splitting stages).
Kernel optimization: Using hardware-specific fused kernels (FlashAttention, custom GEMM) to maximize throughput.
API layer: Exposing an OpenAI-compatible HTTP endpoint with streaming support (the client-side perspective is covered in Section 11.1).

Real-World Scenario

Benchmarking Serving Frameworks for a Multi-Model Gateway

Who: A DevOps engineer at an AI platform company building a unified API gateway that routes requests to different models based on task type.

Situation: The gateway needed to serve three models simultaneously (Llama-3.1 8B for chat, Llama-3.1 70B for complex reasoning, and a code-specialized 13B model), sharing a pool of 16 A100 GPUs.

Problem: Using Hugging Face's text-generation-inference (TGI) for all models resulted in suboptimal GPU utilization (average 35%) and high P99 latency (8.2 seconds) during peak traffic, because each model instance reserved GPUs exclusively.

Dilemma: vLLM offered better throughput via PagedAttention but lacked built-in multi-model support at the time. SGLang promised even higher throughput with RadixAttention but was newer and less battle-tested. TensorRT-LLM offered maximum single-model performance but required NVIDIA-specific compilation.

Decision: They deployed vLLM with separate instances per model, fronted by a custom routing proxy that directed requests based on task type and managed GPU allocation dynamically.

How: The team ran a structured benchmark: 1,000 requests per model with mixed prompt lengths (128 to 4,096 tokens) and output lengths (64 to 2,048 tokens), measuring throughput (tokens/second), time-to-first-token (TTFT), and P99 latency across vLLM, TGI, and SGLang.

Result: vLLM achieved 2.1x higher throughput than TGI for the 70B model and 1.4x for the 8B model. P99 latency dropped from 8.2 to 3.1 seconds. GPU utilization rose to 62%. SGLang matched vLLM on throughput and showed 15% better prefix cache hit rates for the chat model.

Lesson: Always benchmark serving frameworks on your actual workload distribution (prompt lengths, output lengths, concurrency), not just on synthetic benchmarks. The best framework depends on your specific traffic patterns.

Figure 9.5.2: The layers of an LLM serving stack, from HTTP clients down to GPU hardware. Each framework implements these layers with different design choices.

9.5.2 vLLM

Fun Fact

vLLM went from a UC Berkeley research project to the default serving framework for the open-source LLM community in under a year. Its PagedAttention paper was published in 2023, and by 2024 it was serving models at companies ranging from startups to Fortune 500 firms. In the LLM world, adoption timelines are measured in months, not years.

vLLM (Kwon et al., 2023) is the most widely adopted open-source LLM serving framework. Its core innovation is PagedAttention (covered in Section 9.3), which enables near-zero-waste KV cache memory management. Beyond PagedAttention, vLLM provides continuous batching, tensor parallelism, prefix caching, speculative decoding support, and an OpenAI-compatible API server. By 2026 vLLM ships chunked prefill, prefix caching, and disaggregated prefill/decode (separating the compute-bound prefill phase from the memory-bound decode phase onto different GPUs) by default; SGLang and TensorRT-LLM provide comparable feature sets.

The throughput advantage of vLLM over a naive serving stack is best understood through the KV-cache memory budget. Per request, a transformer with $L$ layers, $H$ KV heads per layer, head dimension $d_h$, sequence length $T$, and $b$ bytes per element (e.g., $b = 2$ for FP16) consumes

$$\text{KV bytes} \;=\; 2 \cdot L \cdot H \cdot d_h \cdot T \cdot b$$

bytes (factor 2 for keys and values). For Llama-3 8B at $T = 4096$ this is roughly 2 GB per request; with GQA-grouped heads the constant drops, but per-request KV still dominates GPU memory. Naive batching pre-allocates the maximum context length for every slot, wasting up to 60-80% of memory on padding. PagedAttention treats KV as paged virtual memory in fixed-size blocks (16 tokens by default), allocating blocks only as the sequence grows; the resulting fragmentation falls below 4%, so a vLLM instance can pack roughly $4\times$ more concurrent requests onto the same GPU than a static-allocation server.

Key features:

PagedAttention: Block-based KV cache allocation with copy-on-write for shared prefixes.
Continuous batching: Iteration-level scheduling that adds new requests as soon as GPU slots become available.
Quantization support: GPTQ, AWQ, FP8, and GGUF formats.
Tensor parallelism: Splits model layers across GPUs for large models.
OpenAI-compatible API: Drop-in replacement for OpenAI endpoints with /v1/completions and /v1/chat/completions.

Library Shortcut: vLLM PagedAttention + tensor parallel

Spinning up vLLM is a one-line CLI launch or four lines of Python. Pass tensor_parallel_size to shard across multiple GPUs, and enable_prefix_caching=True to share KV blocks across requests with overlapping prompts. The OpenAI-compatible server makes vLLM a drop-in replacement for OpenAI clients during local development.

Show code

pip install vllm
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=4, enable_prefix_caching=True,
          gpu_memory_utilization=0.9)
out = llm.generate(["Write a haiku about GPUs."], SamplingParams(max_tokens=64))

Code Fragment 9.5.1a: Four lines of vLLM serve a 70B model across four GPUs with PagedAttention and prefix caching on.

# Example 1: Launch vLLM server and benchmark throughput
# Terminal: Start the vLLM server
# $ python -m vllm.entrypoints.openai.api_server \
# --model meta-llama/Llama-3.1-8B-Instruct \
# --dtype float16 \
# --max-model-len 8192 \
# --gpu-memory-utilization 0.90 \
# --enable-prefix-caching \
# --port 8000
# Python: Benchmark with concurrent requests
import asyncio
import aiohttp
import time
import json
async def send_request(session, url, prompt, max_tokens=128):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        }
    t0 = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        result = await resp.json()
        elapsed = time.perf_counter() - t0
        n_tokens = result["usage"]["completion_tokens"]
        return n_tokens, elapsed
        async def benchmark(concurrency=16, total_requests=64):
            url = "http://localhost:8000/v1/chat/completions"
            prompts = [
                "Explain quantum entanglement simply.",
                "Write a Python function to merge two sorted lists.",
                "What causes the northern lights?",
                "Describe the water cycle in four steps.",
                ]
            connector = aiohttp.TCPConnector(limit=concurrency)
            async with aiohttp.ClientSession(connector=connector) as session:
                tasks = [
                    send_request(session, url, prompts[i % len(prompts)])
                    for i in range(total_requests)
                    ]
                t_start = time.perf_counter()
                results = await asyncio.gather(*tasks)
                t_total = time.perf_counter() - t_start
                total_tokens = sum(r[0] for r in results)
                latencies = [r[1] for r in results]
                print(f"Concurrency: {concurrency}")
                print(f"Total requests: {total_requests}")
                print(f"Total time: {t_total:.2f}s")
                print(f"Total tokens: {total_tokens}")
                print(f"Throughput: {total_tokens/t_total:.1f} tok/s")
                print(f"Avg latency: {sum(latencies)/len(latencies):.2f}s")
                print(f"P50 latency: {sorted(latencies)[len(latencies)//2]:.2f}s")
                print(f"P99 latency: {sorted(latencies)[int(0.99*len(latencies))]:.2f}s")
                asyncio.run(benchmark(concurrency=16, total_requests=64))

Output: Concurrency: 16 Total requests: 64 Total time: 12.47s Total tokens: 7891 Throughput: 632.7 tok/s Avg latency: 3.02s P50 latency: 2.87s P99 latency: 4.91s

Code Fragment 9.5.2a: Example 1: Launch vLLM server and benchmark throughput.

# Example 4: Comprehensive benchmarking script
import asyncio
import aiohttp
import time
import numpy as np
from dataclasses import dataclass
@dataclass
class RequestResult:
    ttft: float # time to first token (seconds)
    total_time: float # end-to-end latency
    output_tokens: int # number of generated tokens
    tpot: float # time per output token (after first)
    async def benchmark_streaming(url, model, prompt, max_tokens=128):
        """Measure TTFT and TPOT via streaming response."""
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "stream": True,
            }
        t0 = time.perf_counter()
        ttft = None
        token_count = 0
        async with aiohttp.ClientSession() as session:
            async with session.post(url, json=payload) as resp:
                async for line in resp.content:
                    decoded = line.decode().strip()
                    if decoded.startswith("data: ") and decoded != "data: [DONE]":
                        if ttft is None:
                            ttft = time.perf_counter() - t0
                            token_count += 1
                            total = time.perf_counter() - t0
                            decode_time = total - (ttft or total)
                            tpot = decode_time / max(token_count - 1, 1)
                            return RequestResult(ttft=ttft or 0, total_time=total,
                                output_tokens=token_count, tpot=tpot)
                            async def run_benchmark(concurrency_levels=[1, 4, 8, 16, 32]):
                                url = "http://localhost:8000/v1/chat/completions"
                                model = "meta-llama/Llama-3.1-8B-Instruct"
                                prompt = "Describe the process of photosynthesis."
                                print(f"{'Concur':>7} {'Throughput':>11} {'Avg TTFT':>10} "
                                    f"{'P50 TTFT':>10} {'Avg TPOT':>10} {'P99 Lat':>10}")
                                print("-" * 62)
                                for c in concurrency_levels:
                                    n_requests = max(c * 4, 16)
                                    sem = asyncio.Semaphore(c)
                                    async def bounded_req():
                                        async with sem:
                                            return await benchmark_streaming(url, model, prompt)
                                            tasks = [bounded_req() for _ in range(n_requests)]
                                            t0 = time.perf_counter()
                                            results = await asyncio.gather(*tasks)
                                            wall_time = time.perf_counter() - t0
                                            total_tok = sum(r.output_tokens for r in results)
                                            ttfts = [r.ttft for r in results]
                                            lats = [r.total_time for r in results]
                                            tpots = [r.tpot for r in results]
                                            print(f"{c:>7} {total_tok/wall_time:>10.1f}/s "
                                                f"{np.mean(ttfts)*1000:>9.0f}ms "
                                                f"{np.percentile(ttfts, 50)*1000:>9.0f}ms "
                                                f"{np.mean(tpots)*1000:>9.0f}ms "
                                                f"{np.percentile(lats, 99):>9.2f}s")
                                            asyncio.run(run_benchmark())

Output: Concur Throughput Avg TTFT P50 TTFT Avg TPOT P99 Lat -------------------------------------------------------------- 1 48.3/s 87ms 85ms 21ms 2.78s 4 187.2/s 112ms 108ms 22ms 3.01s 8 351.6/s 189ms 175ms 23ms 3.42s 16 632.7/s 342ms 315ms 25ms 4.91s 32 894.1/s 687ms 621ms 30ms 7.23s

Code Fragment 9.5.3: Example 1: Launch vLLM server and benchmark throughput

Key Insight

The most common mistake teams make when choosing a serving framework is optimizing for the wrong metric. If your workload is latency-sensitive (real-time chat, code completion), optimize for time-to-first-token (TTFT) and per-token latency. If your workload is throughput-sensitive (batch processing, evaluation pipelines), optimize for total tokens per second across concurrent requests. These two objectives can conflict: techniques like continuous batching and large batch sizes increase throughput but also increase per-request latency. The benchmarking methodology in this section helps you measure the metrics that matter for your specific use case. For guidance on when to self-host versus use an API provider, see Section 13.3.

Exercise 9.5.1: Latency vs. throughput tradeoff on vLLM Coding

Spin up vllm serve meta-llama/Llama-3.1-8B-Instruct on a single GPU, then benchmark with vllm bench serve at concurrency levels {1, 4, 16, 64}. Plot TTFT (p50) and total tokens/sec on the same axes. Identify the concurrency level where TTFT crosses 500 ms; that is your operating ceiling for an interactive chat workload.

Answer Sketch

Typical 8B-on-A100 numbers: concurrency 1 gives TTFT ≈ 80 to 110 ms and 80 tok/s. Concurrency 16 gives TTFT ≈ 340 ms and 630 tok/s. Concurrency 64 gives TTFT ≈ 700+ ms and 900 tok/s but blows the interactive budget. The crossover above 500 ms TTFT typically lands between concurrency 24 and 32 on an A100; lower for shorter prompts, higher for longer. Operate just below the crossover for chat, above it for batch.

Exercise 9.5.2: KV cache memory calculation Conceptual

For Llama-3.1-8B (n_layers=32, n_kv_heads=8, head_dim=128, FP16), compute the KV cache memory per token, then per request for a 4096-token context. How many concurrent 4K-context requests fit in a 40 GB GPU after subtracting the 16 GB model weights?

Answer Sketch

Per token: 2 (K and V) x 32 layers x 8 KV heads x 128 dim x 2 bytes = 131,072 bytes ≈ 128 KB. Per 4096-token request: 128 KB x 4096 ≈ 512 MB. With 24 GB free (40 - 16), about 48 simultaneous 4K requests fit. This is the upper bound on vLLM's effective concurrency before PagedAttention reuse kicks in.

What's Next?

In the next part of this section, Section 9.6: Serving Runtimes: SGLang, TGI, TensorRT & Edge, the llm serving stack and a deep dive into vllm, the most widely deployed open-source llm serving framework.

Further Reading

Serving Frameworks & Systems

Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. The paper behind vLLM, the most widely deployed open-source LLM serving engine. Introduces PagedAttention for efficient KV cache management, enabling near-zero memory waste and request-level memory sharing.

Zheng, L. et al. (2024). "SGLang: Efficient Execution of Structured Language Model Programs." arXiv preprint arXiv:2312.07104. Introduces RadixAttention for automatic prefix caching and a domain-specific language for structured LLM programs. Particularly strong for workloads with shared system prompts and constrained generation patterns.

NVIDIA (2024). "TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference." NVIDIA's production-grade inference library that compiles LLMs into optimized TensorRT engines. Delivers the highest throughput on NVIDIA hardware through custom CUDA kernels, FP8 quantization, and in-flight batching.

Scheduling & Batching Research

Yu, G. et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. Introduces iteration-level scheduling (continuous batching) for LLM serving, allowing new requests to join a batch as soon as any request completes. The foundational system paper that all modern serving frameworks build upon.

Agrawal, A. et al. (2024). "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." OSDI 2024. Addresses the interference between prefill and decode phases in continuous batching through chunked prefills and stall-free scheduling. Useful for understanding how production systems balance latency and throughput.

Local & Edge Inference

Gerganov, G. et al. (2024). "llama.cpp: Inference of Meta's LLaMA model in pure C/C++." The foundational project for local LLM inference, supporting CPU, Apple Silicon, and consumer GPU execution with GGUF quantization formats. Powers Ollama and most consumer-facing local LLM applications.