Building Conversational AI with LLMs and Agents
Appendix S: Inference Serving: vLLM, TGI, and SGLang

vLLM: PagedAttention, Continuous Batching, and OpenAI-Compatible API

Big Picture

vLLM is an open-source inference engine that achieves state-of-the-art serving throughput through PagedAttention and continuous batching (see Section 9.2 for the theory behind both techniques). Together, these deliver 2x to 24x higher throughput than naive HuggingFace inference. vLLM also exposes an OpenAI-compatible API, making it a drop-in replacement for cloud-hosted models.

Covered in Detail

For the theoretical foundations of PagedAttention, KV cache memory management, and continuous batching, see Section 9.2: KV Cache & Memory Optimization. This section focuses on the practical setup and usage of vLLM as a serving engine.

1. Installing and Running vLLM

vLLM supports Linux with CUDA 11.8 or later. The simplest installation path is through pip. The following command installs vLLM along with its dependencies.

# Install vLLM (requires CUDA 11.8+ and Linux)
pip install vllm

Once installed, you can verify the installation and check which GPU devices are visible.

import vllm
print(f"vLLM version: {vllm.__version__}")

from vllm import LLM

# Loading a model prints the detected GPU devices and memory usage during
# initialization; facebook/opt-125m is a small model suited to a quick smoke test
llm = LLM(model="facebook/opt-125m")

1.1 Offline (Batch) Inference

The simplest way to use vLLM is for offline batch inference, where you have a list of prompts and want to generate completions as fast as possible. The LLM class handles model loading, tokenization, and generation in a single interface.

from vllm import LLM, SamplingParams

# Load the model (downloads from HuggingFace on first run)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
    stop=["\n\n"],          # Stop generation at double newline
    presence_penalty=0.1,
)

# Batch of prompts to process
prompts = [
    "Explain the difference between TCP and UDP in one paragraph.",
    "Write a Python function to calculate the Fibonacci sequence.",
    "What are the three laws of thermodynamics?",
    "Translate 'Hello, how are you?' into French, German, and Japanese.",
]

# Generate completions (vLLM handles batching internally)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt:  {prompt[:60]}...")
    print(f"Output:  {generated[:200]}")
    print()

Behind the scenes, vLLM applies PagedAttention and continuous batching to process all four prompts concurrently, fully utilizing the GPU. On an A100 GPU, this batch completes roughly 8x faster than sequential model.generate() calls.

1.2 Sampling Parameters

The SamplingParams class provides fine-grained control over text generation. The table below summarizes the most commonly used parameters.

Parameter          Default  Description
temperature        1.0      Controls randomness; lower values produce more deterministic output
top_p              1.0      Nucleus sampling; considers tokens whose cumulative probability reaches this threshold
top_k              -1       Limits sampling to the top-k most probable tokens (-1 disables)
max_tokens         16       Maximum number of tokens to generate
stop               None     List of strings that trigger generation to stop
presence_penalty   0.0      Penalizes tokens that have already appeared
frequency_penalty  0.0      Penalizes tokens proportional to their frequency
best_of            1        Generates N candidates and returns the one with the highest log-probability

Figure S.1.2: Key sampling parameters in vLLM's SamplingParams class.
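To build intuition for what top_p actually does, the following pure-Python sketch applies nucleus filtering to a toy distribution. This is illustrative only; vLLM performs the equivalent filtering on the GPU over the full vocabulary.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize. Returns (index, prob) pairs."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return [(idx, p / total) for idx, p in kept]

# Toy 5-token distribution: with top_p=0.9, the two rarest tokens are excluded
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
filtered = top_p_filter(probs, 0.9)
print(filtered)  # three tokens survive, renormalized to sum to 1.0
```

Lower top_p values trim more of the distribution's tail and make output more focused; top_p=1.0 disables the filter entirely.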

2. The OpenAI-Compatible Server

vLLM ships with a built-in HTTP server that mirrors the OpenAI Chat Completions and Completions API. This means you can point any application that uses the OpenAI Python SDK at your local vLLM instance by changing the base_url. No other code changes are needed.

To start the server, use the vllm serve CLI command.

# Start the OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --dtype auto

Once the server is running, you can send requests using curl or any HTTP client. The following example demonstrates using the OpenAI Python SDK to chat with the locally hosted model.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)

# Use the standard Chat Completions API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search function in Python."},
    ],
    temperature=0.3,
    max_tokens=512,
)

print(response.choices[0].message.content)
Example output:

Here's a binary search function in Python:

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

This function takes a sorted array and a target value...
Practical Example

Streaming is also supported. Replace client.chat.completions.create(...) with client.chat.completions.create(..., stream=True) and iterate over the response chunks. This gives users a token-by-token experience identical to the OpenAI API.
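A minimal streaming sketch, assuming the server from above is running on localhost:8000:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=True returns an iterator of chunks instead of a single response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain recursion in two sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g., the role header or final chunk) carry no text
        print(delta, end="", flush=True)
print()
```

Printing each delta as it arrives is what produces the familiar token-by-token typing effect in chat UIs.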

3. Model Loading and Configuration

vLLM can load models from HuggingFace Hub, local directories, or S3-compatible storage. The LLM constructor and the vllm serve CLI accept several important configuration flags that control memory usage and performance.

from vllm import LLM

# Load a quantized model with tensor parallelism across 2 GPUs
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,        # Shard across 2 GPUs
    gpu_memory_utilization=0.85,   # Reserve 15% for overhead
    max_model_len=4096,            # Maximum context length
    dtype="float16",               # Use FP16 for non-quantized layers
    trust_remote_code=True,        # Required for some custom models
)
Flag                     Description
gpu_memory_utilization   Fraction of GPU memory to use for model weights and KV cache (0.0 to 1.0)
tensor_parallel_size     Number of GPUs for tensor parallelism; the model is sharded across them
max_model_len            Maximum sequence length; lower values free memory for larger batches
quantization             Quantization method: "gptq", "awq", "squeezellm", or None
enforce_eager            Disable CUDA graph capture; useful for debugging or variable-length workloads
swap_space               CPU swap space in GB for offloading KV cache when GPU memory is exhausted

Figure S.1.3: Key vLLM configuration parameters for model loading and memory management.
Warning

Setting gpu_memory_utilization too high (above 0.95) can cause out-of-memory errors under bursty load, because the scheduler may attempt to admit more requests than the remaining KV cache space can accommodate. A value of 0.85 to 0.90 provides a good balance between throughput and stability.
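To see why headroom matters, a back-of-envelope KV cache calculation helps. The sketch below uses Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with FP16 caches; the 24 GB GPU and ~16 GB weight footprint are illustrative assumptions.

```python
# Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

# On an illustrative 24 GB GPU, the weights (~16 GB for an 8B model in FP16)
# come first; only what remains of the utilization budget can hold KV cache
gpu_gb, weights_gb = 24, 16
for util in (0.85, 0.95):
    budget_bytes = (gpu_gb * util - weights_gb) * 1024**3
    max_tokens = int(budget_bytes // kv_bytes_per_token)
    print(f"utilization={util}: ~{max_tokens:,} cacheable tokens")
```

The jump from 0.85 to 0.95 looks attractive on paper, but the higher setting leaves almost no margin for activation spikes or fragmentation, which is exactly when the out-of-memory errors described above occur.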

4. Benchmarking vLLM Throughput

vLLM's source repository includes a benchmarking script that measures tokens per second for both prefill (prompt processing) and decode (token generation) phases. The following commands start the server and benchmark it with synthetic requests.

# Start the OpenAI-compatible server in the background
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct &

# Benchmark with 100 requests, input length 512, output length 128
# (flag names vary between vLLM versions; run the script with --help to confirm)
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 100 \
    --input-len 512 \
    --output-len 128

Typical results on a single A100-80GB GPU for Llama 3.1 8B show throughput of approximately 2,000 to 3,500 output tokens per second with 32 concurrent requests, depending on sequence lengths and quantization settings. This compares to roughly 150 to 300 tokens per second with naive HuggingFace generate().
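As a sanity check on those figures, the wall-clock implication is easy to work out. The throughput numbers below are the quoted assumptions from above, not new measurements.

```python
num_requests, output_len = 100, 128  # the benchmark workload from above
total_tokens = num_requests * output_len

# Low end of the quoted vLLM range vs. a naive HuggingFace generate() loop
for label, tok_per_sec in [("vLLM (low end)", 2000), ("naive generate()", 200)]:
    print(f"{label}: {total_tokens} tokens in ~{total_tokens / tok_per_sec:.1f}s")
```

Even at the conservative end of both ranges, the batch finishes in seconds rather than the minute-plus a sequential loop would need.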

Summary

vLLM transforms LLM serving into a high-throughput system by combining PagedAttention and continuous batching (both covered in depth in Section 9.2) with an OpenAI-compatible API. Because that API mirrors OpenAI's, vLLM acts as a drop-in replacement for cloud-hosted models: pointing an existing client at the local server requires changing only its base_url. In the next section, we examine HuggingFace's Text Generation Inference (TGI), which takes a different architectural approach to the same serving challenge.