vLLM is an open-source inference engine that achieves state-of-the-art serving throughput through PagedAttention and continuous batching (see Section 9.2 for the theory behind both techniques). Together, these techniques deliver 2x to 24x higher throughput than naive HuggingFace Transformers inference. vLLM also exposes an OpenAI-compatible API, making it a drop-in replacement for cloud-hosted models.
For the theoretical foundations of PagedAttention, KV cache memory management, and continuous batching, see Section 9.2: KV Cache & Memory Optimization. This section focuses on the practical setup and usage of vLLM as a serving engine.
1. Installing and Running vLLM
vLLM supports Linux with CUDA 11.8 or later. The simplest installation path is through pip. The following command installs vLLM along with its dependencies.
# Install vLLM (requires CUDA 11.8+ and Linux)
pip install vllm
Once installed, you can verify the installation and check which GPU devices are visible.
import vllm
print(f"vLLM version: {vllm.__version__}")

from vllm import LLM
# Instantiating LLM (shown in the next section) logs the visible GPU
# devices and the memory profile during model loading.
1.1 Offline (Batch) Inference
The simplest way to use vLLM is for offline batch inference, where you have a list of prompts and want
to generate completions as fast as possible. The LLM class handles model loading,
tokenization, and generation in a single interface.
from vllm import LLM, SamplingParams
# Load the model (downloads from HuggingFace on first run)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
    stop=["\n\n"],  # Stop generation at double newline
    presence_penalty=0.1,
)
# Batch of prompts to process
prompts = [
    "Explain the difference between TCP and UDP in one paragraph.",
    "Write a Python function to calculate the Fibonacci sequence.",
    "What are the three laws of thermodynamics?",
    "Translate 'Hello, how are you?' into French, German, and Japanese.",
]
# Generate completions (vLLM handles batching internally)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:60]}...")
    print(f"Output: {generated[:200]}")
    print()
Behind the scenes, vLLM applies PagedAttention and continuous batching to process all four prompts
concurrently, fully utilizing the GPU. On an A100 GPU, this batch completes roughly 8x faster than
sequential model.generate() calls.
1.2 Sampling Parameters
The SamplingParams class provides fine-grained control over text generation. The table
below summarizes the most commonly used parameters.
| Parameter | Default | Description |
|---|---|---|
| temperature | 1.0 | Controls randomness; lower values produce more deterministic output |
| top_p | 1.0 | Nucleus sampling; considers the smallest set of tokens whose cumulative probability reaches this threshold |
| top_k | -1 | Limits sampling to the k most probable tokens (-1 disables) |
| max_tokens | 16 | Maximum number of tokens to generate |
| stop | None | List of strings that trigger generation to stop |
| presence_penalty | 0.0 | Penalizes tokens that have already appeared |
| frequency_penalty | 0.0 | Penalizes tokens in proportion to their frequency |
| best_of | 1 | Generates N candidates and returns the one with the highest log-probability |
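To make the interaction between temperature, top_k, and top_p concrete, here is a plain-Python sketch of the sampling pipeline these parameters control. This is an illustrative re-implementation, not vLLM's actual code (vLLM applies these filters on the GPU over full vocabulary logits).

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, top_k=-1, rng=None):
    """Illustrative temperature / top_k / top_p sampling over raw logits."""
    rng = rng or random
    # Temperature rescales logits before softmax; values near 0 approach greedy.
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Rank (probability, token_index) pairs from most to least probable.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # top_k: keep only the k most probable tokens (-1 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p (nucleus): keep the smallest prefix whose cumulative mass >= top_p.
    kept, mass = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens and draw one.
    mass = sum(p for p, _ in kept)
    r = rng.random() * mass
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

With temperature near 0 the highest logit always wins, and top_k=1 is equivalent to greedy decoding regardless of temperature, which is why low-temperature settings are preferred for tasks like code generation.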
2. The OpenAI-Compatible Server
vLLM ships with a built-in HTTP server that mirrors the OpenAI Chat Completions and Completions API.
This means you can point any application that uses the OpenAI Python SDK at your local vLLM instance by
changing the base_url. No other code changes are needed.
To start the server, use the vllm serve CLI command.
# Start the OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --dtype auto
Once the server is running, you can send requests using curl or any HTTP client. The following example demonstrates using the OpenAI Python SDK to chat with the locally hosted model.
from openai import OpenAI
# Point the OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)
# Use the standard Chat Completions API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search function in Python."},
    ],
    temperature=0.3,
    max_tokens=512,
)
print(response.choices[0].message.content)
Streaming is also supported. Replace client.chat.completions.create(...) with
client.chat.completions.create(..., stream=True) and iterate over the response chunks.
This gives users a token-by-token experience identical to the OpenAI API.
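As a concrete sketch, the loop below prints each streamed delta as it arrives and accumulates the full reply. It assumes the vLLM server from above is running at localhost:8000; the collect_stream helper is a name introduced here for illustration, not part of vLLM or the OpenAI SDK.

```python
def collect_stream(stream):
    """Print and concatenate the incremental text deltas from a chat stream."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

if __name__ == "__main__":
    # Requires a running vLLM server (see the vllm serve command above).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        stream=True,
    )
    collect_stream(stream)
```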
3. Model Loading and Configuration
vLLM can load models from HuggingFace Hub, local directories, or S3-compatible storage. The
LLM constructor and the vllm serve CLI accept several important
configuration flags that control memory usage and performance.
from vllm import LLM
# Load a quantized model with tensor parallelism across 2 GPUs
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,        # Shard across 2 GPUs
    gpu_memory_utilization=0.85,   # Reserve 15% for overhead
    max_model_len=4096,            # Maximum context length
    dtype="float16",               # Use FP16 for non-quantized layers
    trust_remote_code=True,        # Required for some custom models
)
| Flag | Description |
|---|---|
| gpu_memory_utilization | Fraction of GPU memory to use for model weights and KV cache (0.0 to 1.0) |
| tensor_parallel_size | Number of GPUs for tensor parallelism; the model is sharded across them |
| max_model_len | Maximum sequence length; lower values free memory for larger batches |
| quantization | Quantization method: "gptq", "awq", "squeezellm", or None |
| enforce_eager | Disable CUDA graph capture; useful for debugging or variable-length workloads |
| swap_space | CPU swap space in GB for offloading KV cache when GPU memory is exhausted |
Setting gpu_memory_utilization too high (above 0.95) can cause out-of-memory errors under bursty load, because the scheduler may attempt to admit more requests than the remaining KV cache space can accommodate. A value of 0.85 to 0.90 provides a good balance between throughput and stability.
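The trade-off becomes clearer with a back-of-the-envelope KV cache budget. The sketch below uses illustrative shape numbers for Llama 3.1 8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128, roughly 16 GB of FP16 weights); it ignores activation and CUDA-graph overhead, which is exactly the headroom the warning above is about.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores one K and one V vector per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_capacity(gpu_gb, utilization, weights_gb, bytes_per_token):
    # Memory vLLM is allowed to use, minus the weights, is the KV cache budget.
    budget_bytes = (gpu_gb * utilization - weights_gb) * 1024**3
    return int(budget_bytes // bytes_per_token)

# Illustrative numbers: Llama 3.1 8B on an 80 GB GPU at 0.90 utilization
per_token = kv_bytes_per_token(32, 8, 128)  # 131072 bytes = 128 KiB per token
capacity = kv_token_capacity(80, 0.90, 16, per_token)
print(f"{per_token} B/token, ~{capacity:,} cacheable tokens")
```

At 128 KiB per token, a few hundred thousand tokens of KV cache fit in the budget; every point of utilization you give back to the system shrinks that pool, while every point you claim risks OOM under bursty admission.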
4. Benchmarking vLLM Throughput
vLLM includes a built-in benchmarking script that measures tokens per second for both prefill (prompt processing) and decode (token generation) phases. The following command benchmarks the server with synthetic requests.
# Start the server, then benchmark it with 100 requests
# (benchmark_serving.py ships in the benchmarks/ directory of the vLLM repository)
vllm serve meta-llama/Llama-3.1-8B-Instruct &
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 100 \
    --input-len 512 \
    --output-len 128
Typical results on a single A100-80GB GPU for Llama 3.1 8B show throughput of approximately 2,000 to
3,500 output tokens per second with 32 concurrent requests, depending on sequence lengths and
quantization settings. This compares to roughly 150 to 300 tokens per second with naive
HuggingFace generate().
Summary
vLLM transforms LLM serving into a high-throughput system by combining PagedAttention and continuous batching (both covered in depth in Section 9.2) with an OpenAI-compatible API. Because any application built on the OpenAI SDK can be repointed at a vLLM endpoint by changing only its base_url, it functions as a drop-in replacement for cloud-hosted models with zero code changes. In the next section, we examine HuggingFace's Text Generation Inference (TGI), which takes a different architectural approach to the same serving challenge.