LLM inference servers like vLLM, Text Generation Inference (TGI), and Ollama are designed to run inside containers. Each provides official Docker images that handle GPU configuration, model loading, and API serving. This section covers practical patterns for deploying these servers in containers, including model weight management, GPU resource allocation, quantized model serving, and exposing OpenAI-compatible endpoints.
1. vLLM in Docker
vLLM publishes official Docker images on Docker Hub under the vllm/vllm-openai repository.
These images include the vLLM engine, its OpenAI-compatible API server, and all necessary CUDA
dependencies. Running a vLLM container requires GPU access and a model specification.
The following command launches vLLM with Llama 3.1 8B, exposes the API on port 8000, and mounts a persistent cache for downloaded model weights.
# Run vLLM with Llama 3.1 8B Instruct
docker run -d \
--name vllm-server \
--gpus '"device=0"' \
-v hf-cache:/root/.cache/huggingface \
-p 8000:8000 \
-e HF_TOKEN=${HF_TOKEN} \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype auto
# Verify the server is running (may take 1-2 minutes to load)
curl http://localhost:8000/v1/models
Once the server is ready, it accepts requests that follow the OpenAI API format. Any application
built against the OpenAI SDK can point to this container by changing the base_url to
http://localhost:8000/v1.
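To make that concrete, here is a minimal stdlib sketch of the OpenAI wire format the container speaks; the prompt and `max_tokens` value are illustrative, and the actual call is commented out because it needs the running container:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires the vllm-server container from above to be running):
# req = chat_request("http://localhost:8000/v1",
#                    "meta-llama/Llama-3.1-8B-Instruct", "What is Docker?")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same request works unchanged against the official OpenAI SDK by passing `base_url="http://localhost:8000/v1"` to its client constructor.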
Model weight management is the biggest operational challenge for containerized LLMs. A 7B parameter model in float16 requires approximately 14 GB of storage. Without a persistent volume for the HuggingFace cache, Docker downloads the full model every time you recreate the container. Always mount a named volume at /root/.cache/huggingface.
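The sizing rule behind that 14 GB figure is simply parameter count times bytes per parameter. The helper below makes the arithmetic explicit (an illustrative sketch covering weights only; it ignores tokenizer files and per-format overhead such as quantization scales):

```python
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # e.g. GPTQ/AWQ 4-bit, ignoring group-scale overhead
}

def weight_size_gb(num_params_billions: float, dtype: str) -> float:
    """Approximate storage for model weights alone, in gigabytes."""
    return num_params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# A 7B model in float16 is ~14 GB; the same model in 4-bit is ~3.5 GB.
```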
2. Text Generation Inference (TGI) in Docker
HuggingFace's TGI provides its own Docker image optimized for transformer model serving. TGI uses Rust for the HTTP server and includes Flash Attention, continuous batching, and token streaming out of the box. The image is available on GitHub Container Registry.
# Run TGI with Mistral 7B Instruct
docker run -d \
--name tgi-server \
--gpus all \
-v tgi-cache:/data \
-p 8080:80 \
-e HF_TOKEN=${HF_TOKEN} \
ghcr.io/huggingface/text-generation-inference:2.4 \
--model-id mistralai/Mistral-7B-Instruct-v0.3 \
--max-input-tokens 2048 \
--max-total-tokens 4096 \
--max-batch-prefill-tokens 4096
# Test with a generation request
curl http://localhost:8080/generate \
-H 'Content-Type: application/json' \
-d '{"inputs": "What is Docker?", "parameters": {"max_new_tokens": 100}}'
TGI stores downloaded models in the /data directory inside the container, so mounting a persistent volume at this path prevents re-downloading. TGI also exposes an OpenAI-compatible Messages API at /v1/chat/completions; it has been enabled by default since version 1.4 and requires no extra flag.
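Both vLLM and TGI can take minutes to load weights before they accept traffic, and both expose a /health endpoint. A small generic poller can gate downstream startup on readiness (a sketch; the URL and timeout values are illustrative):

```python
import time
import urllib.request
from typing import Callable

def wait_for_ready(probe: Callable[[], bool], timeout: float = 300.0,
                   interval: float = 2.0) -> bool:
    """Call `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def http_ok(url: str) -> bool:
    """True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Usage: wait_for_ready(lambda: http_ok("http://localhost:8080/health"))
```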
3. Ollama in Docker
Ollama provides the simplest Docker experience for running LLMs locally. It manages model downloads, quantization, and serving through a single binary. Ollama is especially useful for development environments where you need a quick, self-contained LLM without manual model management.
# Run Ollama server
docker run -d \
--name ollama \
--gpus all \
-v ollama-data:/root/.ollama \
-p 11434:11434 \
ollama/ollama:latest
# Pull a model (runs inside the container)
docker exec ollama ollama pull llama3.1:8b
# Test generation
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "Explain containers in one sentence."}'
# Ollama also supports the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "What is Docker?"}]
}'
Ollama automatically serves quantized models (GGUF format) and selects the appropriate quantization level based on available GPU memory. For a 7B model, Ollama typically uses Q4_K_M quantization, which requires only about 4.5 GB of VRAM. This makes Ollama ideal for development on consumer GPUs with 8 GB of memory.
4. Comparing Inference Server Containers
Each inference server has different strengths. The following table compares key characteristics to help you choose the right one for your use case.
| Feature | vLLM | TGI | Ollama |
|---|---|---|---|
| Primary use case | High-throughput production serving | Production serving with HF ecosystem | Local development and testing |
| OpenAI-compatible API | Yes (native) | Yes (built-in) | Yes (built-in) |
| Continuous batching | Yes (PagedAttention) | Yes (Flash Attention) | Limited |
| Tensor parallelism | Yes (multi-GPU) | Yes (multi-GPU) | No |
| Quantization support | GPTQ, AWQ, SqueezeLLM | GPTQ, AWQ, EETQ | GGUF (automatic) |
| Model management | Manual (HF Hub) | Manual (HF Hub) | Built-in (ollama pull) |
| Minimum GPU memory | ~16 GB (7B FP16) | ~16 GB (7B FP16) | ~5 GB (7B Q4) |
| Image size | ~8 GB | ~10 GB | ~2 GB |
5. Model Weight Mounting Strategies
For production deployments, downloading model weights at container startup is unreliable (network failures, rate limits) and slow (minutes to hours for large models). Three strategies avoid this problem.
The first strategy is to pre-download weights to a host directory and bind-mount them. This works well for single-machine deployments.
# Pre-download model weights to the host
docker run --rm \
-v /opt/models:/models \
-e HF_TOKEN=${HF_TOKEN} \
python:3.11-slim \
sh -c "pip install huggingface_hub && \
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b"
# Mount the pre-downloaded weights
docker run -d --gpus all \
-v /opt/models/llama-3.1-8b:/models/llama-3.1-8b:ro \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/llama-3.1-8b
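Before mounting a pre-downloaded directory, it is worth verifying the download actually completed: a HuggingFace-format checkpoint needs at least a config.json, a tokenizer, and one or more weight shards. A minimal heuristic check (illustrative; exact file names vary by model):

```python
from pathlib import Path

def looks_complete(model_dir: str) -> bool:
    """Heuristic check that a HF-format model directory is usable."""
    d = Path(model_dir)
    has_config = (d / "config.json").is_file()
    has_tokenizer = any((d / name).is_file()
                        for name in ("tokenizer.json", "tokenizer.model"))
    has_weights = any(d.glob("*.safetensors")) or any(d.glob("*.bin"))
    return has_config and has_tokenizer and has_weights
```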
The second strategy bakes the model weights directly into the Docker image. This creates a large (20+ GB) but entirely self-contained image that can be deployed to any machine without network access to HuggingFace.
# Dockerfile that bakes model weights into the image
FROM vllm/vllm-openai:latest
# Download model weights during build
RUN pip install huggingface_hub
ARG HF_TOKEN
RUN HF_TOKEN=${HF_TOKEN} huggingface-cli download \
meta-llama/Llama-3.1-8B-Instruct \
--local-dir /models/llama-3.1-8b
CMD ["--model", "/models/llama-3.1-8b", "--max-model-len", "4096"]
When baking model weights into an image, pass the HuggingFace token as a build argument (--build-arg HF_TOKEN=...), not as an ENV instruction: build arguments do not become environment variables in the final image. They are not fully secret, however. Docker's documentation warns that build arguments are visible in docker history, so treat a token passed this way as exposed to anyone who can pull the image. If you use multi-stage builds, ensure the token is used only in the build stage and never copied into the runtime stage.
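An alternative that keeps the token out of the image entirely is a BuildKit secret mount, which exposes the secret only while its RUN step executes and leaves no trace in any layer. A sketch (the secret id hf_token is an arbitrary name chosen here; huggingface-cli reads the token from the HF_TOKEN variable):

```dockerfile
# syntax=docker/dockerfile:1
FROM vllm/vllm-openai:latest
RUN pip install huggingface_hub
# The secret is mounted at /run/secrets/hf_token for this step only
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) huggingface-cli download \
      meta-llama/Llama-3.1-8B-Instruct --local-dir /models/llama-3.1-8b
CMD ["--model", "/models/llama-3.1-8b", "--max-model-len", "4096"]
```

Build with `docker build --secret id=hf_token,env=HF_TOKEN -t llama-baked .` (recent Docker versions; older ones accept `src=<file>` instead of `env=`).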
6. Running Quantized Models in Containers
Quantization reduces model memory requirements by representing weights in lower precision (4-bit or 8-bit instead of 16-bit). This enables running larger models on smaller GPUs. Each inference server supports different quantization formats.
# vLLM with a GPTQ-quantized 70B model on a single GPU
docker run -d --gpus '"device=0"' \
-v hf-cache:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-70B-Chat-GPTQ \
--quantization gptq \
--max-model-len 2048 \
--gpu-memory-utilization 0.95
# TGI with an AWQ-quantized model
docker run -d --gpus all \
-v tgi-cache:/data \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:2.4 \
--model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantize awq
# Ollama with a Q4 quantized model (automatic)
docker exec ollama ollama pull llama3.1:8b-instruct-q4_K_M
A 70B parameter model in float16 requires approximately 140 GB of VRAM, which means at least two A100-80GB GPUs. With GPTQ 4-bit quantization, the same model fits on a single A100 (approximately 35 GB). In a Docker container with vLLM, pass --quantization gptq and --tensor-parallel-size 1. The throughput penalty for 4-bit quantization is typically 10 to 20% compared to float16.
7. Multi-GPU Inference in Containers
For models that exceed single-GPU memory (even with quantization), both vLLM and TGI support tensor parallelism across multiple GPUs within a single container. The container needs access to all target GPUs.
# vLLM with tensor parallelism across 4 GPUs
docker run -d \
--gpus '"device=0,1,2,3"' \
--shm-size=16g \
-v hf-cache:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
# TGI with tensor parallelism across 2 GPUs
docker run -d \
--gpus '"device=0,1"' \
--shm-size=8g \
-v tgi-cache:/data \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:2.4 \
--model-id meta-llama/Llama-3.1-70B-Instruct \
--num-shard 2
The --shm-size flag is critical for multi-GPU containers. Tensor parallelism uses shared
memory for inter-GPU communication via NCCL. Docker's default shared memory size (64 MB) is far too
small. Set it to at least 1 GB per GPU, and 4 to 16 GB total for reliable operation.
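That rule of thumb can be captured in a small helper (illustrative only; actual NCCL shared-memory needs depend on the model and batch sizes, and the per-GPU default below is chosen to match the 8 GB and 16 GB values used in the examples above):

```python
def recommended_shm_gb(num_gpus: int, per_gpu_gb: int = 4,
                       floor_gb: int = 4, cap_gb: int = 16) -> int:
    """Shared-memory size for a tensor-parallel container: roughly
    `per_gpu_gb` per GPU, at least `floor_gb`, capped at `cap_gb`."""
    return min(max(num_gpus * per_gpu_gb, floor_gb), cap_gb)

# e.g. 4 GPUs -> 16 GB, matching --shm-size=16g in the vLLM example.
```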
8. Optimizing Container Image Size
LLM server images are inherently large because they bundle CUDA libraries and PyTorch. Several practices, however, keep a custom image from growing far beyond that irreducible baseline.
# Optimized Dockerfile for a custom vLLM wrapper
FROM vllm/vllm-openai:latest
# Note: running `pip uninstall` here would NOT shrink the image. The
# files remain in the base image's layers; a deletion in a derived
# layer only adds whiteout entries on top. Slim the base image itself,
# or use a multi-stage build, to actually reclaim space.
# Install only the additional packages you need
COPY requirements-extra.txt .
RUN pip install --no-cache-dir -r requirements-extra.txt
# Copy only the application code
COPY src/ /app/src/
# Set a non-root user for security
RUN useradd -m appuser
USER appuser
CMD ["python", "-m", "src.custom_server"]
Use docker image history <image> to see the size of each layer and identify what is consuming the most space. Development packages like Jupyter, matplotlib, and testing libraries are often included in base images but not needed in production; trimming them from the base image itself (or building from a slimmer base) can save 500 MB to 1 GB. Remember that deleting files in a later layer never reduces the total image size, because earlier layers are immutable.
Summary
Containerizing LLM inference servers is straightforward thanks to official Docker images from vLLM, TGI, and Ollama. The primary operational challenge is managing multi-gigabyte model weights; persistent volumes, pre-downloaded bind mounts, and baked-in-image approaches each have their place depending on deployment context. Quantized models enable serving larger models on smaller GPUs, and tensor parallelism distributes models across multiple GPUs within a single container. In the final section, we move beyond single-machine Docker to explore container orchestration with Docker Swarm and Kubernetes for production ML deployments.