Building Conversational AI with LLMs and Agents
Appendix U: Docker and Containers for LLM Deployment

Writing Dockerfiles for ML and LLM Projects

Big Picture

A Dockerfile is the recipe that defines how to build a Docker image. For ML projects, writing an efficient Dockerfile requires careful attention to layer ordering, dependency caching, base image selection, and GPU support configuration. A well-structured Dockerfile can reduce build times from 30 minutes to under 2 minutes and cut image sizes by 60% or more.

1. Dockerfile Syntax and Structure

A Dockerfile is a plain text file containing a sequence of instructions. Each instruction creates a new layer in the image. The most important instructions for ML projects are FROM (base image), RUN (execute commands), COPY (add files), ENV (set environment variables), WORKDIR (set working directory), EXPOSE (declare ports), and CMD (default command).

The following Dockerfile builds an image for a simple ML inference service. Each line is annotated with its purpose.

# Start from the official Python 3.11 slim image (Debian-based, ~150 MB)
FROM python:3.11-slim

# Set environment variables to prevent Python from buffering output
# and to disable pip's cache for smaller image size
ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Set the working directory inside the container
WORKDIR /app

# Install system dependencies required by ML libraries
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (this layer is cached if requirements don't change)
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application code (this layer changes frequently)
COPY src/ ./src/
COPY config/ ./config/

# Expose the API port
EXPOSE 8000

# Default command to run the inference server
CMD ["python", "-m", "src.serve", "--host", "0.0.0.0", "--port", "8000"]
Key Insight

Layer ordering matters enormously for build speed. Instructions that change infrequently (system packages, Python dependencies) should appear before instructions that change often (application code). Docker caches layers from the top down and invalidates the cache from the first changed layer onward. By placing COPY requirements.txt before COPY src/, you avoid reinstalling all Python packages every time you edit a source file.
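For contrast, here is a sketch of the ordering that defeats the cache. Because the full source tree is copied before dependencies are installed, every source edit invalidates the COPY layer and forces a full reinstall on every build:

```dockerfile
# Anti-pattern: do not do this
FROM python:3.11-slim
WORKDIR /app
COPY . .                              # invalidated by any code edit...
RUN pip install -r requirements.txt   # ...so this reruns on every build
CMD ["python", "-m", "src.serve"]
```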

2. Choosing Base Images for ML Workloads

The base image determines the operating system, pre-installed libraries, and image size. For ML projects, three categories of base images are common.

Base Image                                Size     Use Case
python:3.11-slim                          ~150 MB  CPU-only inference, lightweight services
nvidia/cuda:12.4.1-runtime-ubuntu22.04    ~3.5 GB  GPU inference with custom Python setup
nvidia/cuda:12.4.1-devel-ubuntu22.04      ~5.5 GB  GPU training (includes compiler toolchain)
huggingface/transformers-pytorch-gpu      ~8 GB    HuggingFace ecosystem, ready to use
nvcr.io/nvidia/pytorch:24.01-py3          ~15 GB   Full PyTorch stack with NCCL, cuDNN, Apex
Figure U.2.1: Common base images for ML Docker containers, ordered by size. Choose the smallest image that meets your requirements.

For production inference, prefer the runtime variant of CUDA images over the devel variant. The devel images include the CUDA compiler (nvcc) and header files needed for building custom CUDA kernels, but these add 2 GB or more to the image. If your application only runs pre-compiled models, the runtime libraries are sufficient.

3. GPU Passthrough with the NVIDIA Container Toolkit

Docker containers cannot access GPUs by default. The NVIDIA Container Toolkit (formerly nvidia-docker) provides a runtime hook that exposes host GPUs to containers. You must install it on the host machine before running GPU containers.

# Install the NVIDIA Container Toolkit on Ubuntu
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list" \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access inside a container
docker run --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi

The --gpus flag controls which GPUs are visible to the container. You can pass all for all GPUs, '"device=0"' for a specific GPU by index, or '"device=0,2"' for multiple specific GPUs.
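The common forms look like this (mymodel:v1 stands in for any image built in this appendix):

```shell
# All host GPUs
docker run --gpus all mymodel:v1

# A single GPU by index; the outer single quotes keep the shell
# from stripping the inner double quotes
docker run --gpus '"device=0"' mymodel:v1

# Specific GPUs by index
docker run --gpus '"device=0,2"' mymodel:v1

# Any two GPUs, by count rather than index
docker run --gpus 2 mymodel:v1
```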

Tip

The NVIDIA driver on the host must be compatible with the CUDA version in the container image. The container does not include the GPU driver itself; it uses the host driver. Check compatibility at NVIDIA's CUDA Compatibility page. As a rule of thumb, driver version 535+ supports CUDA 12.x containers.

4. Multi-Stage Builds for Smaller Images

ML images can easily exceed 10 GB because of compilation toolchains, development headers, and intermediate build artifacts. Multi-stage builds let you use a large build image to compile dependencies, then copy only the compiled artifacts into a smaller runtime image. This technique can reduce final image size by 50% or more.

# Stage 1: Build stage with full development tools
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y python3 python3-pip python3-venv

# Create a virtual environment for clean dependency isolation
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies (some may compile C extensions)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: Runtime stage with minimal footprint
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy the virtual environment from the build stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy application code
WORKDIR /app
COPY src/ ./src/
COPY config/ ./config/

EXPOSE 8000
CMD ["python3", "-m", "src.serve"]
┌──────────────────────────────┐      ┌──────────────────────────────┐
│       Build Stage            │      │      Runtime Stage           │
│  nvidia/cuda:...-devel       │      │  nvidia/cuda:...-runtime     │
│  (5.5 GB base)               │      │  (3.5 GB base)               │
│                              │      │                              │
│  + python3, pip, gcc         │      │  + python3, libgomp          │
│  + compiled wheels           │ ───> │  + /opt/venv (from builder)  │
│  + header files              │ COPY │  + application code          │
│  + build artifacts           │      │                              │
│                              │      │  Final: ~4.5 GB              │
│  Total: ~8 GB (discarded)    │      │  (vs. ~8 GB single-stage)    │
└──────────────────────────────┘      └──────────────────────────────┘
Figure U.2.2: Multi-stage builds use a large build image to compile dependencies, then copy only the compiled results into a smaller runtime image. The build stage is discarded, saving several gigabytes.
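When debugging a multi-stage build, note that a plain docker build produces only the final stage. The --target flag stops the build at a named stage, which is handy for inspecting the builder (llm-server:builder-debug is an example tag):

```shell
# Build only the "builder" stage and drop into it for inspection
docker build --target builder -t llm-server:builder-debug .
docker run --rm -it llm-server:builder-debug bash
```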

5. The .dockerignore File

When you run docker build, Docker sends the entire build context (the directory containing the Dockerfile) to the Docker daemon. For ML projects, this directory may contain large datasets, model checkpoints, or virtual environments that should not be included in the image. A .dockerignore file specifies patterns to exclude from the build context, similar to .gitignore.

# .dockerignore for ML projects

# Python artifacts
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/

# Data and models (mount these as volumes instead)
data/
datasets/
models/
checkpoints/
*.pt
*.pth
*.onnx
*.safetensors

# Development tools
.git/
.github/
.vscode/
.idea/
*.md
Makefile
docker-compose*.yml

# Environment files with secrets
.env
.env.*

# Jupyter artifacts
.ipynb_checkpoints/
*.ipynb
Warning

Forgetting a .dockerignore file is one of the most common mistakes in ML Docker projects. Without it, a COPY . . instruction will copy your entire 50 GB dataset into the image, inflating build times and image size. Always create .dockerignore before your first build.
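As a rough illustration of how exclusion patterns prune the build context, the sketch below applies a few of the patterns above using Python's fnmatch. This is an approximation for intuition only; Docker's actual matcher follows Go's filepath.Match rules plus ** support.

```python
import fnmatch

# A few patterns from the .dockerignore above. Directory patterns
# ending in "/" are treated here as "exclude anything under that dir".
PATTERNS = ["__pycache__/", "*.pt", "data/", ".env"]

def excluded(path, patterns=PATTERNS):
    """Return True if `path` would be excluded from the build context."""
    for pat in patterns:
        if pat.endswith("/"):
            # directory pattern: match the path prefix or any parent dir
            if path.startswith(pat) or f"/{pat}" in f"/{path}":
                return True
        elif fnmatch.fnmatch(path.rsplit("/", 1)[-1], pat):
            # file pattern: match against the basename
            return True
    return False

for p in ["src/serve.py", "data/train.parquet", "models/llama.pt", ".env"]:
    print(p, "EXCLUDED" if excluded(p) else "kept")
```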

6. Optimizing pip Install for Caching

Python dependency installation is often the slowest step in an ML Docker build. PyTorch alone can take several minutes to download and install. Two techniques dramatically speed up repeated builds.

First, copy requirements.txt separately from the rest of your code. This ensures that the pip install layer is cached as long as your dependencies do not change. Second, use Docker BuildKit's cache mount feature to persist the pip download cache across builds, even when requirements change.

# syntax=docker/dockerfile:1
# The syntax directive must be the very first line of the Dockerfile;
# it enables BuildKit features such as the cache mount used below.
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .

# Mount pip cache as a BuildKit cache volume
# Downloaded wheels persist across builds, speeding up installs
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY src/ ./src/
CMD ["python", "-m", "src.serve"]

To use BuildKit cache mounts, enable BuildKit by setting the environment variable DOCKER_BUILDKIT=1 or by using docker buildx build instead of docker build.
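Concretely, either invocation below builds with BuildKit enabled, so the cache mount in the Dockerfile takes effect (llm-server:dev is an example tag):

```shell
# Classic builder with BuildKit switched on for this invocation
DOCKER_BUILDKIT=1 docker build -t llm-server:dev .

# buildx always uses BuildKit
docker buildx build -t llm-server:dev .
```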

7. Environment Variables and Configuration

ML containers often need configuration values such as model paths, API keys, batch sizes, and feature flags. Docker provides two mechanisms: ENV in the Dockerfile (baked into the image) and -e or --env-file at runtime (set per container).

# Baked-in defaults (can be overridden at runtime)
ENV MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
    MAX_MODEL_LEN=4096 \
    TENSOR_PARALLEL_SIZE=1 \
    LOG_LEVEL=info

# Override at runtime with -e flags
docker run --gpus all \
    -e MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.3" \
    -e TENSOR_PARALLEL_SIZE=2 \
    -e HF_TOKEN=hf_abc123 \
    mymodel:v1

# Or use an environment file
docker run --gpus all --env-file .env.production mymodel:v1
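On the application side, these values arrive as ordinary process environment variables. A hypothetical loader in src.serve (the variable names match the ENV defaults above; the function itself is illustrative) might look like:

```python
import os

def load_config(env=os.environ):
    """Read server settings from the environment, falling back to the
    defaults baked into the image with ENV."""
    return {
        "model_name": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "max_model_len": int(env.get("MAX_MODEL_LEN", "4096")),
        "tensor_parallel_size": int(env.get("TENSOR_PARALLEL_SIZE", "1")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }

# With no overrides, the baked-in defaults apply:
cfg = load_config(env={})
print(cfg["model_name"], cfg["tensor_parallel_size"])

# A runtime -e flag simply shows up as an environment variable:
cfg = load_config(env={"TENSOR_PARALLEL_SIZE": "2"})
print(cfg["tensor_parallel_size"])
```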
Warning

Never bake API keys or tokens into a Dockerfile with ENV. Anyone who pulls your image can read those values with docker inspect. Instead, pass secrets at runtime using -e, --env-file, or Docker secrets (for Swarm and Kubernetes). Store your .env file outside the build context and list it in .dockerignore.

8. Building and Tagging Images

The docker build command reads a Dockerfile and produces an image. Tagging your images with meaningful version identifiers is essential for tracking which model version, code commit, or configuration is deployed in each environment.

# Build and tag with a version number
docker build -t llm-server:1.0.0 .

# Tag with the git commit hash for traceability
docker build -t llm-server:$(git rev-parse --short HEAD) .

# Apply multiple tags in a single build
docker build \
    -t myregistry.azurecr.io/llm-server:1.0.0 \
    -t myregistry.azurecr.io/llm-server:latest \
    .

# Push to a container registry
docker push myregistry.azurecr.io/llm-server:1.0.0
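In CI scripts it helps to build image references programmatically so that registry, repository, and tag stay consistent across build and push steps. A small illustrative helper (not part of any Docker API; the registry name is an example):

```python
def image_ref(name, tag, registry=None):
    """Compose a full image reference like 'registry/name:tag'."""
    repo = f"{registry}/{name}" if registry else name
    return f"{repo}:{tag}"

# Local tag
print(image_ref("llm-server", "1.0.0"))
# → llm-server:1.0.0

# Registry-qualified tag, e.g. from a short git commit hash
print(image_ref("llm-server", "a1b2c3d", registry="myregistry.azurecr.io"))
# → myregistry.azurecr.io/llm-server:a1b2c3d
```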

Summary

Writing effective Dockerfiles for ML workloads requires attention to layer ordering (stable dependencies first, volatile code last), base image selection (runtime vs. devel, slim vs. full), multi-stage builds for smaller images, and proper handling of GPU passthrough via the NVIDIA Container Toolkit. A well-crafted .dockerignore prevents accidental inclusion of large datasets, and BuildKit cache mounts speed up pip installs across builds. In the next section, we explore Docker Compose for orchestrating multi-container AI applications.