A Dockerfile is the recipe that defines how to build a Docker image. For ML projects, writing an efficient Dockerfile requires careful attention to layer ordering, dependency caching, base image selection, and GPU support configuration. A well-structured Dockerfile can reduce build times from 30 minutes to under 2 minutes and cut image sizes by 60% or more.
1. Dockerfile Syntax and Structure
A Dockerfile is a plain text file containing a sequence of instructions. Each instruction creates a new
layer in the image. The most important instructions for ML projects are FROM (base image),
RUN (execute commands), COPY (add files), ENV (set environment
variables), WORKDIR (set working directory), EXPOSE (declare ports), and
CMD (default command).
The following Dockerfile builds an image for a simple ML inference service. Each line is annotated with its purpose.
# Start from the official Python 3.11 slim image (Debian-based, ~150 MB)
FROM python:3.11-slim
# Set environment variables to prevent Python from buffering output
# and to disable pip's cache for smaller image size
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
# Set the working directory inside the container
WORKDIR /app
# Install system dependencies required by ML libraries
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
libgomp1 \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first (this layer is cached if requirements don't change)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application code (this layer changes frequently)
COPY src/ ./src/
COPY config/ ./config/
# Expose the API port
EXPOSE 8000
# Default command to run the inference server
CMD ["python", "-m", "src.serve", "--host", "0.0.0.0", "--port", "8000"]
Layer ordering matters enormously for build speed. Instructions that change infrequently (system packages, Python dependencies) should appear before instructions that change often (application code). Docker caches layers from the top down and invalidates the cache from the first changed layer onward. By placing COPY requirements.txt before COPY src/, you avoid reinstalling all Python packages every time you edit a source file.
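The cache-invalidation behavior described above can be illustrated with a toy model. The sketch below (a deliberate simplification — real Docker cache keys also hash file contents, not just instruction text) chains each layer's key to all keys before it, so changing an early instruction changes every key after it, while changing only the last instruction leaves the earlier layers cached.

```python
import hashlib

def layer_cache_keys(instructions):
    """Toy model of Docker's layer cache: each layer's key depends on
    its instruction text and on every layer that came before it."""
    keys, parent = [], ""
    for inst in instructions:
        key = hashlib.sha256((parent + inst).encode()).hexdigest()[:12]
        keys.append(key)
        parent = key
    return keys

good = [
    "FROM python:3.11-slim",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY src/ ./src/",
]
# Editing application code only changes the final COPY instruction.
edited = good[:3] + ["COPY src/ ./src/  # source changed"]

before, after = layer_cache_keys(good), layer_cache_keys(edited)
# The first three layers keep their cache keys; only the last is rebuilt.
print([b == a for b, a in zip(before, after)])  # [True, True, True, False]
```

Had `requirements.txt` changed instead (layer 2), layers 2 through 4 would all be rebuilt, which is exactly why the dependency copy belongs above the source copy.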
2. Choosing Base Images for ML Workloads
The base image determines the operating system, pre-installed libraries, and image size. For ML projects, three categories of base images are common.
| Base Image | Size | Use Case |
|---|---|---|
| python:3.11-slim | ~150 MB | CPU-only inference, lightweight services |
| nvidia/cuda:12.4.1-runtime-ubuntu22.04 | ~3.5 GB | GPU inference with custom Python setup |
| nvidia/cuda:12.4.1-devel-ubuntu22.04 | ~5.5 GB | GPU training (includes compiler toolchain) |
| nvcr.io/nvidia/pytorch:24.01-py3 | ~15 GB | Full PyTorch stack with NCCL, cuDNN, Apex |
| huggingface/transformers-pytorch-gpu | ~8 GB | HuggingFace ecosystem, ready to use |
For production inference, prefer the runtime variant of CUDA images over the
devel variant. The devel images include the CUDA compiler (nvcc)
and header files needed for building custom CUDA kernels, but these add 2 GB or more to the image. If
your application only runs pre-compiled models, the runtime libraries are sufficient.
3. GPU Passthrough with the NVIDIA Container Toolkit
Docker containers cannot access GPUs by default. The NVIDIA Container Toolkit (formerly nvidia-docker) provides a runtime hook that exposes host GPUs to containers. You must install it on the host machine before running GPU containers.
# Install the NVIDIA Container Toolkit on Ubuntu
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list" \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU access inside a container
docker run --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
The --gpus flag controls which GPUs are visible to the container. You can pass
all for all GPUs, device=0 for a specific GPU by index, or
'"device=0,2"' for multiple specific GPUs (the nested quoting is required
because the value contains a comma).
The NVIDIA driver on the host must be compatible with the CUDA version in the container image. The container does not include the GPU driver itself; it uses the host driver. Check compatibility at NVIDIA's CUDA Compatibility page. As a rule of thumb, driver version 535+ supports CUDA 12.x containers.
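The "driver 535+ for CUDA 12.x" rule of thumb can be turned into a quick preflight check. The sketch below is just that rule encoded as a function; it parses a version string of the kind reported by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and is not a substitute for NVIDIA's full compatibility matrix.

```python
def driver_supports_cuda12(driver_version: str) -> bool:
    """Rule-of-thumb check: NVIDIA driver major version 535 or later
    is needed to run CUDA 12.x container images on the host driver."""
    major = int(driver_version.split(".")[0])
    return major >= 535

print(driver_supports_cuda12("535.154.05"))  # True
print(driver_supports_cuda12("470.223.02"))  # False
```

Running a check like this in a deployment script fails fast with a clear message, instead of letting the container crash later with an opaque CUDA initialization error.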
4. Multi-Stage Builds for Smaller Images
ML images can easily exceed 10 GB because of compilation toolchains, development headers, and intermediate build artifacts. Multi-stage builds let you use a large build image to compile dependencies, then copy only the compiled artifacts into a smaller runtime image. This technique can reduce final image size by 50% or more.
# Stage 1: Build stage with full development tools
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv
# Create a virtual environment for clean dependency isolation
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install Python dependencies (some may compile C extensions)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Stage 2: Runtime stage with minimal footprint
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 libgomp1 \
&& rm -rf /var/lib/apt/lists/*
# Copy the virtual environment from the build stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Copy application code
WORKDIR /app
COPY src/ ./src/
COPY config/ ./config/
EXPOSE 8000
CMD ["python3", "-m", "src.serve"]
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ Build Stage │ │ Runtime Stage │
│ nvidia/cuda:...-devel │ │ nvidia/cuda:...-runtime │
│ (5.5 GB base) │ │ (3.5 GB base) │
│ │ │ │
│ + python3, pip, gcc │ │ + python3, libgomp │
│ + compiled wheels │ ───> │ + /opt/venv (from builder) │
│ + header files │ COPY │ + application code │
│ + build artifacts │ │ │
│ │ │ Final: ~4.5 GB │
│ Total: ~8 GB (discarded) │ │ (vs. ~8 GB single-stage) │
└──────────────────────────────┘ └──────────────────────────────┘
5. The .dockerignore File
When you run docker build, Docker sends the entire build context (the directory containing
the Dockerfile) to the Docker daemon. For ML projects, this directory may contain large datasets, model
checkpoints, or virtual environments that should not be included in the image. A
.dockerignore file specifies patterns to exclude from the build context, similar to
.gitignore.
# .dockerignore for ML projects
# Python artifacts
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/
# Data and models (mount these as volumes instead)
data/
datasets/
models/
checkpoints/
*.pt
*.pth
*.onnx
*.safetensors
# Development tools
.git/
.github/
.vscode/
.idea/
*.md
Makefile
docker-compose*.yml
# Environment files with secrets
.env
.env.*
# Jupyter artifacts
.ipynb_checkpoints/
*.ipynb
Forgetting a .dockerignore file is one of the most common mistakes in ML Docker projects. Without it, a COPY . . instruction will copy your entire 50 GB dataset into the image, inflating build times and image size. Always create .dockerignore before your first build.
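To build intuition for how these patterns filter the context, here is a heavily simplified matcher. Real Docker matching is richer (it supports `**`, `!` negations, and per-path-element wildcards, and `*` does not cross `/` separators), so treat this only as a mental model.

```python
from fnmatch import fnmatch

def is_ignored(path: str, patterns: list[str]) -> bool:
    """Simplified .dockerignore-style check: a pattern ending in '/'
    excludes everything under that directory; other patterns are
    matched with shell-style wildcards."""
    for pat in patterns:
        pat = pat.rstrip("/")
        if fnmatch(path, pat) or path.startswith(pat + "/"):
            return True
    return False

patterns = ["data/", "*.pt", ".env", "__pycache__/"]
print(is_ignored("data/train.csv", patterns))  # True  (under data/)
print(is_ignored("checkpoint.pt", patterns))   # True  (matches *.pt)
print(is_ignored("src/serve.py", patterns))    # False (sent to the daemon)
```

A dry run like this against your project tree is a cheap way to verify that the multi-gigabyte directories really are excluded before you kick off a build.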
6. Optimizing pip Install for Caching
Python dependency installation is often the slowest step in an ML Docker build. PyTorch alone can take several minutes to download and install. Two techniques dramatically speed up repeated builds.
First, copy requirements.txt separately from the rest of your code. This ensures that the
pip install layer is cached as long as your dependencies do not change. Second, use Docker
BuildKit's cache mount feature to persist the pip download cache across builds, even when requirements
change.
# syntax=docker/dockerfile:1
# The syntax directive above must be the very first line of the Dockerfile,
# before any other comment or instruction, or it is ignored.
# Enable BuildKit cache for pip downloads:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
# Mount pip cache as a BuildKit cache volume
# Downloaded wheels persist across builds, speeding up installs
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
COPY src/ ./src/
CMD ["python", "-m", "src.serve"]
To use BuildKit cache mounts, enable BuildKit by setting the environment variable
DOCKER_BUILDKIT=1 or by using docker buildx build instead of
docker build. On Docker Engine 23.0 and later, BuildKit is the default builder
and no extra configuration is needed.
7. Environment Variables and Configuration
ML containers often need configuration values such as model paths, API keys, batch sizes, and feature
flags. Docker provides two mechanisms: ENV in the Dockerfile (baked into the image) and
-e or --env-file at runtime (set per container).
# Baked-in defaults (can be overridden at runtime)
ENV MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
MAX_MODEL_LEN=4096 \
TENSOR_PARALLEL_SIZE=1 \
LOG_LEVEL=info
# Override at runtime with -e flags
docker run --gpus all \
-e MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.3" \
-e TENSOR_PARALLEL_SIZE=2 \
-e HF_TOKEN=hf_abc123 \
mymodel:v1
# Or use an environment file
docker run --gpus all --env-file .env.production mymodel:v1
Never bake API keys or tokens into a Dockerfile with ENV. Anyone who pulls your image can read those values with docker inspect. Instead, pass secrets at runtime using -e, --env-file, or Docker secrets (for Swarm and Kubernetes). Store your .env file outside the build context and list it in .dockerignore.
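On the application side, the container's entrypoint typically reads these variables at startup, falling back to the same defaults the Dockerfile bakes in with ENV. A minimal sketch, using the variable names from the examples above:

```python
import os

def load_config(env=os.environ):
    """Read service configuration from environment variables, with the
    Dockerfile's ENV values as defaults. Numeric values arrive as
    strings and must be converted explicitly."""
    return {
        "model_name": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "max_model_len": int(env.get("MAX_MODEL_LEN", "4096")),
        "tensor_parallel_size": int(env.get("TENSOR_PARALLEL_SIZE", "1")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }

# Simulate `docker run -e TENSOR_PARALLEL_SIZE=2 ...`
cfg = load_config({"TENSOR_PARALLEL_SIZE": "2"})
print(cfg["tensor_parallel_size"], cfg["max_model_len"])  # 2 4096
```

Keeping all reads in one function like this gives a single place to validate configuration and fail loudly on bad values before the model loads.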
8. Building and Tagging Images
The docker build command reads a Dockerfile and produces an image. Tagging your images with
meaningful version identifiers is essential for tracking which model version, code commit, or configuration
is deployed in each environment.
# Build and tag with a version number
docker build -t llm-server:1.0.0 .
# Tag with the git commit hash for traceability
docker build -t llm-server:$(git rev-parse --short HEAD) .
# Tag with multiple labels
docker build \
-t myregistry.azurecr.io/llm-server:1.0.0 \
-t myregistry.azurecr.io/llm-server:latest \
.
# Push to a container registry
docker push myregistry.azurecr.io/llm-server:1.0.0
Summary
Writing effective Dockerfiles for ML workloads requires attention to layer ordering (stable dependencies
first, volatile code last), base image selection (runtime vs. devel, slim vs. full), multi-stage builds
for smaller images, and proper handling of GPU passthrough via the NVIDIA Container Toolkit. A well-crafted
.dockerignore prevents accidental inclusion of large datasets, and BuildKit cache mounts
speed up pip installs across builds. In the next section, we explore Docker Compose for orchestrating
multi-container AI applications.