Building Conversational AI with LLMs and Agents
Appendix U: Docker and Containers for LLM Deployment

Docker Compose for Multi-Service AI Applications

Big Picture

Real-world AI applications rarely consist of a single container. A typical RAG system might include an LLM inference server, a vector database, a REST API gateway, a Redis cache, and a PostgreSQL database for user sessions. Docker Compose lets you define, configure, and launch all of these services with a single docker compose up command, using a declarative YAML file that describes the entire application stack.

1. Why Docker Compose?

In Section U.1, we launched individual containers with long docker run commands that included port mappings, volume mounts, network assignments, and environment variables. Managing five or six such commands manually is error-prone and tedious. Docker Compose replaces these ad-hoc commands with a single configuration file (docker-compose.yml or compose.yml) that defines all services, their relationships, and their configurations in one place.

Compose provides several capabilities beyond simple container launching. It creates an isolated network for the application stack automatically, manages service startup order with dependency declarations, supports health checks to ensure services are ready before dependents start, and enables scaling individual services with a single flag.

2. Compose File Structure

A Compose file uses YAML syntax and organizes configuration into top-level keys: services (the containers), volumes (persistent storage), and networks (communication channels). The following example shows the structure with a minimal two-service stack.

# compose.yml (or docker-compose.yml)
# Defines a simple API + database stack

services:
  api:
    build: ./api                    # Build from local Dockerfile
    ports:
      - "8000:8000"                 # Map host port to container port
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/appdb
    depends_on:
      db:
        condition: service_healthy  # Wait for DB health check
    volumes:
      - ./api/src:/app/src          # Mount source code for development

  db:
    image: postgres:16-alpine       # Use pre-built image
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: appdb
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user"]
      interval: 5s
      timeout: 3s
      retries: 5

volumes:
  pgdata:                           # Named volume for database persistence

Key Insight

Docker Compose automatically creates a bridge network for the stack. Services can reach each other using their service name as the hostname. In the example above, the API service connects to PostgreSQL at db:5432, where db is resolved by Docker's internal DNS. No manual network configuration is required.
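Because service names double as hostnames, application code can derive its connection settings directly from a URL like the DATABASE_URL above. A minimal sketch (Python standard library only; the function name is illustrative):

```python
from urllib.parse import urlparse

def parse_service_url(url: str) -> dict:
    """Split a connection URL into its components.

    Inside the Compose network, the hostname ("db" here) is the
    service name, resolved by Docker's internal DNS.
    """
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,   # service name, e.g. "db"
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }

info = parse_service_url("postgresql://user:pass@db:5432/appdb")
print(info["host"], info["port"])  # db 5432
```

The same URL works unchanged on any machine that runs the stack, because "db" is resolved inside the Compose network rather than in the host's DNS.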

3. A Complete RAG Application Stack

Let us build a realistic Compose file for a RAG (Retrieval-Augmented Generation) application. This stack includes an LLM inference server, a vector database for document embeddings, a REST API that orchestrates queries, a Redis cache for session management, and a PostgreSQL database for user data.

# compose.yml: RAG Application Stack
services:
  # LLM inference server (GPU-enabled)
  llm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --max-model-len 4096
      --gpu-memory-utilization 0.90
    ports:
      - "8001:8000"
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s           # LLM loading takes time

  # Vector database for document retrieval
  chromadb:
    image: chromadb/chroma:0.5.0
    ports:
      - "8002:8000"
    volumes:
      - chroma-data:/chroma/chroma
    environment:
      - ANONYMIZED_TELEMETRY=false

  # Application API server
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - LLM_BASE_URL=http://llm:8000/v1
      - CHROMA_HOST=chromadb
      - CHROMA_PORT=8000
      - REDIS_URL=redis://redis:6379/0
      - DATABASE_URL=postgresql://raguser:ragpass@postgres:5432/ragdb
    depends_on:
      llm:
        condition: service_healthy
      chromadb:
        condition: service_started
      redis:
        condition: service_healthy
      postgres:
        condition: service_healthy

  # Session cache
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  # User and conversation database
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: raguser
      POSTGRES_PASSWORD: ragpass
      POSTGRES_DB: ragdb
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U raguser"]
      interval: 5s
      timeout: 3s
      retries: 5

volumes:
  hf-cache:
  chroma-data:
  redis-data:
  pgdata:

┌──────────────────────────────────────────────────────────────┐
│                    Docker Compose Network                    │
│                                                              │
│  ┌──────────┐    ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│  │   API    │───>│   LLM    │   │  Redis   │   │ Postgres │  │
│  │  :8000   │    │  (vLLM)  │   │  :6379   │   │  :5432   │  │
│  └────┬─────┘    │  :8000   │   └──────────┘   └──────────┘  │
│       │          └──────────┘                                │
│       │          ┌──────────┐                                │
│       └─────────>│ ChromaDB │                                │
│                  │  :8000   │                                │
│                  └──────────┘                                │
└──────────────────────────────────────────────────────────────┘

Figure U.3.1: Architecture of a RAG application stack in Docker Compose. The API service communicates with the LLM server, vector database, cache, and relational database over Docker's internal network. Each service is isolated in its own container.
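Inside the api service, the orchestration step that sits between ChromaDB retrieval and the LLM call (reached at http://llm:8000/v1) can be sketched as a pure function that packs retrieved chunks into a prompt. This is an illustrative sketch, not code from a specific framework; the function name and prompt wording are assumptions:

```python
def build_rag_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Pack retrieved chunks into a context block, stopping before the
    character budget is exceeded, then append the user's question.

    Chunks are assumed to be ordered by relevance, so truncation
    drops the least relevant material first.
    """
    context, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break                      # budget exhausted: stop adding chunks
        context.append(chunk)
        used += len(chunk)
    return (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context) +
        f"\n\nQuestion: {question}"
    )
```

The character budget is a crude stand-in for the token limit configured with --max-model-len; a production system would count tokens with the model's tokenizer instead.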

4. Essential Compose Commands

Docker Compose provides a set of subcommands for managing the application lifecycle. The following commands cover the most common operations during development and deployment.

# Start all services in the background
docker compose up -d

# Start and rebuild images if Dockerfiles changed
docker compose up -d --build

# View logs from all services (follow mode)
docker compose logs -f

# View logs from a specific service
docker compose logs -f api

# Stop all services (preserves volumes)
docker compose down

# Stop and remove volumes (WARNING: deletes all data)
docker compose down -v

# List running services and their status
docker compose ps

# Execute a command inside a running service
docker compose exec api bash

# Scale a specific service to multiple instances
docker compose up -d --scale api=3

Tip

During development, use docker compose up without -d to see all service logs interleaved in your terminal. Press Ctrl+C to stop everything. For production, always use -d (detached mode) and monitor with docker compose logs -f in a separate terminal.

5. Health Checks and Dependency Management

The depends_on directive in Compose controls startup order, but by default it only waits for the container to start, not for the service inside it to be ready. This is a critical distinction for ML applications where an LLM server may take two minutes to load model weights. The condition: service_healthy option makes Compose wait for a service's health check to pass before starting its dependents.

Health checks are defined per service and specify a command that returns exit code 0 when the service is ready. The start_period parameter is especially important for LLM servers, as it tells Docker to ignore health check failures during the initial loading phase.

  llm:
    image: vllm/vllm-openai:latest
    healthcheck:
      # Probe the health endpoint
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s           # Check every 30 seconds
      timeout: 10s            # Fail if no response in 10 seconds
      retries: 3              # Mark unhealthy after 3 consecutive failures
      start_period: 180s      # Allow 3 minutes for model loading

Warning

Without condition: service_healthy, your API container may start and immediately crash because the LLM server is still loading model weights. Always pair GPU-intensive services with generous start_period values and use health check conditions on dependent services.
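As a belt-and-braces measure, application code can also retry at startup instead of relying solely on Compose ordering, since a dependency may restart after the stack is already up. A minimal sketch of such a client-side readiness loop (the probe is any callable that returns True once the dependency answers):

```python
import time

def wait_until_ready(probe, retries: int = 10, interval: float = 3.0) -> bool:
    """Poll `probe` until it returns True or the retries run out.

    Mirrors Docker's healthcheck loop on the client side: exceptions
    (connection refused, timeouts) are treated as "not ready yet".
    """
    for _ in range(retries):
        try:
            if probe():
                return True
        except Exception:
            pass
        time.sleep(interval)
    return False
```

In the API container this might be called as wait_until_ready(lambda: requests.get("http://llm:8000/v1/models").ok) before accepting traffic (assuming the requests library and an OpenAI-compatible server).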

6. Environment Files and Configuration Management

Hardcoding configuration values in compose.yml makes the file difficult to reuse across environments (development, staging, production). Docker Compose supports .env files that supply variable values referenced in the Compose file with ${VARIABLE} syntax.

# .env file (not committed to version control)
LLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
LLM_MAX_LEN=4096
LLM_GPU_UTIL=0.90
POSTGRES_USER=raguser
POSTGRES_PASSWORD=supersecretpassword
HF_TOKEN=hf_abc123xyz

# compose.yml referencing environment variables
services:
  llm:
    image: vllm/vllm-openai:latest
    command: >
      --model ${LLM_MODEL}
      --max-model-len ${LLM_MAX_LEN}
      --gpu-memory-utilization ${LLM_GPU_UTIL}
    environment:
      - HF_TOKEN=${HF_TOKEN}

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}

Practical Example

For multi-environment setups, maintain separate files like .env.dev, .env.staging, and .env.prod. Launch with a specific environment by passing the file explicitly: docker compose --env-file .env.prod up -d. Add all .env* files to your .gitignore to prevent accidental credential commits.
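Compose's interpolation also supports fallback defaults with the ${VAR:-default} form. The substitution can be mimicked with a few lines of Python; this is a simplified sketch for intuition (real Compose additionally handles forms like ${VAR:?error} and $$ escaping):

```python
import re

def interpolate(text: str, env: dict[str, str]) -> str:
    """Replace ${VAR} and ${VAR:-default} occurrences, roughly the
    way Compose does: a set variable wins, an unset variable falls
    back to its default, and no default yields an empty string."""
    pattern = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")
    def sub(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        return default if default is not None else ""
    return pattern.sub(sub, text)

print(interpolate("--model ${LLM_MODEL}", {"LLM_MODEL": "phi-3"}))
# --model phi-3
```

Running docker compose config prints the fully interpolated file, which is the quickest way to verify that your .env values landed where you expected.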

7. GPU Configuration in Compose

GPU access in Docker Compose requires the deploy.resources.reservations.devices configuration block. This is the Compose equivalent of the --gpus flag used with docker run.

services:
  llm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1              # Number of GPUs to allocate
              capabilities: [gpu]   # Required capability

  training:
    build: ./training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]  # Specific GPU IDs
              capabilities: [gpu]

You can allocate GPUs by count (letting Docker choose the devices) or by explicit device IDs (for deterministic placement). When several GPU services run on the same machine, pin each one to its own device IDs so that no two services contend for the same GPU's memory.
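The bookkeeping behind deterministic placement is simple enough to sketch: hand out device IDs from a fixed pool, one service at a time, and fail loudly if the machine is oversubscribed. This helper is purely illustrative (it is not part of Docker or Compose):

```python
def assign_gpus(requests: dict[str, int], available: list[str]) -> dict[str, list[str]]:
    """Deterministically assign GPU device IDs to services, mirroring
    what explicit device_ids entries achieve in a Compose file.

    requests maps service name -> number of GPUs wanted; services are
    processed in sorted order so the result is reproducible.
    """
    assignment, pool = {}, list(available)
    for service, count in sorted(requests.items()):
        if count > len(pool):
            raise RuntimeError(f"not enough GPUs left for {service}")
        assignment[service] = pool[:count]
        pool = pool[count:]
    return assignment

print(assign_gpus({"llm": 1, "training": 2}, ["0", "1", "2"]))
# {'llm': ['0'], 'training': ['1', '2']}
```

Writing the resulting IDs into each service's device_ids list guarantees the llm and training containers never land on the same card.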

8. Overrides and Profiles for Development

Docker Compose supports override files that layer additional configuration on top of the base compose.yml. This is useful for development-specific settings like source code mounts, debug ports, and relaxed security.

# compose.override.yml (automatically loaded in development)
services:
  api:
    volumes:
      - ./api/src:/app/src          # Hot-reload source code
    environment:
      - LOG_LEVEL=debug
      - RELOAD=true                 # Enable auto-reload (uvicorn)
    command: ["uvicorn", "src.main:app", "--reload", "--host", "0.0.0.0"]

  llm:
    # In development, use a smaller model for faster iteration
    command: >
      --model microsoft/Phi-3-mini-4k-instruct
      --max-model-len 2048

Compose automatically merges compose.yml with compose.override.yml if both exist. For production, use docker compose -f compose.yml -f compose.prod.yml up -d to load a production-specific override instead of the development defaults.
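The merge semantics are worth internalizing: nested mappings are combined key by key, while scalars from the later file win. A simplified sketch of the rule (real Compose appends certain lists such as ports and volumes; this version replaces them, which matches the behaviour of scalar keys like command):

```python
def merge_compose(base: dict, override: dict) -> dict:
    """Simplified version of Compose's file-merging rules: nested
    mappings merge recursively, everything else is replaced by the
    override's value."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_compose(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"services": {"api": {"image": "api:1.0",
                             "environment": {"LOG_LEVEL": "info"}}}}
dev  = {"services": {"api": {"environment": {"LOG_LEVEL": "debug"}}}}
print(merge_compose(base, dev)["services"]["api"]["environment"])
# {'LOG_LEVEL': 'debug'}
```

To see exactly what Compose will run after merging, use docker compose -f compose.yml -f compose.prod.yml config, which prints the fully resolved configuration.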

Summary

Docker Compose transforms multi-container management from a series of manual commands into a declarative configuration file. Health checks with dependency conditions ensure services start in the correct order, which is critical when LLM servers need minutes to load model weights. Environment files and override files enable clean separation between development and production configurations. In the next section, we focus specifically on containerizing LLM inference servers like vLLM, TGI, and Ollama.