Containers, Kubernetes & Deployment

Chapter opener illustration: Containers.

"Kubernetes does not understand GPUs by default; you teach it."

Deploy, Container-Disciplined AI Agent
Looking Back

Chapter 64 orchestrated workflows; this chapter packages the workers. It covers containers, Kubernetes, KServe, KubeAI, GPU operators, node autoscalers, and the production-grade serving stack that GPU-aware Kubernetes has become.

Big Picture

Production LLM serving runs on containers, and at scale, on Kubernetes. This chapter covers Docker fundamentals, writing Dockerfiles for ML workloads, Docker Compose for multi-service apps, containerizing inference servers (vLLM, TGI, Triton), and Kubernetes-native patterns for GPU scheduling and serving.

Chapter Overview

Containers and Kubernetes are the universal packaging and scheduling layer for production LLM stacks. This chapter walks the full progression: Docker fundamentals (images, containers, volumes), Dockerfile patterns for ML and LLM projects, Docker Compose for multi-service AI applications, containerizing LLM inference servers (vLLM, TGI, Ollama), and Kubernetes-native LLM operations (GPU scheduling, model serving, autoscaling, GPU partitioning).
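As a small preview of the GPU scheduling material, the sketch below builds a Pod manifest that requests a GPU through the `nvidia.com/gpu` extended resource. It assumes the cluster already runs the NVIDIA device plugin (for example via the GPU Operator), since that is what advertises the resource to the scheduler. The Pod name, image tag, and model identifier are hypothetical placeholders, and the helper `gpu_pod_manifest` is introduced here purely for illustration.

```python
import json


def gpu_pod_manifest(name: str, image: str, model: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "inference",
                    "image": image,
                    # Arguments passed to the server entrypoint (assumed vLLM-style).
                    "args": ["--model", model],
                    "ports": [{"containerPort": 8000}],
                    # GPUs are extended resources: they are requested under
                    # `limits`, in whole units, and only exist on nodes where
                    # the device plugin has registered them.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


# Hypothetical values, for illustration only.
manifest = gpu_pod_manifest(
    name="llm-demo",
    image="vllm/vllm-openai:latest",
    model="example-org/example-model",
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -`. On a cluster without the device plugin, a Pod like this would stay Pending because no node advertises `nvidia.com/gpu`, which is the practical meaning of the chapter's opening quote.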

Container and Kubernetes tooling has stabilized enough that even early-stage products should ship on it. This chapter is the practitioner's syllabus for getting there.

Note: Learning Objectives

Sections in This Chapter

Prerequisites

What's Next?

This chapter begins with Section 65.1: Docker Fundamentals: Images, Containers, and Volumes. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

Cluster Orchestration for ML

Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes." Communications of the ACM. ACM DL. The architectural lineage paper that explains why Kubernetes makes the design decisions it does; prerequisite reading for any cluster-level LLM deployment.
Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., & Lin, W. (2024). "Llumnix: Dynamic Scheduling for Large Language Model Serving." OSDI 2024. arXiv:2406.03243. Cross-instance request migration and scheduling for LLM workloads; the cluster-scheduler counterpart of the per-pod Kubernetes concepts in this chapter.

GPU Scheduling & Inference Serving

Crankshaw, D., Sela, G.-E., Mo, X., Zumar, C., Stoica, I., Gonzalez, J., et al. (2020). "InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines." SoCC. ACM DL. Defines the autoscaling and latency-budget patterns that LLM-serving Kubernetes operators (KServe, Ray Serve) implement today.
NVIDIA (2024). "NVIDIA NIM: Inference Microservices for Generative AI." NVIDIA Developer Documentation. NVIDIA NIM Docs. The reference container packaging for production LLM serving on Kubernetes; pairs with TensorRT-LLM/Triton and defines the de-facto inference-microservice contract.