Section 65.5a: Autoscaling, Networking & Storage for K8s LLMs

Autoscaling was easy. Networking was tedious. Storage was a debugging nightmare. Putting all three together in one Kubernetes cluster was the most expensive way I have ever rediscovered the value of monoliths.
Deploy, Cluster-Weary AI Agent

Big Picture

This section continues Section 65.5, which covered the scheduling and serving half of Kubernetes-native LLM operations: GPU scheduling for training, Kubeflow training operators, production LLM serving, and GPU sharing/isolation. Here we cover the elastic-capacity and substrate half: autoscaling LLM services with HPA and KEDA, plus the networking and storage considerations that decide whether a serving cluster scales gracefully or melts under load. For LLM serving in particular, Kubernetes is the substrate most teams reach for when a single GPU server cannot keep up: autoscaling, networking, and storage decisions made here directly determine the cost and latency of any agent or chatbot fleet.

Prerequisites

This section continues from Section 65.5, which introduced Kubernetes basics for LLM workloads: pods, deployments, services, and the basic GPU scheduling primitives. Familiarity with HPA (horizontal pod autoscaler) concepts, container networking, and persistent volumes from the earlier sections of Chapter 65 is assumed.

Fun Fact: The GPU That Forgot to Idle

One of the most common bills-of-shame in LLM ops is a Kubernetes deployment scaled to 'min 1, max 10' that quietly stays at 10 forever because the HPA's CPU-based metric never triggers a scale-down on GPU pods (where the bottleneck is VRAM, not CPU). Teams have paid five-figure monthly bills for ghost pods serving zero traffic. The fix is to scale on GPU utilization or queue depth, not CPU. The lesson: autoscaling defaults were written for CPU workloads in 2014, and LLMs are not CPU workloads.

65.5.5 Autoscaling LLM Services

Standard Kubernetes HPA (Horizontal Pod Autoscaler) scales based on CPU or memory utilization. For LLM serving, these metrics are poor proxies for load because GPU utilization and request queue depth are the actual bottlenecks. Effective LLM autoscaling requires custom metrics and strategies tailored to inference workload patterns.

Under the Hood: Horizontal Pod Autoscaler control loop

The HPA runs a periodic control loop. Each cycle it reads the chosen metric across current pods and computes a desired replica count with the ratio rule $\text{desired} = \lceil \text{current} \times (\text{currentMetric} / \text{targetMetric}) \rceil$. If queue depth averages 10 against a target of 5, it doubles replicas; if it falls to 2.5, it halves them. A tolerance band (default 10 percent) suppresses tiny adjustments, and per-direction stabilization windows damp oscillation by holding the highest recent recommendation before scaling down. For LLM serving the metric is swapped from CPU to a load-faithful signal like vLLM queue depth or TTFT, but the proportional formula driving the replica count is unchanged.

65.5.5.1 Custom Metrics for LLM Autoscaling

The most useful autoscaling signals for LLM inference workloads are:

**Table 65.5.2a**: Autoscaling metrics for LLM serving
Metric	Source	Target	When to Use
Queue depth	vLLM /metrics	< 10 pending requests	Bursty traffic patterns
TTFT P95	Prometheus	< 500ms	Latency-sensitive applications
GPU KV cache usage	vLLM /metrics	< 80% utilization	Long-context workloads
Requests per second	Istio/Envoy	Throughput target	Steady, predictable traffic
Active sequences	vLLM /metrics	Per-replica concurrency limit	Streaming responses

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llama-70b-hpa
namespace: llm-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llama-70b-vllm
minReplicas: 2
maxReplicas: 12
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Wait 60s before scaling up
policies:
- type: Pods
value: 2 # Add at most 2 pods at a time
periodSeconds: 120
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Pods
value: 1 # Remove at most 1 pod at a time
periodSeconds: 300 # Every 5 minutes
metrics:
# Primary metric: request queue depth from vLLM
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "5" # Scale up if >5 waiting requests
# Secondary metric: TTFT P95 latency
- type: Pods
pods:
metric:
name: vllm_time_to_first_token_p95
target:
type: AverageValue
averageValue: "500" # Scale up if TTFT P95 > 500ms

Code Fragment 65.5.6: An HPA that scales a Llama 70B deployment between 2 and 12 replicas using vLLM-specific metrics. The primary signal is vllm_num_requests_waiting (queue depth), with TTFT P95 as a secondary trigger. Asymmetric stabilization windows (60s up, 300s down) prevent thrashing during traffic fluctuations.

65.5.5.2 Scale-to-Zero with Warm Pools

For LLM services that serve intermittent traffic (internal tools, development environments, low-traffic endpoints), scaling to zero when idle saves significant GPU cost. However, LLM cold starts are slow: loading a 70B model from disk to GPU takes 2-5 minutes, which is unacceptable for user-facing services.

Warm pool strategies mitigate cold starts by maintaining a pool of pre-loaded model replicas that are suspended (GPU memory retained but not actively serving) rather than fully terminated. When traffic arrives, a warm replica can begin serving within seconds rather than minutes. KServe supports scale-to-zero with configurable grace periods:

Table 65.5.4a: KServe scale-to-zero configuration. These four keys live on the InferenceService.spec.predictor object; together they let GPU resources release after 10 minutes of inactivity and recover within seconds when traffic returns.

Key	Value	Behavior
`predictor.minReplicas`	`0`	Permit scale-to-zero; required for cost release on idle
`predictor.scaleTarget`	`1`	Each replica handles one concurrent inference at a time
`autoscaling.knative.dev/scale-to-zero-pod-retention-period`	`10m`	Grace period before releasing the last replica
`autoscaling.knative.dev/window`	`120s`	Sliding window used to evaluate traffic for scale decisions

65.5.5.3 Cold-Start Mitigation Strategies

For services that cannot tolerate cold-start latency, several strategies reduce or eliminate startup time:

Model preloading via init containers: An init container downloads the model weights to a local SSD (hostPath or emptyDir) before the serving container starts, parallelizing model download with pod scheduling.
Shared model cache (ReadWriteMany PVC): Store model weights on a shared file system (Lustre, GCS FUSE, EFS) so that new replicas read from cache rather than downloading from the model registry.
Predictive scaling: Use historical traffic patterns to pre-scale before expected demand spikes (e.g., scale up at 8 AM before the workday begins).
CUDA memory snapshots: Experimental approaches that snapshot the GPU state of a loaded model and restore it on a new GPU, reducing cold start to seconds.

Fun Fact

Ingress and Service in front of three GPU pods, all sharing a Persistent Volume Claim for model weights — A Kubernetes LLM serving stack: ingress -> service -> autoscaled GPU pods backed by a shared persistent volume for model weights.

Google's internal LLM serving infrastructure reportedly keeps "standby" TPU pods with models pre-loaded in memory, consuming significant resources even when idle. The cost of this approach is justified by the even larger cost of lost revenue when a user-facing LLM service takes 3 minutes to respond to the first request after a scale-up event. The engineering trade-off between idle resource cost and cold-start penalty is one of the defining tensions in production LLM serving economics.

65.5.6 Networking and Storage Considerations

Distributed LLM training on Kubernetes has unique networking requirements. NCCL (NVIDIA Collective Communications Library) needs high-bandwidth, low-latency communication between GPU pods, typically over InfiniBand or RoCE (RDMA over Converged Ethernet). Standard Kubernetes networking (overlay networks, kube-proxy) adds unacceptable overhead for collective operations.

65.5.6.1 Network Configuration for NCCL

Host networking: The simplest approach is to use hostNetwork: true on training pods, bypassing the Kubernetes overlay network entirely. This gives pods direct access to the InfiniBand or RDMA interfaces.
SR-IOV device plugin: For clusters that require network isolation, the SR-IOV device plugin exposes virtual functions (VFs) from the InfiniBand HCA as Kubernetes resources that pods can request.
RDMA shared device plugin: Allows multiple pods on the same node to share an RDMA interface without SR-IOV, useful for development and smaller-scale training.

65.5.6.2 Storage for Checkpoints and Model Artifacts

Distributed checkpointing (covered in Section 6.8) requires a parallel file system or object storage that all training pods can write to simultaneously. Common options on Kubernetes include:

Lustre or GPFS (via CSI driver): High-performance parallel file systems that provide the bandwidth needed for large checkpoint writes. Best for clusters with dedicated storage infrastructure.
GCS FUSE / S3 FUSE: Mount cloud object storage as a POSIX file system. Simpler to set up but lower write throughput; suitable for async checkpoint staging where the GPU-to-CPU copy is the blocking step.
Local NVMe (hostPath/emptyDir): Fastest for temporary storage (local checkpoint staging, model cache) but not shared across nodes and not persistent across pod restarts.

Warning: Common Misconception

Standard Kubernetes resource requests and limits are not sufficient for GPU workloads. Unlike CPU and memory, GPUs cannot be shared across pods at fractional granularity without hardware support (MIG or time-slicing). A pod requesting "1 GPU" gets an entire physical GPU, and if your scheduling does not account for GPU topology (NVLink, PCIe affinity), multi-GPU training jobs may experience dramatically degraded communication performance. Always use topology-aware scheduling and gang scheduling for distributed training.

Research Frontier

The Kubernetes ecosystem for LLM workloads is evolving rapidly. LeaderWorkerSet is a new Kubernetes API (alpha in 2024) designed specifically for multi-node ML training, providing native support for gang scheduling and topology-aware placement without external schedulers. Dynamic Resource Allocation (DRA) is a Kubernetes feature that will enable more sophisticated GPU scheduling, including MIG partition management, time-slicing policies, and GPU health monitoring at the scheduler level. Multi-cluster training research explores running a single training job across multiple Kubernetes clusters in different regions, using WAN-optimized collective operations to overcome cross-region bandwidth limitations.

Lab: Deploy and Load-Test an LLM

Duration: ~60 minutes Intermediate

Objective

Serve a language model using vLLM's OpenAI-compatible API server, then measure latency and throughput under load using Locust. You will first build a manual benchmarking harness (the "right tool" baseline), then use Locust for systematic load testing with configurable concurrency profiles.

What You'll Practice

Deploying an LLM with vLLM's high-performance serving engine
Measuring time-to-first-token (TTFT) and end-to-end latency
Writing Locust load test scripts for LLM inference endpoints
Analyzing throughput, p50/p95/p99 latencies, and error rates under concurrent load

Setup

Install vLLM and Locust. A GPU with at least 8 GB VRAM is required for serving. If no GPU is available, you can substitute any OpenAI-compatible endpoint.

Steps

Step 1: Start the vLLM server

Launch vLLM with a small model. The server exposes an OpenAI-compatible /v1/completions endpoint.

# Start vLLM in a separate terminal
python -m vllm.entrypoints.openai.api_server \
 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
 --port 8000 \
 --max-model-len 2048

Code Fragment 44.9.L2: Launch vLLM serving TinyLlama. For larger models, add --tensor-parallel-size N to shard across multiple GPUs.

Hint

Wait until you see "Uvicorn running on http://0.0.0.0:8000" before proceeding. If you hit out-of-memory errors, reduce --max-model-len or switch to a smaller model like facebook/opt-125m.

Step 2: Manual latency benchmark (from scratch)

Before using a load-testing framework, build a simple benchmark script to measure baseline latency. This reveals exactly what each timing metric captures.

import requests
import time
import statistics
def measure_request(prompt, max_tokens=50):
    """Send a single completion request and measure timing."""
    start = time.perf_counter()
    # Production code should wrap this in try/except for network errors.
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        },
    )
    end = time.perf_counter()
    data = response.json()
    latency_ms = (end - start) * 1000
    tokens_generated = data["usage"]["completion_tokens"]
    tokens_per_sec = tokens_generated / (latency_ms / 1000) if latency_ms > 0 else 0
    return {
        "latency_ms": latency_ms,
        "tokens": tokens_generated,
        "tokens_per_sec": tokens_per_sec,
    }
# Run 10 sequential requests for baseline measurement
prompt = "Explain the concept of attention in neural networks in three sentences."
results = [measure_request(prompt) for _ in range(10)]
latencies = [result["latency_ms"] for result in results]
throughputs = [result["tokens_per_sec"] for result in results]
print("Manual Benchmark (10 sequential requests):")
print(f" Median latency: {statistics.median(latencies):.1f} ms")
print(f" P95 latency: {sorted(latencies)[8]:.1f} ms")
print(f" Mean throughput: {statistics.mean(throughputs):.1f} tokens/sec")
print(f" Tokens per req: {results[0]['tokens']}")

Output: Manual Benchmark (10 sequential requests): Median latency: 412.6 ms P95 latency: 487.3 ms Mean throughput: 121.4 tokens/sec Tokens per req: 50

Code Fragment 65.5.8a: A measure_request() helper that fires one /v1/completions request, records wall-clock latency, and returns (time_to_first_token, total_latency, token_count). It is the lowest-level building block the rest of the load tests use to compute p50/p95/p99 distributions.

Step 3: Write a Locust load test

Create a Locust test file that simulates concurrent users hitting the LLM endpoint. Locust provides real-time statistics, percentile breakdowns, and failure tracking.

# Save this as locustfile.py
from locust import HttpUser, task, between
import json
class LLMUser(HttpUser):
    """Simulates a user sending completion requests to the LLM."""
    wait_time = between(0.5, 2.0)
    host = "http://localhost:8000"
    prompts = [
        "Summarize the key ideas of transformer architecture.",
        "What are the benefits of quantization for LLM deployment?",
        "Explain continuous batching in three sentences.",
        "Describe the difference between prefill and decode phases.",
        "What is KV cache and why does it matter for inference?",
        ]
    @task
    def generate_completion(self):
        import random
        prompt = random.choice(self.prompts)
        payload = {
            "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "prompt": prompt,
            "max_tokens": 100,
            "temperature": 0.7,
            }
        with self.client.post(
            "/v1/completions",
            json=payload,
            catch_response=True,
            ) as response:
            if response.status_code == 200:
                data = response.json()
                tokens = data["usage"]["completion_tokens"]
                response.success()
            else:
                response.failure(f"Status {response.status_code}")

Output: Name Request Count Median Response Time 95% Response Time 99% Response Time Average Response Time Requests/s POST /v1/completions 287 410.0 890.0 1200.0 462.3 4.8 Aggregated 287 410.0 890.0 1200.0 462.3 4.8 Manual baseline median: 234.7 ms Load test median: 410.0 ms Latency increase: 1.7x

Code Fragment 65.5.9a: A Locust locustfile.py defining LLMUser with one @task that submits varied prompts using wait_time=between(1, 3). Locust spawns N such virtual users and reports throughput, latency percentiles, and failure rate during the run.

Step 4: Run the load test and analyze results

Execute the Locust test with increasing concurrency levels to find the serving capacity and latency curve of your deployment.

# Run Locust in headless mode with 10 concurrent users
locust -f locustfile.py --headless \
 --users 10 --spawn-rate 2 --run-time 60s \
 --csv results/load_test

# For the web UI (interactive), omit --headless:
# locust -f locustfile.py

Code Fragment 65.5.10a: Invoking Locust headlessly with --users 10 --spawn-rate 2 --run-time 60s emits CSV stats files (load_test_stats.csv, _failures.csv) without the web UI. Headless mode is what you wire into CI: ramp, soak, and breakdown numbers without anyone clicking buttons.

import pandas as pd
# Analyze the Locust CSV output
stats = pd.read_csv("results/load_test_stats.csv")
print(stats[["Name", "Request Count", "Median Response Time",
    "95% Response Time", "99% Response Time",
    "Average Response Time", "Requests/s"]].to_string(index=False))
# Compare with the manual baseline
print(f"\nManual baseline median: {statistics.median(latencies):.1f} ms")
print(f"Load test median: {stats['Median Response Time'].iloc[-1]:.1f} ms")
print(f"Latency increase: "
    f"{stats['Median Response Time'].iloc[-1] / statistics.median(latencies):.1f}x")

Output: Name Request Count Median Response Time 95% Response Time 99% Response Time Average Response Time Requests/s POST /v1/completions 612 1840 3120 4250 1976.4 10.2 Manual baseline median: 412.6 ms Load test median: 1840.0 ms Latency increase: 4.5x

Code Fragment 65.5.11a: pandas reads the Locust _stats.csv and projects the four columns that matter operationally: median, p95, p99 latency, and request count. Inspecting p99 vs. p50 reveals whether the system has a long-tail problem that average-only graphs would hide.

Stretch Goals

Run the load test at 1, 5, 10, 25, and 50 concurrent users. Plot a latency-vs-concurrency curve to identify the saturation point.
Enable vLLM's streaming mode ("stream": true) and measure time-to-first-token separately from total generation time.
Add a second model replica behind a reverse proxy (nginx or HAProxy) and compare throughput with a single instance versus two load-balanced instances.

Key Takeaways

Standard Kubernetes scheduling is insufficient for distributed LLM training. Gang scheduling (Volcano) ensures all pods start simultaneously, and quota management (Kueue) provides fair resource sharing across teams.
The Kubeflow Training Operator's PyTorchJob CRD automates the coordination of multi-node distributed training, including elastic scaling and fault-tolerant restarts.
KServe provides production-grade LLM serving on Kubernetes with vLLM/TGI runtimes, canary rollouts, and custom health checks that account for model loading time.
MIG partitioning enables hardware-isolated GPU sharing for multi-model serving, while MPS provides softer time-slicing for less critical workloads.
LLM autoscaling requires custom metrics (queue depth, TTFT, KV cache utilization) rather than standard CPU/memory metrics. Scale-to-zero with warm pools balances cost savings against cold-start latency.
Distributed training on Kubernetes requires host networking or SR-IOV for NCCL performance, and parallel file systems or object storage for distributed checkpointing.

Exercises

Exercise 28.9.1: GPU scheduling design Conceptual

Your team shares a Kubernetes cluster with 128 H100 GPUs across 16 nodes. Three teams need: Team A needs 64 GPUs for a pretraining job, Team B needs 32 GPUs for fine-tuning, and Team C needs 16 GPUs for serving. Design a Kueue configuration with ClusterQueues and quotas that ensures fair sharing while allowing Team A to borrow unused GPUs from other teams.

Answer Sketch

Create three ClusterQueues: training-a (nominalQuota=64, borrowingLimit=32), training-b (nominalQuota=32, borrowingLimit=16), serving-c (nominalQuota=16, lendingLimit=16, borrowingLimit=0). Team A can borrow up to 32 GPUs from B and C when they are idle. Team C's serving queue has no borrowing (stable allocation for production). Preemption policy: serving-c has highest priority (never preempted), training-a and training-b have equal priority with LowerPriority preemption within their queues.

Exercise 28.9.2: KServe deployment Coding

Write a complete KServe InferenceService YAML for serving Mistral-7B-Instruct using vLLM with tensor parallelism across 2 GPUs. Include appropriate health checks, autoscaling based on queue depth, and a canary configuration for rolling out a new model version.

Answer Sketch

InferenceService with predictor container running vllm/vllm-openai, args: --model=mistralai/Mistral-7B-Instruct-v0.3, tensor-parallel-size=2, max-model-len=32768. Resources: 2 nvidia.com/gpu. Readiness probe on /health with initialDelaySeconds=90. HPA targeting vllm_num_requests_waiting with averageValue of 3. Canary annotation at 10%. MinReplicas=1, maxReplicas=6.

Exercise 28.9.3: MIG partitioning Analysis

You need to serve three models on a single H100: a 7B chat model (primary, latency-sensitive), a 1.5B embedding model, and a 0.5B classifier. Design a MIG partition scheme. Calculate the memory requirements for each model and verify they fit within the chosen MIG profiles. What happens if the chat model needs a longer context window?

Answer Sketch

7B FP16 needs ~14 GB for weights plus KV cache. Use 3g.40gb (40 GB, ample for 8K context). 1.5B FP16 needs ~3 GB. Use 2g.20gb. 0.5B needs ~1 GB. Use 1g.10gb. Remaining partition: 1g.10gb unused (spare). If the chat model needs 32K context, KV cache grows to ~8 GB additional, total ~22 GB: still fits in 3g.40gb. At 128K context, KV cache grows to ~32 GB: need to upgrade to 4g.40gb and reconfigure other partitions.

Exercise 28.9.4: Autoscaling strategy Discussion

Compare three autoscaling strategies for an LLM serving endpoint with bursty traffic: (a) HPA based on queue depth, (b) HPA based on TTFT P95, (c) Knative concurrency-based autoscaling. For each, describe the advantages, disadvantages, and failure modes. Which would you choose for a user-facing chatbot with a 2-second TTFT SLA?

Answer Sketch

(a) Queue depth: fast signal, direct indicator of overload, but does not capture latency directly. (b) TTFT P95: directly measures user experience, but lagging indicator (by the time P95 degrades, users are already affected). (c) Knative concurrency: proactive, limits requests per pod, but does not account for variable request complexity. For a 2-second TTFT SLA: use (a) as primary with aggressive thresholds (scale at queue > 3), supplemented by (b) as a safety net. Knative concurrency is a good default but harder to tune for LLM workloads where request duration varies 10x based on output length.

What Comes Next

In this section we covered gpu scheduling for llm training, kubeflow training operator, and related topics. This concludes the current chapter. Return to the chapter overview to review the material or explore related chapters.

Further Reading

Kubernetes GPU Scheduling

Kubernetes SIG Scheduling (2024). Kueue: Kubernetes-native Job Queueing. Official documentation for Kueue, covering ClusterQueues, ResourceFlavors, fair-sharing policies, and integration with various job types including PyTorchJobs and batch workloads.

Volcano Project (2024). Volcano: A Cloud Native Batch Scheduling System for Kubernetes. CNCF project providing gang scheduling, fair-share policies, and topology-aware placement for batch ML workloads on Kubernetes. Widely used for distributed training jobs.

Kubeflow and Training Operators

Kubeflow Project (2024). Kubeflow Training Operator Documentation. Official documentation for the Kubeflow Training Operator, covering PyTorchJob, elastic training, and integration with distributed training frameworks.

Model Serving

KServe Project (2024). KServe: Kubernetes Serverless Inference. Documentation for KServe, the Kubernetes-native model serving framework. Covers InferenceService CRDs, autoscaling, canary rollouts, and integration with vLLM and TGI runtimes.

GPU Management

NVIDIA (2024). NVIDIA GPU Operator Documentation. Official documentation for the GPU Operator, covering driver management, MIG configuration, MPS, DCGM integration, and the device plugin for Kubernetes GPU resource management.

MLCommons (2024). MLPerf Benchmarks. Industry-standard benchmarks for ML training and inference performance, including LLM-specific workloads. Useful for validating that Kubernetes-based deployments achieve expected hardware utilization.