Moving from a single-GPU inference setup to a production-grade serving cluster requires solving three problems: scaling (adding more GPU instances to handle growing traffic), load balancing (distributing requests intelligently across instances), and monitoring (knowing when to scale and diagnosing bottlenecks). This section covers horizontal and vertical scaling strategies, load balancer configurations, GPU utilization monitoring, tensor parallelism across multiple GPUs, throughput benchmarking, and cost optimization techniques.
1. Vertical vs. Horizontal Scaling
There are two fundamental approaches to handling more inference traffic. Vertical scaling means using a more powerful machine: more GPUs per node, faster GPUs, or more memory per GPU. This is often the right first step because it avoids the complexity of distributed systems. A single node with 4 or 8 GPUs running tensor parallelism can sustain surprisingly high throughput.
Horizontal scaling means running multiple independent inference instances behind a load balancer. Each instance holds a full copy of the model (or a tensor-parallel shard) and processes requests independently. This approach scales linearly: doubling the number of instances roughly doubles your throughput capacity.
┌──────────────────────┐
│ Load Balancer │
│ (NGINX / Envoy) │
└──────┬───┬───┬──────┘
│ │ │
┌────────┘ │ └────────┐
│ │ │
┌──────▼──────┐ ┌──▼──────────┐ ┌▼────────────┐
│ vLLM #1 │ │ vLLM #2 │ │ vLLM #3 │
│ GPU 0,1 │ │ GPU 0,1 │ │ GPU 0,1 │
│ (Node A) │ │ (Node B) │ │ (Node C) │
└─────────────┘ └─────────────┘ └─────────────┘
Vertical scaling (tensor parallelism across GPUs within a node) reduces per-request latency because the model computation is split across GPUs. Horizontal scaling (multiple independent instances) increases total throughput without reducing per-request latency. In practice, you use both: tensor parallelism within each node for latency-sensitive models, and multiple nodes for capacity.
2. Multi-GPU Tensor Parallelism
For models that are too large for a single GPU, or when you need lower latency than a single GPU can provide, tensor parallelism splits the model across multiple GPUs within a single node. Each GPU holds a shard of the model's weight matrices and processes its portion of the computation in parallel. The GPUs synchronize via NVLink or PCIe after each attention and feed-forward layer.
All three serving frameworks support tensor parallelism through a simple configuration flag.
# vLLM: tensor parallelism across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
# TGI: tensor parallelism across 4 GPUs
docker run --gpus all --shm-size 2g -p 8080:80 \
-v /data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-70B-Instruct \
--num-shard 4
# SGLang: tensor parallelism across 4 GPUs
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4
Tensor parallelism works best when the number of attention heads is evenly divisible by the tensor parallel size. For example, Llama 3.1 70B has 64 attention heads, so it shards cleanly across 1, 2, 4, or 8 GPUs. Using 3 or 5 GPUs would require padding and waste memory.
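The divisibility check can be expressed as a small helper. The head counts below are illustrative values for Llama 3.1 70B (64 query heads, 8 KV heads); verify them against the model's config.json before relying on them, and note that some frameworks also require the KV head count to shard evenly under grouped-query attention.

```python
def valid_tp_sizes(num_attention_heads: int, num_kv_heads: int, max_gpus: int = 8):
    """Return tensor-parallel sizes that divide both query and KV head counts evenly."""
    return [
        tp for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0 and num_kv_heads % tp == 0
    ]

# Illustrative head counts for Llama 3.1 70B (check config.json for your model)
print(valid_tp_sizes(num_attention_heads=64, num_kv_heads=8))  # → [1, 2, 4, 8]
```

Sizes like 3 or 5 fail both divisibility tests, which is why they would force padding and waste memory.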
3. Load Balancing Strategies
A load balancer sits in front of your inference instances and routes incoming requests. The choice of routing algorithm significantly affects tail latency and throughput. The following configurations show NGINX and Envoy setups for LLM serving.
3.1 NGINX Configuration
The following NGINX configuration uses least-connections routing, which sends each new request to the instance with the fewest active connections. This works well for LLM serving because requests have variable processing times.
# /etc/nginx/nginx.conf
upstream vllm_cluster {
least_conn; # Route to instance with fewest active connections
server node-a:8000 max_fails=3 fail_timeout=30s;
server node-b:8000 max_fails=3 fail_timeout=30s;
server node-c:8000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
# Increase timeouts for long-running generation requests
proxy_read_timeout 300s;
proxy_send_timeout 300s;
location /v1/ {
proxy_pass http://vllm_cluster;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Enable streaming (SSE)
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding on;
}
location /health {
proxy_pass http://vllm_cluster;
}
}
3.2 Routing Algorithm Comparison
Different routing algorithms are suited to different workload patterns. The table below compares the most common options for LLM serving.
| Algorithm | Behavior | Best For |
|---|---|---|
| Round Robin | Cycles through instances sequentially | Uniform request sizes; simple deployments |
| Least Connections | Routes to the instance with fewest active requests | Variable output lengths; general-purpose LLM serving |
| Least Latency | Routes to the instance with the lowest recent response time | Heterogeneous GPU types in the same cluster |
| Session Affinity | Routes all requests from the same user to the same instance | Multi-turn chat with server-side KV cache persistence |
Avoid round-robin load balancing for LLM serving. Because requests have highly variable processing times (a 10-token response vs. a 2000-token response), round-robin frequently overloads one instance while others sit idle. Least-connections routing adapts to this variance automatically.
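The effect is easy to demonstrate with a toy simulation. The sketch below assigns jobs with highly skewed durations to three instances, once round-robin and once greedily to the least-loaded instance (a simple stand-in for least-connections), then compares the busiest instance's total work. The workload mix and instance count are made up for illustration.

```python
import random

def simulate(assign, durations, num_instances=3):
    """Assign each job to an instance and return the busiest instance's total work."""
    load = [0.0] * num_instances
    for i, d in enumerate(durations):
        load[assign(i, load)] += d
    return max(load)

# Skewed request durations: mostly short responses, occasional very long ones.
random.seed(0)
durations = [random.choice([0.5, 0.5, 0.5, 20.0]) for _ in range(300)]

round_robin = simulate(lambda i, load: i % 3, durations)
least_loaded = simulate(lambda i, load: load.index(min(load)), durations)

print(f"round robin, busiest instance:   {round_robin:.1f}s of queued work")
print(f"least loaded, busiest instance:  {least_loaded:.1f}s of queued work")
```

With long requests landing on whichever instance the rotation happens to reach, round-robin's busiest instance accumulates noticeably more work than the greedy policy's.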
4. GPU Utilization Monitoring
Effective scaling requires visibility into GPU utilization, memory consumption, and queue depth. The following Python script collects GPU metrics using the pynvml library and exposes them in a format suitable for Prometheus or custom dashboards.
import pynvml
import time
import json
def collect_gpu_metrics():
"""Collect GPU metrics for all available devices."""
pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()
metrics = []
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # milliwatts to watts
metrics.append({
"gpu_index": i,
"gpu_utilization_pct": util.gpu,
"memory_used_gb": round(mem_info.used / (1024 ** 3), 2),
"memory_total_gb": round(mem_info.total / (1024 ** 3), 2),
"memory_utilization_pct": round(mem_info.used / mem_info.total * 100, 1),
"temperature_c": temp,
"power_watts": round(power, 1),
})
pynvml.nvmlShutdown()
return metrics
# Collect and display metrics every 5 seconds
while True:
for m in collect_gpu_metrics():
print(json.dumps(m))
time.sleep(5)
For production deployments, integrate these metrics with Prometheus and Grafana. vLLM and TGI both expose Prometheus-format metrics at their /metrics endpoints. The key metrics to track for scaling decisions are listed below.
| Metric | Scaling Signal |
|---|---|
| Request queue depth | Scale up when queue consistently exceeds 2x your max batch size |
| GPU memory utilization | Above 95% indicates risk of OOM under burst load |
| Time to first token (TTFT) | Rising TTFT indicates the prefill phase is becoming a bottleneck |
| Tokens per second (throughput) | Flattening throughput despite increasing requests means capacity limit |
| GPU compute utilization | Below 60% with high queue depth suggests batching is suboptimal |
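These signals can be combined into a coarse scaling rule. The sketch below encodes the thresholds from the table; the function name, the action labels, and the scale-down cutoff are illustrative choices, not part of any framework's API, and a real autoscaler would additionally require each condition to hold over a sustained window.

```python
def scaling_decision(
    queue_depth: int,
    max_batch_size: int,
    gpu_memory_util_pct: float,
    gpu_compute_util_pct: float,
) -> str:
    """Map the monitoring signals from the table to a coarse scaling action."""
    # Queue consistently beyond 2x the max batch size: add capacity.
    if queue_depth > 2 * max_batch_size:
        return "scale_up"
    # Near-full GPU memory risks OOM under burst load: add capacity.
    if gpu_memory_util_pct > 95:
        return "scale_up"
    # High queue but idle compute points at a batching problem,
    # not a capacity problem -- adding replicas won't help.
    if queue_depth > max_batch_size and gpu_compute_util_pct < 60:
        return "tune_batching"
    # Empty queue and mostly idle GPUs: capacity can likely be reduced.
    if queue_depth == 0 and gpu_compute_util_pct < 30:
        return "scale_down"
    return "hold"

print(scaling_decision(queue_depth=600, max_batch_size=256,
                       gpu_memory_util_pct=80, gpu_compute_util_pct=90))  # → scale_up
```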
5. Benchmarking Throughput
Before deploying to production, benchmark your serving setup to understand its capacity limits. The following script sends concurrent requests to a vLLM or TGI server and measures throughput, latency percentiles, and time to first token.
import asyncio
import aiohttp
import time
import statistics
async def benchmark_serving(
url: str,
num_requests: int = 100,
concurrency: int = 16,
max_tokens: int = 128,
):
"""Benchmark an OpenAI-compatible serving endpoint."""
semaphore = asyncio.Semaphore(concurrency)
latencies = []
first_token_times = []
total_tokens = 0
async def send_request(session, prompt):
nonlocal total_tokens
async with semaphore:
start = time.perf_counter()
first_token_time = None
async with session.post(
f"{url}/v1/completions",
json={
"model": "default",
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": 0.7,
"stream": True,
},
) as resp:
token_count = 0
async for line in resp.content:
decoded = line.decode().strip()
if decoded.startswith("data:") and "[DONE]" not in decoded:
if first_token_time is None:
first_token_time = time.perf_counter() - start
token_count += 1
elapsed = time.perf_counter() - start
latencies.append(elapsed)
if first_token_time:
first_token_times.append(first_token_time)
total_tokens += token_count
prompts = [f"Write a short paragraph about topic number {i}." for i in range(num_requests)]
start_time = time.perf_counter()
async with aiohttp.ClientSession() as session:
tasks = [send_request(session, p) for p in prompts]
await asyncio.gather(*tasks)
wall_time = time.perf_counter() - start_time
print(f"Total requests: {num_requests}")
print(f"Concurrency: {concurrency}")
print(f"Wall clock time: {wall_time:.2f}s")
print(f"Throughput: {total_tokens / wall_time:.1f} tokens/sec")
print(f"Median latency: {statistics.median(latencies):.3f}s")
print(f"P95 latency: {sorted(latencies)[int(0.95 * len(latencies))]:.3f}s")
print(f"P99 latency: {sorted(latencies)[int(0.99 * len(latencies))]:.3f}s")
if first_token_times:
print(f"Median TTFT: {statistics.median(first_token_times):.3f}s")
# Run the benchmark
asyncio.run(benchmark_serving("http://localhost:8000", num_requests=200, concurrency=32))
6. Cost Optimization Strategies
GPU inference is expensive, and optimizing costs requires attention at multiple levels. The following strategies can reduce serving costs by 40% to 70% without sacrificing quality or availability.
6.1 Right-Sizing Quantization
As covered in Section S.4, quantization can cut GPU memory requirements by 4x. A 70B model quantized to 4 bits fits on a single A100-80GB instead of requiring two GPUs, halving your compute cost immediately. Benchmark your specific use case to verify that the quantized model meets your quality bar.
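The memory arithmetic behind that claim is straightforward. The figures below count weight storage only; KV cache and activation overhead come on top, which is why a ~35 GB quantized model still wants an 80 GB card rather than a 40 GB one.

```python
params = 70e9  # Llama 3.1 70B parameter count (approximate)

fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per weight

print(f"FP16 weights:  {fp16_gb:.0f} GB")   # → 140 GB
print(f"4-bit weights: {int4_gb:.0f} GB")   # → 35 GB
```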
6.2 Autoscaling with Kubernetes
For workloads with variable traffic patterns, autoscaling inference pods based on GPU metrics avoids paying for idle capacity. The following Kubernetes configuration defines a Horizontal Pod Autoscaler that scales vLLM replicas based on GPU utilization.
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-serving
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "75" # Scale up when average GPU util exceeds 75%
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Pods
value: 1
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Pods
value: 2
periodSeconds: 60
6.3 Spot and Preemptible Instances
Cloud providers offer GPU instances at 60% to 90% discounts through spot or preemptible pricing (AWS Spot Instances, GCP Spot and Preemptible VMs, Azure Spot VMs). For inference workloads that can tolerate occasional interruptions, running a portion of your replicas on spot instances significantly reduces costs. Configure your load balancer to drain connections gracefully when a spot instance receives a termination notice.
A cost-effective production setup might use 2 on-demand vLLM replicas as a baseline (guaranteed availability) plus 4 spot instance replicas for burst capacity. The load balancer health checks automatically remove terminated spot instances, and the Kubernetes autoscaler replaces them when new spot capacity becomes available. This pattern reduces costs by roughly 50% compared to running all 6 replicas on-demand.
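The "roughly 50%" figure falls out of simple arithmetic. The sketch below assumes a 70% spot discount, the midpoint of the 60% to 90% range quoted above; the exact savings shift with the discount your provider actually offers.

```python
on_demand_hourly = 1.0   # normalized price of one on-demand replica
spot_discount = 0.70     # assumed discount (midpoint of the 60-90% range)

all_on_demand = 6 * on_demand_hourly
mixed = 2 * on_demand_hourly + 4 * on_demand_hourly * (1 - spot_discount)

savings_pct = (1 - mixed / all_on_demand) * 100
print(f"mixed fleet cost: {mixed:.1f}x vs {all_on_demand:.1f}x all on-demand")
print(f"savings: {savings_pct:.0f}%")  # → savings: 47%
```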
7. Production Checklist
Before deploying an LLM serving cluster to production, verify each item in the following checklist.
- Health checks: Load balancer probes the /health endpoint and removes unhealthy instances automatically
- Graceful shutdown: Instances drain in-flight requests before terminating (set terminationGracePeriodSeconds to at least 120)
- Request timeouts: Proxy timeout set to at least 300 seconds for long completions; client-side timeout set slightly higher
- Rate limiting: Per-user or per-API-key rate limits prevent a single client from monopolizing the cluster
- Monitoring: Prometheus scraping GPU metrics, request latencies, and queue depth with Grafana dashboards and alerts
- Model versioning: Blue-green or canary deployment strategy for rolling out new model versions without downtime
- Logging: Structured request/response logging (excluding sensitive content) for debugging and audit
- Cost alerts: Budget alerts configured in your cloud provider to catch unexpected scaling events
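The health-check and graceful-shutdown items translate into Deployment settings like the following sketch. The probe path matches vLLM's /health endpoint and the port matches the earlier examples; the container name and probe timings are illustrative.

```yaml
# Excerpt from a vLLM Deployment spec: health probes and graceful shutdown
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # let in-flight generations drain
      containers:
      - name: vllm
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
          failureThreshold: 3              # ~15s of failures removes the pod from rotation
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 30
```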
Summary
Scaling LLM inference for production requires combining vertical scaling (tensor parallelism across GPUs) with horizontal scaling (multiple instances behind a load balancer). Least-connections routing handles the variable request durations inherent to autoregressive generation. Monitoring GPU utilization, queue depth, and time to first token provides the signals needed for autoscaling decisions. Cost optimization through quantization, spot instances, and right-sized autoscaling can reduce serving costs by 50% or more. With the techniques covered in this appendix, you have the tools to deploy, scale, and operate LLM inference in production across vLLM, TGI, and SGLang.