Platforms

Section 45.1

Part VIII's platforms split into evaluation services (which run the eval suites) and serving / observability services (which run the production traffic).

45.1.1 Evaluation platforms

45.1.2 Serving platforms

45.1.3 Observability platforms

Inference Scaling and Load Balancing

Big Picture

Moving from a single-GPU inference setup to a production-grade serving cluster requires solving three problems: scaling (adding more GPU instances to handle growing traffic), load balancing (distributing requests intelligently across instances), and monitoring (knowing when to scale and diagnosing bottlenecks). This section covers horizontal and vertical scaling strategies, load balancer configurations, GPU utilization monitoring, tensor parallelism across multiple GPUs, throughput benchmarking, and cost optimization techniques.

1. Vertical vs. Horizontal Scaling

Two fundamental approaches handle more inference traffic. Vertical scaling means using a more powerful machine: more GPUs per node, faster GPUs, or more memory per GPU. This is often the right first step because it avoids distributed-systems complexity. A single node with 4 or 8 GPUs running tensor parallelism can serve surprisingly high throughput.

Horizontal scaling means running multiple independent inference instances behind a load balancer. Each instance holds a full copy of the model (or a tensor-parallel shard) and processes requests independently. This approach scales linearly: doubling the number of instances roughly doubles your throughput capacity.

Horizontal scaling: load balancer distributes requests to three vLLM instances
Figure 45.1.1: Horizontal scaling architecture. A load balancer distributes incoming requests across three independent vLLM instances, each running on its own node with tensor parallelism across 2 GPUs.
Key Insight

Vertical scaling (tensor parallelism across GPUs within a node) reduces per-request latency because the model computation is split across GPUs. Horizontal scaling (multiple independent instances) increases total throughput without reducing per-request latency. In practice, you use both: tensor parallelism within each node for latency-sensitive models, and multiple nodes for capacity.

2. Multi-GPU Tensor Parallelism

For models too large for a single GPU, or when you need lower latency than a single GPU can provide, tensor parallelism splits the model across multiple GPUs within a single node. Each GPU holds a shard of the model's weight matrices and processes its portion of the computation in parallel. The GPUs synchronize via NVLink or PCIe after each attention and feed-forward layer.

All three serving frameworks support tensor parallelism through a simple configuration flag.

# vLLM: tensor parallelism across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90

# TGI: tensor parallelism across 4 GPUs
docker run --gpus all --shm-size 2g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 4

# SGLang: tensor parallelism across 4 GPUs
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --tp 4
Code Fragment 45.1.1: vLLM: tensor parallelism across 4 GPUs
Tip

Tensor parallelism works best when the number of attention heads is evenly divisible by the tensor parallel size. For example, Llama-3.1 70B has 64 attention heads, so it shards cleanly across 1, 2, 4, or 8 GPUs. Using 3 or 5 GPUs would require padding and waste memory.

3. Load Balancing Strategies

A load balancer sits in front of your inference instances and routes incoming requests. The choice of routing algorithm significantly affects tail latency and throughput. The following configurations show NGINX and Envoy setups for LLM serving.

3.1 NGINX Configuration

The following NGINX configuration uses least-connections routing, which sends each new request to the instance with the fewest active connections. This works well for LLM serving because requests have variable processing times.

# /etc/nginx/nginx.conf
upstream vllm_cluster {
    least_conn;  # Route to instance with fewest active connections

    server node-a:8000 max_fails=3 fail_timeout=30s;
    server node-b:8000 max_fails=3 fail_timeout=30s;
    server node-c:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    # Increase timeouts for long-running generation requests
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location /v1/ {
        proxy_pass http://vllm_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Enable streaming (SSE)
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }

    location /health {
        proxy_pass http://vllm_cluster;
    }
}
Code Fragment 45.1.2: /etc/nginx/nginx.conf

3.2 Routing Algorithm Comparison

Different routing algorithms are suited to different workload patterns. The table below compares the most common options for LLM serving.

Algorithm Behavior Best For
Round RobinCycles through instances sequentiallyUniform request sizes; simple deployments
Least ConnectionsRoutes to the instance with fewest active requestsVariable output lengths; general-purpose LLM serving
Least LatencyRoutes to the instance with the lowest recent response timeHeterogeneous GPU types in the same cluster
Session AffinityRoutes all requests from the same user to the same instanceMulti-turn chat with server-side KV cache persistence
Figure 45.1.2: Load balancing algorithms for LLM serving. Least-connections is the recommended default because LLM request durations vary widely based on output length.
Warning

Avoid round-robin load balancing for LLM serving. Because requests have highly variable processing times (a 10-token response vs. a 2000-token response), round-robin frequently overloads one instance while others sit idle. Least-connections routing adapts to this variance automatically.

4. GPU Utilization Monitoring

Effective scaling requires visibility into GPU utilization, memory consumption, and queue depth. The following Python script collects GPU metrics using the pynvml library and exposes them in a format suitable for Prometheus or custom dashboards.

import pynvml
import time
import json

def collect_gpu_metrics():
    """Collect GPU metrics for all available devices."""
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    metrics = []

    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts to watts

        metrics.append({
            "gpu_index": i,
            "gpu_utilization_pct": util.gpu,
            "memory_used_gb": round(mem_info.used / (1024 ** 3), 2),
            "memory_total_gb": round(mem_info.total / (1024 ** 3), 2),
            "memory_utilization_pct": round(mem_info.used / mem_info.total * 100, 1),
            "temperature_c": temp,
            "power_watts": round(power, 1),
        })

    pynvml.nvmlShutdown()
    return metrics

# Collect and display metrics every 5 seconds
while True:
    for m in collect_gpu_metrics():
        print(json.dumps(m))
    time.sleep(5)
Output: Load test results (100 concurrent users, 60s): Requests completed: 2,847 Avg latency: 1.2s p50: 0.9s, p95: 2.1s, p99: 3.4s Errors: 0 (0%) Throughput: 47.4 req/s
Code Fragment 45.1.3: Import pynvml

For production deployments, integrate these metrics with Prometheus and Grafana. vLLM and TGI both expose Prometheus-format metrics at their /metrics endpoints. The key metrics to track for scaling decisions are listed below.

Metric Scaling Signal
Request queue depthScale up when queue consistently exceeds 2x your max batch size
GPU memory utilizationAbove 95% indicates risk of OOM under burst load
Time to first token (TTFT)Rising TTFT indicates the prefill phase is becoming a bottleneck
Tokens per second (throughput)Flattening throughput despite increasing requests means capacity limit
GPU compute utilizationBelow 60% with high queue depth suggests batching is suboptimal
Figure 45.1.3: Key metrics for LLM serving autoscaling decisions.

5. Benchmarking Throughput

Before deploying to production, benchmark your serving setup to understand its capacity limits. The following script sends concurrent requests to a vLLM or TGI server and measures throughput, latency percentiles, and time to first token.

import asyncio
import aiohttp
import time
import statistics

async def benchmark_serving(
    url: str,
    num_requests: int = 100,
    concurrency: int = 16,
    max_tokens: int = 128,
):
    """Benchmark an OpenAI-compatible serving endpoint."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []
    first_token_times = []
    total_tokens = 0

    async def send_request(session, prompt):
        nonlocal total_tokens
        async with semaphore:
            start = time.perf_counter()
            first_token_time = None

            async with session.post(
                f"{url}/v1/completions",
                json={
                    "model": "default",
                    "prompt": prompt,
                    "max_tokens": max_tokens,
                    "temperature": 0.7,
                    "stream": True,
                },
            ) as resp:
                token_count = 0
                async for line in resp.content:
                    decoded = line.decode().strip()
                    if decoded.startswith("data:") and "[DONE]" not in decoded:
                        if first_token_time is None:
                            first_token_time = time.perf_counter() - start
                        token_count += 1

            elapsed = time.perf_counter() - start
            latencies.append(elapsed)
            if first_token_time:
                first_token_times.append(first_token_time)
            total_tokens += token_count

    prompts = [f"Write a short paragraph about topic number {i}." for i in range(num_requests)]

    start_time = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, p) for p in prompts]
        await asyncio.gather(*tasks)
    wall_time = time.perf_counter() - start_time

    print(f"Total requests:       {num_requests}")
    print(f"Concurrency:          {concurrency}")
    print(f"Wall clock time:      {wall_time:.2f}s")
    print(f"Throughput:           {total_tokens / wall_time:.1f} tokens/sec")
    print(f"Median latency:       {statistics.median(latencies):.3f}s")
    print(f"P95 latency:          {sorted(latencies)[int(0.95 * len(latencies))]:.3f}s")
    print(f"P99 latency:          {sorted(latencies)[int(0.99 * len(latencies))]:.3f}s")
    if first_token_times:
        print(f"Median TTFT:          {statistics.median(first_token_times):.3f}s")

# Run the benchmark
asyncio.run(benchmark_serving("http://localhost:8000", num_requests=200, concurrency=32))
Output: Auto-scaling metrics: Current replicas: 2 Queue depth: 45 Avg GPU utilization: 87% Scaling decision: scale up to 3 replicas New replicas: 3 (scaled in 45s)
Code Fragment 45.1.4: Import asyncio

6. Cost Optimization Strategies

GPU inference is expensive, and optimizing costs requires attention at multiple levels. The following strategies can reduce serving costs by 40% to 70% without sacrificing quality or availability.

6.1 Right-Sizing Quantization

As covered in Section 9.2, quantization can cut GPU memory requirements by 4x. A 70B model quantized to 4 bits fits on a single A100-80GB instead of requiring two GPUs, halving your compute cost immediately. Benchmark your specific use case to verify that the quantized model meets your quality bar.

6.2 Autoscaling with Kubernetes

For workloads with variable traffic patterns, autoscaling inference pods based on GPU metrics avoids paying for idle capacity. The following Kubernetes configuration defines a Horizontal Pod Autoscaler that scales vLLM replicas based on GPU utilization.

# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serving
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"    # Scale up when average GPU util exceeds 75%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
Code Fragment 45.1.5: vllm-hpa.yaml

6.3 Spot and Preemptible Instances

Cloud providers offer GPU instances at 60% to 90% discounts through spot (AWS), preemptible (GCP), or low-priority (Azure) pricing. For inference workloads that can tolerate occasional interruptions, running a portion of your replicas on spot instances significantly reduces costs. Configure your load balancer to drain connections gracefully when a spot instance receives a termination notice.

Real-World Scenario
Mixing On-Demand and Spot Replicas to Cut Cost

A cost-effective production setup might use 2 on-demand vLLM replicas as a baseline (guaranteed availability) plus 4 spot instance replicas for burst capacity. The load balancer health checks automatically remove terminated spot instances, and the Kubernetes autoscaler replaces them when new spot capacity becomes available. This pattern reduces costs by roughly 50% compared to running all 6 replicas on-demand.

7. Production Checklist

Before deploying an LLM serving cluster to production, verify each item in the following checklist.

Note: Production Readiness Checklist
  • Health checks: Load balancer probes the /health endpoint and removes unhealthy instances automatically
  • Graceful shutdown: Instances drain in-flight requests before terminating (set terminationGracePeriodSeconds to at least 120)
  • Request timeouts: Proxy timeout set to at least 300 seconds for long completions; client-side timeout set slightly higher
  • Rate limiting: Per-user or per-API-key rate limits prevent a single client from monopolizing the cluster
  • Monitoring: Prometheus scraping GPU metrics, request latencies, and queue depth with Grafana dashboards and alerts
  • Model versioning: Blue-green or canary deployment strategy for rolling out new model versions without downtime
  • Logging: Structured request/response logging (excluding sensitive content) for debugging and audit
  • Cost alerts: Budget alerts configured in your cloud provider to catch unexpected scaling events

Summary

Scaling LLM inference for production requires combining vertical scaling (tensor parallelism across GPUs) with horizontal scaling (multiple instances behind a load balancer). Least-connections routing handles the variable request durations inherent to autoregressive generation. Monitoring GPU utilization, queue depth, and time to first token provides the signals needed for autoscaling decisions. Cost optimization through quantization, spot instances, and right-sized autoscaling can reduce serving costs by 50% or more. With the techniques covered in this appendix, you have the tools to deploy, scale, and operate LLM inference in production across vLLM, TGI, and SGLang.

Production Data Pipelines and Serving at Scale

Big Picture

Production LLM systems require more than a trained model and a serving endpoint. They need data pipelines that continuously ingest, validate, and transform data; orchestration that coordinates training, evaluation, and deployment; monitoring that detects data drift and model degradation; and infrastructure that scales to handle unpredictable traffic. This section brings together the components from Sections O.1 through O.4 into end-to-end production architectures, covering pipeline orchestration with Airflow, CI/CD for ML, multi-model serving patterns, and observability for LLM systems.

End-to-End Pipeline Architecture

A production LLM pipeline typically consists of four stages: data ingestion (collecting raw data from APIs, databases, and event streams), data preparation (cleaning, filtering, and tokenizing into training-ready format), model training (fine-tuning or continued pretraining on prepared data), and deployment (packaging, validating, and serving the updated model). Each stage must be automated, idempotent, and observable.

End-to-end LLM pipeline: ingest, validate, prepare, train, deploy
Figure 45.1.4: End-to-end LLM pipeline. Each stage is an independently testable component, orchestrated by a workflow engine and monitored for data quality and model performance.

Pipeline Orchestration with Airflow

Apache Airflow is the most widely used orchestrator for ML pipelines. It represents workflows as directed acyclic graphs (DAGs) of tasks, with built-in support for scheduling, retries, alerting, and dependency management. For LLM pipelines, Airflow coordinates the steps that connect data preparation (using Spark on Databricks, as in Section 19.4 (Datasets & Benchmarks)) with distributed training (using Ray, as in Section 19.3 (Datasets & Benchmarks)) and deployment to an inference engine (as covered in Section 10.7 (vLLM Deep Dive)).

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "ml-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
    "email": ["ml-team@company.com"],
}

with DAG(
    dag_id="llm_training_pipeline",
    default_args=default_args,
    schedule_interval="@weekly",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["llm", "training"],
) as dag:

    # Step 1: Run data validation on the latest training data
    validate_data = DatabricksRunNowOperator(
        task_id="validate_training_data",
        databricks_conn_id="databricks_default",
        job_id=12345,  # Databricks job that runs Great Expectations
    )

    # Step 2: Prepare training data (tokenize, filter, split)
    prepare_data = DatabricksRunNowOperator(
        task_id="prepare_training_data",
        databricks_conn_id="databricks_default",
        job_id=12346,
    )

    # Step 3: Launch distributed training on Ray cluster
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="ray-training-job",
        image="company/ray-train:latest",
        cmds=["python", "train.py"],
        arguments=["--config", "configs/llama_sft.yaml"],
        namespace="ml-training",
        get_logs=True,
        startup_timeout_seconds=600,
    )

    # Step 4: Evaluate the trained model
    evaluate_model = KubernetesPodOperator(
        task_id="evaluate_model",
        name="model-evaluation",
        image="company/eval-harness:latest",
        cmds=["python", "evaluate.py"],
        namespace="ml-training",
    )

    # Define task dependencies
    validate_data >> prepare_data >> train_model >> evaluate_model
Code Fragment 45.1.6: Import from airflow
Tip

Use Airflow's ShortCircuitOperator to skip downstream tasks when a validation step fails. For example, if data validation detects that fewer than 1,000 new examples have been added since the last training run, short-circuit the DAG to avoid wasting GPU hours on a training run that will produce negligible improvement.

Data Validation with Great Expectations

Data quality is the most common root cause of model degradation in production. Great Expectations is an open-source framework that lets you define expectations (assertions about your data) and run them automatically before each training run. For LLM training data, useful expectations include checking that instruction and response columns are non-null, that text lengths fall within expected ranges, and that quality scores meet a minimum threshold.

import great_expectations as gx

context = gx.get_context()

# Connect to a Delta Lake data source
datasource = context.data_sources.add_spark(name="delta_lake")
data_asset = datasource.add_dataframe_asset(name="instruction_pairs")

# Define expectations for training data quality
batch = data_asset.get_batch()
validator = context.get_validator(batch=batch)

# No null values in critical columns
validator.expect_column_values_to_not_be_null("instruction")
validator.expect_column_values_to_not_be_null("response")

# Text length within expected range (catch truncation or corruption)
validator.expect_column_value_lengths_to_be_between(
    "instruction", min_value=10, max_value=4096
)
validator.expect_column_value_lengths_to_be_between(
    "response", min_value=5, max_value=8192
)

# Quality score meets minimum threshold
validator.expect_column_values_to_be_between(
    "quality_score", min_value=0.6, max_value=1.0
)

# Run validation and check results
results = validator.validate()
if not results.success:
    failing = [r for r in results.results if not r.success]
    raise ValueError(
        f"Data validation failed: {len(failing)} expectations not met"
    )
Code Fragment 45.1.7: Connect to a Delta Lake data source

CI/CD for ML: Automated Model Evaluation

Continuous integration for ML extends beyond code tests to include model evaluation. When a training pipeline produces a new model version, an automated evaluation step should run a standardized benchmark suite before the model is promoted to production. This prevents regressions where a newly trained model performs worse on specific tasks despite having a lower overall loss.

# .github/workflows/model-evaluation.yml
name: Model Evaluation Pipeline
on:
workflow_dispatch:
inputs:
model_version:
description: 'MLflow model version to evaluate'
required: true
model_name:
description: 'Registered model name'
default: 'ml_catalog.llm_models.llama_sft'
jobs:
evaluate:
runs-on: [self-hosted, gpu]
steps:
- uses: actions/checkout@v4
- name: Install dependencies
run: pip install mlflow vllm lm-eval
- name: Download model from MLflow
run: |
python scripts/download_model.py \
--model-name ${{ inputs.model_name }} \
--version ${{ inputs.model_version }} \
--output-dir ./model
- name: Run evaluation benchmarks
run: |
python -m lm_eval \
--model vllm \
--model_args pretrained=./model \
--tasks mmlu,hellaswag,truthfulqa \
--batch_size auto \
--output_path ./eval_results
- name: Check quality gates
run: |
python scripts/check_quality_gates.py \
--results ./eval_results \
--min-mmlu 0.65 \
--min-hellaswag 0.75 \
--max-regression 0.02
- name: Promote model if gates pass
if: success()
run: |
python scripts/promote_model.py \
--model-name ${{ inputs.model_name }} \
--version ${{ inputs.model_version }} \
--stage production
Code Fragment 45.1.8: .github/workflows/model-evaluation.yml
Key Insight

The quality gate pattern is essential for production ML. Define explicit thresholds for each benchmark, and include a maximum regression check that compares the new model against the currently deployed version. A model can have excellent absolute scores but still be a regression on specific tasks that matter to your users. Always compare against the production baseline, not just a fixed threshold.

Multi-Model Serving Architecture

Production LLM applications rarely serve a single model. A typical deployment includes a primary generation model, an embedding model for retrieval, a reranker, and possibly a smaller classifier for intent routing. Managing these models requires an architecture that handles independent scaling, version management, and traffic routing for each model.

# Kubernetes deployment for multi-model serving with Ray Serve
# deploy_serving.py

from ray import serve
from ray.serve.config import AutoscalingConfig

@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=2,
        max_replicas=10,
        target_ongoing_requests=5,
    ),
    ray_actor_options={"num_gpus": 1},
)
class GenerationModel:
    def __init__(self):
        from vllm import LLM
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=1, max_replicas=4,
    ),
    ray_actor_options={"num_gpus": 0.5},
)
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5")

    async def embed(self, texts):
        return self.model.encode(texts).tolist()

@serve.deployment(num_replicas=1)
class Router:
    def __init__(self, generator, embedder):
        self.generator = generator
        self.embedder = embedder

    async def __call__(self, request):
        data = await request.json()
        endpoint = data.get("endpoint", "generate")

        if endpoint == "embed":
            return await self.embedder.embed.remote(data["texts"])
        elif endpoint == "generate":
            return await self.generator.generate.remote(data["prompt"])

# Compose the application
generator = GenerationModel.bind()
embedder = EmbeddingModel.bind()
app = Router.bind(generator=generator, embedder=embedder)
serve.run(app, host="0.0.0.0", port=8000)
Code Fragment 45.1.9: Kubernetes deployment for multi-model serving with Ray Serve

Observability and Monitoring

LLM systems require monitoring beyond traditional application metrics. In addition to latency, throughput, and error rates, you need to track token usage (for cost management), output quality (for detecting model degradation), and data drift (for knowing when to retrain). Prometheus and Grafana provide the infrastructure layer, while specialized LLM observability tools like LangSmith, Langfuse, or Phoenix handle trace-level analysis of prompt chains and retrieval quality.

# Prometheus metrics for LLM serving
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Request-level metrics
REQUEST_COUNT = Counter(
    "llm_requests_total", "Total LLM requests",
    ["model", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "llm_request_duration_seconds", "Request latency",
    ["model", "endpoint"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)
TOKENS_GENERATED = Counter(
    "llm_tokens_generated_total", "Total tokens generated",
    ["model"],
)
ACTIVE_REQUESTS = Gauge(
    "llm_active_requests", "Currently processing requests",
    ["model"],
)

# Start metrics endpoint
start_http_server(9090)

# Instrument your serving code
import time

async def serve_request(model_name, prompt):
    ACTIVE_REQUESTS.labels(model=model_name).inc()
    start = time.perf_counter()
    try:
        response = await generate(prompt)
        duration = time.perf_counter() - start

        REQUEST_COUNT.labels(
            model=model_name, endpoint="generate", status="success"
        ).inc()
        REQUEST_LATENCY.labels(
            model=model_name, endpoint="generate"
        ).observe(duration)
        TOKENS_GENERATED.labels(model=model_name).inc(
            response.usage.completion_tokens
        )
        return response
    except Exception as e:
        REQUEST_COUNT.labels(
            model=model_name, endpoint="generate", status="error"
        ).inc()
        raise
    finally:
        ACTIVE_REQUESTS.labels(model=model_name).dec()
Code Fragment 45.1.10: Prometheus metrics for LLM serving
Warning

Traditional ML monitoring tools are not sufficient for LLM systems. Input/output data drift detection requires embedding-based methods (not just statistical tests on numeric features), and output quality monitoring requires either automated evaluation with a judge model or structured human feedback loops. Budget for LLM-specific observability from the start; retrofitting it after launch is significantly harder.

Summary

Production LLM systems are distributed systems that require the same rigor as any critical infrastructure: automated pipelines, data validation, CI/CD with quality gates, multi-model serving, and comprehensive observability. Airflow or similar orchestrators coordinate the flow from data ingestion through training to deployment. Great Expectations catches data quality issues before they reach the model. CI/CD pipelines with benchmark evaluation prevent regressions. Multi-model serving architectures handle the reality that production LLM applications depend on several models working in concert. Finally, Prometheus metrics and LLM-specific observability tools provide the visibility needed to maintain reliability at scale. Together with the Databricks platform (Section 19.4 (Datasets & Benchmarks)), Delta Lake storage (Section 19.4 (Datasets & Benchmarks)), Ray compute (Section 19.3 (Datasets & Benchmarks)), and feature stores from the broader MLOps ecosystem, these components form a complete infrastructure stack for production LLM applications.

What's Next?

In the next section, Section 45.2: Libraries & Frameworks, we build on the material covered here.

Further Reading

Eval Platforms

Hugging Face (2024). "lm-evaluation-harness." github.com/EleutherAI/lm-evaluation-harness. The reference open-source LLM evaluation harness.
OpenAI (2024). "OpenAI Evals." github.com/openai/evals. Reference evaluation framework from OpenAI.
Langfuse (2024). "Langfuse Documentation." langfuse.com/docs. Reference open-source LLM observability platform.