Section 42.9a: OTel Dashboards for LLM Operations

My dashboard had p50, p95, p99, error rates, token counts, and seven flavors of cost. It also had no alarms. I learned what a 4 a.m. page looks like the hard way.
Eval, Alarm-Wired AI Agent

Big Picture

This section continues Section 42.9, which covered why OpenTelemetry fits LLM systems and how to instrument the building blocks: API calls, trace propagation through agent chains, token tracking and cost attribution, and auto-instrumentation with OpenLLMetry. Here we put those traces to work in dashboards that surface latency, cost, error, and quality signals at a glance. Every LLM serving stack needs these dashboards in production: they are the difference between a chatbot that quietly degrades and one whose latency drift you catch early.

Prerequisites

This section builds on Section 42.9, which set up the OpenTelemetry foundations for LLM observability. You should be familiar with OTel spans, attributes, and exporters from that section, and have general knowledge of evaluation metrics (covered earlier in Chapter 42) and prompt/response logging conventions.

Fun Fact: The Dashboard Nobody Watched

A widely circulated 2024 post-mortem from an LLM SaaS startup described how they shipped a 47-panel observability dashboard, then went six months without anyone opening it. The outage that finally got it noticed had been visible on panel 23 for three weeks. The lesson: a dashboard with too many panels is a dashboard with zero panels. The fix was three big graphs at the top, the rest collapsed by default, and a single Slack alert that linked directly to whichever graph was on fire.

Figure: An OTel-based LLM observability stack. Spans flow from the LLM application to an OTel collector, then fan out to dashboards and metrics (Grafana), traces (Tempo/Jaeger), and alerts (PagerDuty).

42.9.6 Building OTel Dashboards for LLM Operations

Raw traces are useful for debugging individual requests, but operational excellence requires aggregated dashboards that show system health at a glance. The combination of OTel metrics and span-derived analytics enables dashboards that answer the questions every LLM operations team needs answered: What is the current P50/P95/P99 latency? How many tokens are we consuming per hour? What is our error rate by model and provider? Which features are the most expensive?

Grafana is a common dashboard tool for OTel data. With Grafana Tempo for traces and Prometheus (or Mimir) for metrics, you can build unified dashboards that correlate latency spikes with token usage anomalies. The following example shows how to define custom OTel metrics that power these dashboards.

import time

from openai import AsyncOpenAI
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Set up metrics export
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317"),
    export_interval_millis=10000,  # Export every 10 seconds
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter("llm-app.operations")

# Define operational metrics
llm_latency = meter.create_histogram(
    "gen_ai.request.duration",
    unit="s",
    description="LLM request duration in seconds",
)

llm_tokens = meter.create_counter(
    "gen_ai.tokens.consumed",
    description="Total tokens consumed by model and type",
)

llm_errors = meter.create_counter(
    "gen_ai.errors",
    description="LLM API errors by type and model",
)

active_requests = meter.create_up_down_counter(
    "gen_ai.requests.active",
    description="Currently in-flight LLM requests",
)

# Async OpenAI client (assumed here; reads OPENAI_API_KEY from the environment)
client = AsyncOpenAI()


# Usage in application code
async def monitored_llm_call(model, messages, **kwargs):
    """LLM call with full metrics instrumentation."""
    labels = {"model": model, "provider": "openai"}
    active_requests.add(1, labels)
    start = time.monotonic()

    try:
        response = await client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        # Latency is recorded for successful calls only
        duration = time.monotonic() - start
        llm_latency.record(duration, labels)
        llm_tokens.add(response.usage.prompt_tokens,
                       {**labels, "token_type": "input"})
        llm_tokens.add(response.usage.completion_tokens,
                       {**labels, "token_type": "output"})
        return response

    except Exception as e:
        # Count the failure by exception class, then re-raise for the caller
        llm_errors.add(1, {**labels, "error_type": type(e).__name__})
        raise

    finally:
        # Always decrement, so the in-flight gauge cannot leak on errors
        active_requests.add(-1, labels)

Code Fragment 42.9.6a: Custom OTel metrics for LLM operations dashboards. The histogram tracks latency distributions (enabling P50/P95/P99 calculations), the counter tracks token consumption by model and type, and the up-down counter tracks in-flight requests for concurrency monitoring. These metrics export to Prometheus or any OTel-compatible metrics backend.
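To make the link between the latency histogram and the P50/P95/P99 panels concrete, here is a minimal sketch of how a quantile can be estimated from cumulative bucket counts by linear interpolation, in the spirit of what Prometheus's histogram_quantile() does. The function name, the bucket boundaries, and the counts are illustrative assumptions, not values from a real system.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs, sorted
    by upper bound, where the last upper bound is math.inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: the best we can
                # report is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for 100 requests
example_buckets = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 99), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{round(q * 100)}: {estimate_quantile(q, example_buckets):.2f}s")
```

The estimate is only as fine-grained as the bucket boundaries, so it is worth choosing boundaries that bracket your latency targets.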

A well-designed LLM operations dashboard typically includes four panels: (1) a latency panel showing P50, P95, and P99 latency by model, with alerting thresholds; (2) a token consumption panel showing input and output tokens per hour, broken down by model and feature; (3) an error rate panel showing errors by type (rate limit, timeout, server error) with trend lines; and (4) a cost panel showing estimated spend per hour and projected monthly cost. The drift monitoring from Section 44.6 adds a fifth dimension: quality metrics derived from automated evaluation scores.
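The arithmetic behind the cost panel can be sketched as follows. This is an illustration under stated assumptions: the model name and per-1K-token prices are hypothetical placeholders, and the monthly projection simply multiplies the last hour's spend by 720 hours (24 × 30), which ignores daily and weekly traffic patterns.

```python
# Hypothetical per-1K-token prices in USD; substitute your provider's real rates.
PRICES_PER_1K = {
    "example-model": {"input": 0.0005, "output": 0.0015},
}
HOURS_PER_MONTH = 24 * 30  # 720, a flat 30-day month


def hourly_cost(model, input_tokens, output_tokens, prices=PRICES_PER_1K):
    """Estimated spend in USD for the tokens consumed in one hour."""
    rates = prices[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]


def projected_monthly_cost(cost_last_hour):
    """Naive projection: assume every hour of the month looks like the last one."""
    return cost_last_hour * HOURS_PER_MONTH


last_hour = hourly_cost("example-model", input_tokens=2_000_000, output_tokens=500_000)
print(f"Estimated spend last hour: ${last_hour:.2f}")
print(f"Projected monthly spend:   ${projected_monthly_cost(last_hour):.2f}")
```

Input and output tokens are priced separately because providers typically charge different rates for each, which is why the token counter in the metrics code carries a token_type label.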

Lab: End-to-End MLOps Pipeline with MLflow

Duration: ~60 minutes | Level: Intermediate

Objective

Set up a complete experiment tracking workflow using MLflow. You will first build a manual logging harness (the "from scratch" baseline for understanding what gets tracked), then use MLflow's tracking API and model registry (the "right tool"). By the end, you will have multiple tracked runs that you can compare in the MLflow UI.

What You'll Practice

Setting up MLflow tracking with a local backend
Logging parameters, metrics, and artifacts to experiment runs
Comparing runs across hyperparameter configurations
Registering and versioning model artifacts in the MLflow Model Registry

Setup

Install MLflow and a lightweight model library for the experiment, for example with pip install mlflow scikit-learn.

Steps

Step 1: Manual experiment logging (from scratch)

Before relying on MLflow, build a simple JSON-based logging harness to understand what experiment tracking actually records. This makes the value of a dedicated tracking server concrete.

import json
import time
import uuid
from datetime import datetime
from pathlib import Path


class ManualTracker:
    """A minimal experiment tracker using JSON files."""

    def __init__(self, experiment_dir="manual_experiments"):
        self.dir = Path(experiment_dir)
        self.dir.mkdir(exist_ok=True)

    def log_run(self, params, metrics, artifacts=None):
        # Timestamp plus a short random suffix, so two runs logged within
        # the same second do not overwrite each other's file
        run_id = f"run_{int(time.time())}_{uuid.uuid4().hex[:6]}"
        record = {
            "run_id": run_id,
            "timestamp": datetime.now().isoformat(),
            "params": params,
            "metrics": metrics,
            "artifacts": artifacts or [],
        }
        path = self.dir / f"{run_id}.json"
        path.write_text(json.dumps(record, indent=2))
        print(f"Logged run {run_id}: accuracy={metrics.get('accuracy', 'N/A')}")
        return run_id

    def compare_runs(self):
        # Load every logged run, then print one comparison table
        runs = [
            json.loads(f.read_text())
            for f in sorted(self.dir.glob("run_*.json"))
        ]
        print(f"\n{'Run ID':<24} {'Accuracy':<10} {'Params'}")
        print("-" * 60)
        for r in runs:
            acc = r["metrics"].get("accuracy")
            # Format only numeric values; a missing accuracy prints as N/A
            acc_text = f"{acc:.4f}" if isinstance(acc, (int, float)) else "N/A"
            print(f"{r['run_id']:<24} {acc_text:<10} {r['params']}")
        return runs


# Quick test
tracker = ManualTracker()
tracker.log_run(
    params={"model": "logistic_regression", "C": 1.0},
    metrics={"accuracy": 0.85, "f1": 0.83},
)
tracker.log_run(
    params={"model": "logistic_regression", "C": 0.1},
    metrics={"accuracy": 0.82, "f1": 0.80},
)
tracker.compare_runs()

Code Fragment 42.9.7: Defining ManualTracker

Step 2: Set up MLflow and log experiments

Now switch to MLflow for proper experiment tracking with a UI, artifact storage, and run comparison.

import mlflow
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Configure MLflow (local file-based tracking)
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("iris-classification")

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Run experiments with different hyperparameters
for C_value in [0.01, 0.1, 1.0, 10.0]:
    with mlflow.start_run(run_name=f"logreg_C={C_value}"):
        # Log parameters
        mlflow.log_param("model_type", "LogisticRegression")
        mlflow.log_param("C", C_value)
        mlflow.log_param("solver", "lbfgs")

        # Train
        model = LogisticRegression(C=C_value, solver="lbfgs", max_iter=200)
        model.fit(X_train, y_train)

        # Evaluate
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        f1 = f1_score(y_test, preds, average="weighted")

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_weighted", f1)

        # Log the model as an artifact
        mlflow.sklearn.log_model(model, "model")

        print(f"C={C_value:>5}: accuracy={acc:.4f}, f1={f1:.4f}")

Output:
C= 0.01: accuracy=0.8333, f1=0.8310
C=  0.1: accuracy=0.9667, f1=0.9666
C=  1.0: accuracy=1.0000, f1=1.0000
C= 10.0: accuracy=0.9667, f1=0.9666

Code Fragment 42.9.8: Logging parameters, metrics, and the trained model for four LogisticRegression runs with MLflow

Step 3: Compare runs and identify the best model

Use the MLflow search API to query runs programmatically and find the best-performing configuration.

import mlflow

# Query all runs from the experiment, best accuracy first
# (search_runs returns a pandas DataFrame by default)
experiment = mlflow.get_experiment_by_name("iris-classification")
runs_df = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
)

# Display comparison table
cols = ["run_id", "params.C", "metrics.accuracy", "metrics.f1_weighted"]
available = [c for c in cols if c in runs_df.columns]
print(runs_df[available].to_string(index=False))

# Identify best run (first row, because of the descending sort)
best_run = runs_df.iloc[0]
print(f"\nBest run: {best_run['run_id']}")
print(f" C = {best_run['params.C']}")
print(f" Accuracy = {best_run['metrics.accuracy']:.4f}")
print(f" F1 = {best_run['metrics.f1_weighted']:.4f}")

Output:
run_id                            params.C  metrics.accuracy  metrics.f1_weighted
a3b8c2d1e4f5678901234567890abcde       1.0            1.0000               1.0000
b4c9d3e2f5a6789012345678901bcdef      10.0            0.9667               0.9666
c5d0e4f3a6b7890123456789012cdef0       0.1            0.9667               0.9666
d6e1f5a4b7c8901234567890123def01      0.01            0.8333               0.8310

Best run: a3b8c2d1e4f5678901234567890abcde
 C = 1.0
 Accuracy = 1.0000
 F1 = 1.0000

Code Fragment 42.9.9: Query all runs from the experiment

Step 4: Register the best model

Promote the best model to the MLflow Model Registry, assigning it a version and stage label for deployment tracking.

import mlflow
from mlflow.tracking import MlflowClient
from sklearn.metrics import accuracy_score

# Reuses best_run (Step 3) and X_test, y_test (Step 2) from the same session

# Register the best model
best_run_id = best_run["run_id"]
model_uri = f"runs:/{best_run_id}/model"
result = mlflow.register_model(model_uri, "iris-classifier")
print(f"Registered model: {result.name}, version: {result.version}")

# Transition to staging.
# Note: model stages are deprecated since MLflow 2.9 in favor of aliases
# (see MlflowClient.set_registered_model_alias), so newer versions may warn here.
client = MlflowClient()
client.transition_model_version_stage(
    name="iris-classifier",
    version=result.version,
    stage="Staging",
)
print(f"Model version {result.version} transitioned to Staging")

# Load and verify the registered model
loaded_model = mlflow.sklearn.load_model("models:/iris-classifier/Staging")
verify_preds = loaded_model.predict(X_test)
verify_acc = accuracy_score(y_test, verify_preds)
print(f"Verified accuracy from registry: {verify_acc:.4f}")
print("\nTo view the UI, run: mlflow ui --port 5000")

Output:
Registered model: iris-classifier, version: 1
Model version 1 transitioned to Staging
Verified accuracy from registry: 1.0000

To view the UI, run: mlflow ui --port 5000

Code Fragment 42.9.10: Register the best model

Stretch Goals

Add MLflow autologging (mlflow.sklearn.autolog()) and compare the captured metrics with your manual logging.
Log a confusion matrix plot as an artifact using mlflow.log_figure() and view it in the MLflow UI.
Set up a model promotion workflow: automatically transition a model from Staging to Production only if its accuracy exceeds a threshold.
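As a starting point for the promotion workflow in the last stretch goal, the gating logic can be kept separate from the registry call. The sketch below is one possible shape, not MLflow API: promote_if_qualified and the promote callback are hypothetical names, and in a real workflow the callback would wrap whichever registry operation you use to mark a version as Production.

```python
def promote_if_qualified(accuracy, threshold, promote):
    """Call promote() only when accuracy strictly exceeds threshold.

    Returns True if the promotion was triggered, False otherwise.
    """
    if accuracy > threshold:
        promote()
        return True
    return False


# Stand-in for the real registry call, so the gate can be tested offline
events = []
promoted = promote_if_qualified(
    accuracy=0.97,
    threshold=0.95,
    promote=lambda: events.append("promoted iris-classifier to Production"),
)
print(promoted, events)
```

Keeping the decision in a plain function makes the threshold rule easy to unit test without a tracking server.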

Key Takeaways

OpenTelemetry provides the standardized observability backbone for LLM applications, with GenAI Semantic Conventions defining a common vocabulary for LLM telemetry.
Auto-instrumentation libraries (OpenLLMetry, Traceloop) add tracing to LLM calls with minimal code changes, capturing token counts, model IDs, and latency automatically.
Trace propagation through agent chains connects multi-step LLM workflows into a single trace, making it possible to debug complex agent behaviors end to end.
Token tracking and cost attribution use OTel span attributes to allocate LLM costs to specific features, teams, or customers.
Custom OTel dashboards for LLM operations should track p50/p95/p99 latency, tokens per second, cost per request, and error rates by model and endpoint.

Exercises

Exercise 30.6.1: Basic OTel Instrumentation (Coding)

Set up OpenTelemetry tracing for a simple LLM application that makes a single chat completion call. Export traces to the console using ConsoleSpanExporter. Verify that the span includes the model name, token counts, and latency.

Answer Sketch

Initialize a TracerProvider with a SimpleSpanProcessor(ConsoleSpanExporter()). Create a span with tracer.start_as_current_span("gen_ai.chat"), set attributes for model and tokens after the API call, and verify the output includes all expected fields. The console output will show the span as a JSON object with the attributes you set.

Exercise 30.6.2: RAG Pipeline Tracing (Coding)

Instrument a RAG pipeline with nested spans for embedding, vector search, and generation. Use a trace visualization tool (Jaeger or the console exporter) to verify that the spans form a proper parent-child hierarchy.

Answer Sketch

Create a parent span rag.pipeline, then use tracer.start_as_current_span() for each sub-step inside the parent's with block. OTel automatically links child spans to the active parent via context propagation. In Jaeger, you should see a waterfall view with the pipeline span at the top and sub-steps nested beneath it.

Exercise 30.6.3: Cost Attribution Dashboard (Project)

Build a multi-tenant cost attribution system using OTel metrics. Create counters for token usage and cost, labeled by tenant ID and feature name. Set up a Grafana dashboard (or equivalent) that shows per-tenant cost breakdown and alerts when any tenant exceeds their monthly budget.

Answer Sketch

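Use the TokenTracker pattern from Code Fragment 42.9.4a. Add tenant and feature labels to all metric emissions. In Grafana, create a panel with PromQL: sum by (tenant_id) (increase(gen_ai_cost_usd_total[1h])) * 720 to project monthly costs from the last hour's spend (720 = 24 × 30 hours). Use increase() rather than rate() here, because rate() returns a per-second average and would need an extra factor of 3600. Set alert rules on the projected cost exceeding per-tenant thresholds.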

Exercise 30.6.4: Privacy-Safe Tracing (Conceptual)

Design a content redaction strategy for OTel traces in a healthcare chatbot application subject to HIPAA. Specify which span attributes should be captured, which should be redacted, and how to handle trace storage and retention.

Answer Sketch

Capture: model name, token counts, latency, error types, feature name. Redact: all prompt content, completion content, and user identifiers. Use an OTel Collector processor to strip sensitive attributes before export. Store traces in a HIPAA-compliant backend with encryption at rest and access logging. Set retention to the minimum required for operational debugging (7 to 30 days). Never include PHI in span attributes or events.
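The capture/redact split above can be expressed as an allowlist filter. The sketch below applies it to a plain dictionary of span attributes to show the logic only; the answer sketch places the real filtering in an OTel Collector processor, and the attribute keys and the redact_attributes helper here are illustrative assumptions.

```python
# Attributes considered safe to keep (illustrative keys, not a complete policy)
ALLOWED_ATTRIBUTES = {
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "error.type",
    "app.feature",
    "duration_ms",
}


def redact_attributes(attributes, allowed=ALLOWED_ATTRIBUTES):
    """Return a copy that keeps only allowlisted keys and drops everything else."""
    return {key: value for key, value in attributes.items() if key in allowed}


raw_attributes = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
    "duration_ms": 1430,
    "app.feature": "symptom-triage",
    "gen_ai.prompt": "Patient reports chest pain since Tuesday",  # PHI: must not leave the app
    "user.id": "patient-4821",  # identifier: must not leave the app
}

safe_attributes = redact_attributes(raw_attributes)
print(safe_attributes)
```

An allowlist fails closed: a newly added attribute is dropped until someone reviews it, whereas a denylist would export it by default.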

What Comes Next

In the next chapter, Chapter 62: LLMOps & Deployment Engineering, we move from observability to the deployment, scaling, and operational patterns that bring LLM applications to production. The OTel instrumentation you learned here becomes the foundation for monitoring those production systems.