Section 44.2: LLM Evaluation Dashboards

"A dashboard with twelve charts is a dashboard nobody reads. A dashboard with three charts is a dashboard somebody pages on at 3 AM. Choose."
A Dashboard-Designing AI Agent

Big Picture

Evaluating LLMs requires more than tracking a single loss metric. You need dashboards that surface generation quality, latency distributions, cost per request, safety violations, and drift over time. This section covers how to build LLM-specific evaluation dashboards in W&B and MLflow, instrument production systems for observability, track prompt-level metrics, and set up alerting for quality regressions. The techniques here connect experiment-time evaluation (covered in Section 19.13 (Libraries & Frameworks)) with production monitoring to create a continuous feedback loop.

Prerequisites

This section assumes familiarity with the model registry workflow from Section 44.1 and with evaluation metrics from Section 42.1 (BLEU, ROUGE, BERTScore, LLM-as-judge). Familiarity with prompt engineering from Section 12.1 helps when designing prompt-level dashboard widgets.

44.2.1 LLM Evaluation Metrics

Traditional ML models produce predictions that are easy to score with accuracy, F1, or AUC. LLM outputs are free-form text, which calls for a richer set of evaluation metrics. A comprehensive evaluation dashboard should track metrics across several dimensions.

Quality metrics include BLEU, ROUGE, BERTScore, and judge-model scores. Safety metrics capture toxicity rates, refusal accuracy, and policy violation counts. Operational metrics track latency (P50, P95, P99), throughput (tokens per second), and cost (dollars per 1,000 requests). Behavioral metrics measure hallucination rates, instruction-following accuracy, and format compliance. Logging all of these during evaluation runs creates the foundation for a useful dashboard.
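As a small illustration of the operational family, the sketch below derives throughput and cost per 1,000 requests from raw per-request records. The records and field names (latency_s, tokens_out, cost_usd) are made up for the example, and the throughput figure assumes requests were served one at a time.

```python
from statistics import mean

# Hypothetical per-request records; field names are illustrative only.
requests = [
    {"latency_s": 0.8, "tokens_out": 160, "cost_usd": 0.0020},
    {"latency_s": 1.2, "tokens_out": 300, "cost_usd": 0.0035},
    {"latency_s": 2.0, "tokens_out": 340, "cost_usd": 0.0041},
    {"latency_s": 1.0, "tokens_out": 200, "cost_usd": 0.0024},
]

def operational_summary(records):
    """Aggregate raw request records into dashboard-ready operational metrics."""
    total_tokens = sum(r["tokens_out"] for r in records)
    total_seconds = sum(r["latency_s"] for r in records)
    return {
        # Generated tokens divided by generation time (sequential requests assumed).
        "tokens_per_second": round(total_tokens / total_seconds, 1),
        # Mean cost per request scaled to the "per 1,000 requests" unit.
        "cost_per_1k_requests_usd": round(1000 * mean(r["cost_usd"] for r in records), 2),
    }

summary = operational_summary(requests)
print(summary)
```

Reporting cost per 1,000 requests rather than per request keeps the numbers in a range that is easy to read on a chart axis.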

[Figure: dashboard grid showing four metric categories (quality, safety, operational, and behavioral), each with example sub-metrics]

Figure 44.2.1: A useful LLM evaluation dashboard is split into four metric families. Optimizing only one (typically quality) hides regressions in the other three. The same run should write into all four panels so a deploy decision is informed by every axis at once.

import wandb
import numpy as np
from datetime import datetime

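def run_llm_evaluation(model, eval_dataset, run_name="eval"):
    """Run a comprehensive LLM evaluation and log to W&B.

    Assumes judge_model, safety_classifier, and check_output_format are
    defined elsewhere, and that model.generate_timed() returns
    (text, latency_in_seconds, output_token_count).
    """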
    run = wandb.init(
        project="llm-eval-dashboard",
        name=run_name,
        job_type="evaluation",
        config={
            "model": model.name,
            "eval_dataset": eval_dataset.name,
            "num_examples": len(eval_dataset),
            "timestamp": datetime.now().isoformat(),
        },
    )

    results = {"quality": [], "latency_ms": [], "tokens_out": []}
    safety_violations = 0
    format_failures = 0

    for example in eval_dataset:
        output, latency, token_count = model.generate_timed(example.prompt)

        # Score quality with a judge model
        quality = judge_model.score(
            prompt=example.prompt,
            response=output,
            reference=example.reference,
        )
        is_safe = safety_classifier.check(output)
        follows_format = check_output_format(output, example.expected_format)

        results["quality"].append(quality)
        results["latency_ms"].append(latency * 1000)
        results["tokens_out"].append(token_count)
        if not is_safe:
            safety_violations += 1
        if not follows_format:
            format_failures += 1

    # Log aggregate metrics
    wandb.log({
        "eval/quality_mean": np.mean(results["quality"]),
        "eval/quality_p25": np.percentile(results["quality"], 25),
        "eval/quality_median": np.median(results["quality"]),
        "eval/latency_p50": np.percentile(results["latency_ms"], 50),
        "eval/latency_p95": np.percentile(results["latency_ms"], 95),
        "eval/latency_p99": np.percentile(results["latency_ms"], 99),
        "eval/safety_violation_rate": safety_violations / len(eval_dataset),
        "eval/format_compliance": 1 - format_failures / len(eval_dataset),
        "eval/avg_tokens_out": np.mean(results["tokens_out"]),
    })

    run.finish()
    return results

Output:
wandb: eval/quality_mean 0.812
wandb: eval/quality_p25 0.640
wandb: eval/latency_p50 387.2 ms
wandb: eval/latency_p99 1842.0 ms
wandb: eval/safety_violation_rate 0.004
wandb: eval/format_compliance 0.971

Code Fragment 44.2.1a: A reusable evaluation harness that logs quality statistics (mean, P25, median), latency percentiles (P50/P95/P99), safety violation rate, and format compliance to W&B in one run. The P25 and P99 metrics expose tail behavior that means alone hide, as the next callout explains.

Notice we log percentiles, not just means. For latency, P99 matters far more than the average because it sets worst-case user experience. For quality, the P25 value reveals how the model performs on its hardest examples, which is often more actionable than the mean.

Key Insight

Mean quality scores hide bimodal distributions. A model that scores 0.95 on 80% of examples and 0.20 on the remaining 20% has a mean of 0.80, which looks acceptable. But one in five users gets a terrible response. Note that P25 is still 0.95 in this example, because the failures make up less than a quarter of the data, so P25 alone would miss them. Log several low percentiles (P5, P10, P25) alongside the median and P75, and inspect the bottom of the distribution to catch these failure modes.
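The arithmetic in this callout can be checked with synthetic scores. The sketch below uses a hand-written linear-interpolation percentile (the helper name percentile is ours; it follows the same rule as NumPy's default) so that it runs with the standard library alone.

```python
from statistics import mean

def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Synthetic judge scores: 80 good responses and 20 terrible ones.
scores = [0.95] * 80 + [0.20] * 20

summary = {
    "mean": round(mean(scores), 2),
    "p10": round(percentile(scores, 10), 2),
    "p25": round(percentile(scores, 25), 2),
    "median": round(percentile(scores, 50), 2),
}
print(summary)
# {'mean': 0.8, 'p10': 0.2, 'p25': 0.95, 'median': 0.95}
```

The mean, P25, and median all look healthy here. Only a percentile below the size of the failing group (P10 in this case) exposes the second mode, which is why the choice of low percentiles should reflect the smallest failure rate you care about.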

44.2.2 Prediction Tables and Qualitative Review

Aggregate metrics cannot tell you why a model fails. Prediction tables log individual examples so that you (or your domain experts) can review actual model outputs and identify patterns in failures.

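import wandb
from itertools import islice

def log_prediction_table(model, eval_dataset, max_rows=200):
    """Log a browsable table of model predictions to W&B.

    Assumes judge_model, safety_classifier, and classify_error are
    defined elsewhere.
    """
    columns = [
        "prompt", "reference", "generated", "quality_score",
        "latency_ms", "is_safe", "error_category",
    ]
    table = wandb.Table(columns=columns)

    # islice works for any iterable dataset, including ones without slicing
    for example in islice(eval_dataset, max_rows):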
        output, latency, _ = model.generate_timed(example.prompt)
        quality = judge_model.score(
            prompt=example.prompt,
            response=output,
            reference=example.reference,
        )
        is_safe = safety_classifier.check(output)

        # Categorize errors for filtering
        error_cat = "none"
        if quality < 0.5:
            error_cat = classify_error(example.prompt, output, example.reference)

        table.add_data(
            example.prompt,
            example.reference,
            output,
            round(quality, 3),
            round(latency * 1000, 1),
            is_safe,
            error_cat,
        )

    wandb.log({"prediction_table": table})

    # Also log a filtered table of failures only
    failure_table = wandb.Table(columns=columns)
    for row in table.data:
        if row[3] < 0.5:  # quality_score column
            failure_table.add_data(*row)
    wandb.log({"failure_table": failure_table})

Output:
wandb: Logged Table prediction_table (200 rows, 7 cols)
wandb: Logged Table failure_table (38 rows, 7 cols)
error_category distribution: hallucination=14, wrong_format=11, refusal=8, incomplete=5

Code Fragment 44.2.2: A two-table pattern: the full prediction_table is browsable in the W&B UI, while the filtered failure_table (quality_score < 0.5) plus the error_category column lets reviewers slice failures by failure mode rather than scrolling rows.

The error category column is especially valuable. Classifying failures into categories (hallucination, wrong format, incomplete response, refusal on a valid query, unsafe content) lets you filter the table and focus on specific failure types. Over time, tracking the distribution of error categories reveals whether your improvements address the right problems.
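One way to track that distribution is to reduce each run's error_category column to shares of total failures and compare runs side by side. The sketch below uses the standard library only; the second run's counts are invented for illustration.

```python
from collections import Counter

def category_shares(error_categories):
    """Return each failure category's share of all failures, ignoring 'none'."""
    counts = Counter(c for c in error_categories if c != "none")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(n / total, 3) for cat, n in counts.most_common()}

# Run A mirrors the counts in the output above; run B is hypothetical.
run_a = (["hallucination"] * 14 + ["wrong_format"] * 11
         + ["refusal"] * 8 + ["incomplete"] * 5 + ["none"] * 162)
run_b = (["hallucination"] * 6 + ["wrong_format"] * 12
         + ["refusal"] * 7 + ["incomplete"] * 5 + ["none"] * 170)

before, after = category_shares(run_a), category_shares(run_b)
for cat in before:
    print(f"{cat:14s} {before[cat]:.3f} -> {after.get(cat, 0.0):.3f}")
```

In this made-up comparison the hallucination share falls while wrong_format becomes the largest category, which would suggest that the next round of work should target output formatting.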

44.2.3 MLflow Evaluate for LLMs

MLflow provides a built-in mlflow.evaluate() function designed for LLM evaluation. It computes standard text metrics and logs results with full artifact support.

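import mlflow
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prepare evaluation data
eval_data = pd.DataFrame({
    "inputs": [
        "Summarize the key features of Python 3.12.",
        "Explain the difference between threads and processes.",
        "Write a haiku about machine learning.",
    ],
    "ground_truth": [
        "Python 3.12 introduces improved error messages...",
        "Threads share memory within a process...",
        "Data flowing through / Layers of abstraction deep / Patterns emerge clear",
    ],
})

def gpt4_predict(batch: pd.DataFrame) -> list:
    """Wrap the OpenAI gpt-4 endpoint as a callable for mlflow.evaluate()."""
    outputs = []
    for prompt in batch["inputs"]:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(response.choices[0].message.content)
    return outputs

# Run evaluation (MLflow 2.x API; the model argument takes a pyfunc model,
# a pyfunc model URI, an endpoints:/ URI, or a Python callable)
with mlflow.start_run(run_name="llm-eval-gpt4"):
    results = mlflow.evaluate(
        model=gpt4_predict,
        data=eval_data,
        targets="ground_truth",
        model_type="text",
        evaluators="default",
        extra_metrics=[
            # model_type="text" covers toxicity and readability by default,
            # not ROUGE, so request ROUGE-1 explicitly.
            mlflow.metrics.rouge1(),
            mlflow.metrics.latency(),
            mlflow.metrics.toxicity(),
            mlflow.metrics.flesch_kincaid_grade_level(),
        ],
    )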

    # Access aggregate metrics
    print(f"ROUGE-1: {results.metrics['rouge1/v1/mean']:.3f}")
    print(f"Toxicity: {results.metrics['toxicity/v1/mean']:.4f}")

    # Access per-row results as a DataFrame
    per_row = results.tables["eval_results_table"]
    print(per_row[["inputs", "outputs", "rouge1/v1/score"]].head())

Output:
ROUGE-1: 0.412
Toxicity: 0.0023
                                              inputs  ...  rouge1/v1/score
0         Summarize the key features of Python 3.12.  ...            0.487
1  Explain the difference between threads and pro...  ...            0.391
2              Write a haiku about machine learning.  ...            0.358

Code Fragment 44.2.3: mlflow.evaluate() runs gpt-4 against three reference rows and returns both aggregate metrics (rouge1/v1/mean) and a per-row table. The extra_metrics list is the one-line pattern for layering additional gauges such as mlflow.metrics.latency() on top of the defaults that the chosen model_type provides.

The mlflow.evaluate() API handles the boilerplate of running the model on each input, computing metrics, and logging everything to the tracking server. Custom metrics can be added through the extra_metrics parameter, and custom evaluator functions let you plug in judge-model scoring or domain-specific checks.

Tip

For LLM evaluation at scale, combine mlflow.evaluate() for standardized metrics with custom judge-model scoring for nuanced quality assessment. Use the built-in metrics (ROUGE, toxicity, readability) as fast sanity checks and reserve expensive judge-model calls for a curated subset of high-stakes examples.
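One possible way to pick that curated subset is sketched below: every example tagged high-stakes is sent to the judge first, then the examples with the lowest cheap score, then a random control sample. The high_stakes flag, the rouge1 field, and the half-and-half split are assumptions made for this illustration, not part of MLflow.

```python
import random

def select_for_judge(examples, budget, seed=0):
    """Choose which examples receive an expensive judge-model call."""
    rng = random.Random(seed)
    high_stakes = [e for e in examples if e["high_stakes"]]
    rest = sorted(
        (e for e in examples if not e["high_stakes"]),
        key=lambda e: e["rouge1"],  # lowest cheap score first
    )
    chosen = high_stakes[:budget]
    remaining = budget - len(chosen)
    n_low = remaining // 2
    chosen += rest[:n_low]  # most suspicious by the cheap metric
    pool = rest[n_low:]
    # Random control group, so judge scores are not biased toward failures only.
    chosen += rng.sample(pool, min(remaining - n_low, len(pool)))
    return chosen

# Synthetic examples with a fake cheap score and a high-stakes tag.
examples = [
    {"id": i, "rouge1": (i * 37 % 100) / 100, "high_stakes": i % 10 == 0}
    for i in range(40)
]
subset = select_for_judge(examples, budget=10)
print(sorted(e["id"] for e in subset))
```

The random control group matters: without it, the judge only ever sees examples the cheap metric already dislikes, and you cannot measure how often the cheap metric misses a real problem.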

44.2.4 Production Observability with W&B Weave

Experiment-time evaluation tells you how a model performs on a fixed dataset. Production observability tells you how it performs on real user traffic. W&B Weave provides tracing and logging for LLM applications in production, capturing every call along with inputs, outputs, latency, token counts, and costs.

# Weave's @op decorator traces every call: inputs, outputs, latency,
# token counts, and cost; nested ops form a parent-child trace tree.
import time
import weave
import openai

weave.init("llm-prod-observability")
client = openai.OpenAI()

@weave.op()
def generate_response(prompt: str, system_message: str) -> dict:
    start_time = time.time()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
        max_tokens=512,
    )
    return {
        "output": response.choices[0].message.content,
        "tokens_in": response.usage.prompt_tokens,
        "tokens_out": response.usage.completion_tokens,
        "latency_s": time.time() - start_time,
        "model": "gpt-4",
        "cost_usd": compute_cost(response.usage),
    }

@weave.op()
def rag_pipeline(query: str) -> dict:
    # Each nested @weave.op call is traced as a child span
    docs = retrieve_documents(query, top_k=5)
    context = format_context(docs)
    system_msg = f"Answer using this context:\n{context}"
    response = generate_response(query, system_msg)
    response["num_docs_retrieved"] = len(docs)
    return response

Output:
View trace at https://wandb.ai/llm-prod-observability/r/abc123
rag_pipeline (1.42s)
+- retrieve_documents (0.18s, 5 docs)
+- format_context (0.01s)
+- generate_response (1.22s, 287 tok)

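Code Fragment 44.2.4: Decorating generate_response and rag_pipeline with @weave.op() captures inputs, outputs, and latency at every level; token usage and cost appear because the function returns them explicitly. Because rag_pipeline calls generate_response, Weave nests them in the trace tree shown in the output, which is the unit of diagnosis when a RAG answer goes wrong. The retrieve_documents and format_context helpers are not shown and appear as child spans only if they are decorated as well.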

The @weave.op() decorator automatically captures the function's inputs, outputs, and execution time. When functions call other decorated functions, Weave builds a trace tree that shows the full execution path. This is invaluable for debugging RAG pipelines (see Chapter 32) where a bad answer could stem from poor retrieval, bad context formatting, or a model generation issue.

44.2.5 Drift Detection and Alerting

Model quality degrades over time as user behavior shifts, data distributions change, and the world moves on from the training data. Drift detection compares current production metrics against a baseline and alerts you when significant deviations occur.

import numpy as np
from scipy import stats

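class DriftDetector:
    """Monitor LLM metrics for distribution shifts."""

    def __init__(self, baseline_metrics: dict, alert_threshold: float = 0.01,
                 lower_is_better: tuple = ("latency_ms",)):
        self.baseline = baseline_metrics
        self.threshold = alert_threshold
        # Metrics where a higher mean is a regression (latency, cost, ...)
        self.lower_is_better = set(lower_is_better)
        self.alerts = []

    def check_drift(self, current_metrics: dict, window_name: str):
        """Compare current window against baseline using KS test."""
        for metric_name in self.baseline:
            if metric_name not in current_metrics:
                continue

            baseline_vals = np.array(self.baseline[metric_name])
            current_vals = np.array(current_metrics[metric_name])

            # Kolmogorov-Smirnov test for distribution shift
            ks_stat, p_value = stats.ks_2samp(baseline_vals, current_vals)

            if p_value >= self.threshold:
                print(f"no shift detected for {metric_name} (p={p_value:.3f})")
                continue

            # Whether a shift is a regression depends on the metric: a higher
            # mean is worse for latency but better for quality and safety.
            baseline_mean = float(np.mean(baseline_vals))
            current_mean = float(np.mean(current_vals))
            if metric_name in self.lower_is_better:
                degraded = current_mean > baseline_mean
            else:
                degraded = current_mean < baseline_mean

            alert = {
                "metric": metric_name,
                "window": window_name,
                "ks_statistic": round(float(ks_stat), 4),
                "p_value": round(float(p_value), 6),
                "baseline_mean": round(baseline_mean, 4),
                "current_mean": round(current_mean, 4),
                "shift_direction": "degraded" if degraded else "improved",
            }
            self.alerts.append(alert)
            print(f"DRIFT ALERT: {metric_name} has shifted "
                  f"({alert['shift_direction']}), p={p_value:.6f}")

        # Alerts accumulate across calls, so this includes earlier windows
        return self.alerts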

# Usage: compare weekly production metrics to baseline
detector = DriftDetector(
    baseline_metrics={
        "quality_score": baseline_quality_scores,
        "latency_ms": baseline_latency_values,
        "safety_score": baseline_safety_scores,
    },
    alert_threshold=0.01,
)

alerts = detector.check_drift(
    current_metrics=this_weeks_metrics,
    window_name="2026-W14",
)

Output:
DRIFT ALERT: quality_score has shifted (degraded), p=0.003118
DRIFT ALERT: latency_ms has shifted (degraded), p=0.000412
no shift detected for safety_score (p=0.244)

Code Fragment 44.2.5: A DriftDetector that runs a Kolmogorov-Smirnov two-sample test (scipy.stats.ks_2samp) per metric against a stored baseline. Unlike a mean-difference check, the KS test compares whole distributions, so it can raise an alert when the shape changes while the mean stays flat. It is least sensitive in the extreme tails, so keep a separate threshold on P99 latency.

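The Kolmogorov-Smirnov test compares the full empirical distributions, not just their means. This catches shifts where the average stays stable but the shape changes, for example when part of the traffic slows down while the rest speeds up. The KS statistic is the largest gap between the two empirical CDFs, so a change confined to the top 1% or 2% of values can move it by at most 0.01 or 0.02, which is usually too small to reach significance. If P99 latency doubles while the median is unchanged, the KS test may therefore stay silent; pair it with a direct threshold on P99 to catch extreme-tail regressions.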
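One caveat is worth checking on your own data: how strongly the KS statistic responds depends on how much of the distribution moves, not on how far the tail moves. The sketch below re-implements the two-sample statistic in pure Python for illustration (in practice scipy.stats.ks_2samp is the tool to use) and compares it against the standard large-sample critical value at alpha = 0.01. The latency values are synthetic.

```python
import math
from bisect import bisect_right
from statistics import mean

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_critical(n, m, alpha=0.01):
    """Large-sample critical value for the two-sample KS statistic."""
    return math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))

# Synthetic latencies in milliseconds.
baseline = [100.0 + i for i in range(1000)]
# Only the slowest 2% of requests double: P99 roughly doubles.
tail_only = baseline[:980] + [2 * v for v in baseline[980:]]
# Half the requests get faster and half get slower: the mean barely moves.
reshaped = [v * (0.5 if i % 2 == 0 else 1.5) for i, v in enumerate(baseline)]

threshold = ks_critical(len(baseline), len(baseline))
for name, current in [("tail_only", tail_only), ("reshaped", reshaped)]:
    d = ks_statistic(baseline, current)
    print(f"{name}: mean {mean(baseline):.1f} -> {mean(current):.1f}, "
          f"D={d:.3f}, flagged={d > threshold}")
```

In this synthetic setup the reshaped sample is flagged even though its mean is almost identical to the baseline, while the tail-only regression is not flagged even though its worst requests are twice as slow. A dedicated P99 threshold covers that second case.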

Warning

Drift detection on LLM quality metrics is noisier than on traditional ML metrics because quality scores from judge models are themselves imperfect. Set alert thresholds conservatively (p < 0.01 rather than p < 0.05) and require alerts to persist across multiple consecutive windows before triggering a retraining workflow. One bad week could be an anomaly; two consecutive bad weeks is a trend.
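The persistence rule from this warning can be sketched as a small streak counter that sits between the drift test and the retraining trigger. The class name DriftStreaks and the weekly data are hypothetical; the input per window is simply the set of metric names that the drift test flagged.

```python
class DriftStreaks:
    """Escalate a metric only after `required` consecutive drifting windows."""

    def __init__(self, required=2):
        self.required = required
        self.streaks = {}

    def update(self, drifted, all_metrics):
        """Record one window and return the metrics that should escalate."""
        escalate = []
        for metric in all_metrics:
            if metric in drifted:
                self.streaks[metric] = self.streaks.get(metric, 0) + 1
            else:
                self.streaks[metric] = 0  # one clean window resets the streak
            if self.streaks[metric] >= self.required:
                escalate.append(metric)
        return escalate

METRICS = ["quality_score", "latency_ms", "safety_score"]
weekly = [
    ("2026-W12", {"quality_score"}),
    ("2026-W13", set()),
    ("2026-W14", {"quality_score", "latency_ms"}),
    ("2026-W15", {"quality_score"}),
]

tracker = DriftStreaks(required=2)
history = [tracker.update(drifted, METRICS) for _, drifted in weekly]
for (week, _), escalated in zip(weekly, history):
    print(week, escalated)
# 2026-W12 []
# 2026-W13 []
# 2026-W14 []
# 2026-W15 ['quality_score']
```

The isolated alert in W12 and the single latency blip in W14 never escalate; only quality_score, which drifts in both W14 and W15, reaches the retraining workflow.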

44.2.6 Cost Tracking and Budget Dashboards

For applications that call commercial LLM APIs, cost tracking is as important as quality tracking. A runaway prompt or an unexpectedly verbose system message can double your inference bill overnight. Log cost metrics alongside quality metrics so that cost-quality tradeoffs are visible on the same dashboard.

import wandb
from collections import defaultdict

class CostTracker:
    """Track LLM API costs per model, per feature, per time window."""

    PRICING = {  # USD per 1M tokens (as of early 2026)
        "gpt-4": {"input": 30.0, "output": 60.0},
        "gpt-4o": {"input": 2.50, "output": 10.0},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
    }

    def __init__(self):
        self.costs = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})

    def record(self, model: str, feature: str, input_tokens: int, output_tokens: int):
        key = f"{model}/{feature}"
        self.costs[key]["input_tokens"] += input_tokens
        self.costs[key]["output_tokens"] += output_tokens

    def compute_costs(self) -> dict:
        summary = {}
        total = 0.0
        for key, tokens in self.costs.items():
            model = key.split("/")[0]
            pricing = self.PRICING.get(model, {"input": 0, "output": 0})
            cost = (
                tokens["input_tokens"] * pricing["input"] / 1_000_000
                + tokens["output_tokens"] * pricing["output"] / 1_000_000
            )
            summary[key] = {
                "input_tokens": tokens["input_tokens"],
                "output_tokens": tokens["output_tokens"],
                "cost_usd": round(cost, 4),
            }
            total += cost
        summary["total_cost_usd"] = round(total, 4)
        return summary

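    def log_to_wandb(self, window_name: str):
        costs = self.compute_costs()
        # Build one payload so every cost series lands on the same W&B step
        payload = {
            f"cost/{k}": v["cost_usd"]
            for k, v in costs.items()
            if isinstance(v, dict)
        }
        payload["cost/total_usd"] = costs["total_cost_usd"]
        payload["cost/window"] = window_name
        wandb.log(payload)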

Output:
cost/gpt-4o/summarization 24.6210
cost/gpt-4o-mini/classification 0.8742
cost/claude-sonnet-4-20250514/qa 12.3055
cost/total_usd 37.8007

Code Fragment 44.2.6: The model/feature key shape (e.g. gpt-4o/summarization) is what makes cost panels actionable: a single feature dominating the bill is immediately visible. Updating the PRICING table is the only maintenance task as vendor prices change.

Break costs down by model and by feature (e.g., "gpt-4/summarization" vs. "gpt-4o-mini/classification"). This granularity reveals optimization opportunities. You might discover that 80% of your cost comes from a single feature that could work equally well with a smaller model.
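A quick way to spot a dominant feature is to collapse the model/feature keys to features and rank them by share of spend. The sketch below reuses the dollar amounts from the output above; the helper name feature_shares is ours.

```python
from collections import defaultdict

def feature_shares(cost_by_key):
    """Sum model/feature costs per feature and return shares, largest first."""
    by_feature = defaultdict(float)
    for key, cost in cost_by_key.items():
        _model, feature = key.split("/", 1)
        by_feature[feature] += cost
    total = sum(by_feature.values())
    ranked = sorted(by_feature.items(), key=lambda kv: kv[1], reverse=True)
    return [(feature, round(cost / total, 3)) for feature, cost in ranked]

costs = {
    "gpt-4o/summarization": 24.6210,
    "gpt-4o-mini/classification": 0.8742,
    "claude-sonnet-4-20250514/qa": 12.3055,
}

shares = feature_shares(costs)
for feature, share in shares:
    print(f"{feature:15s} {share:.1%}")
```

With these numbers summarization accounts for roughly two thirds of spend, so it is the first candidate for a trial on a cheaper model, checked against the quality panel before switching.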

44.2.7 Building a Unified Dashboard

The goal is a single dashboard that shows quality, safety, latency, and cost together. W&B workspaces and MLflow's UI both support custom dashboard layouts. The key is to define standard panels that every LLM project includes.

A recommended dashboard layout for LLM applications includes four sections. The Quality panel shows evaluation scores over time (mean, P25, P75), error category breakdown, and a link to the latest prediction table. The Safety panel displays toxicity rates, refusal accuracy, and policy violation counts. The Performance panel tracks latency percentiles (P50, P95, P99), throughput, and error rates. The Cost panel presents daily and cumulative spend, broken down by model and feature.

Connecting these panels to alerting (via W&B Alerts, PagerDuty, or Slack webhooks) closes the observability loop. When a metric crosses a threshold, the team is notified, and the dashboard provides the context needed to diagnose the issue. The prediction table lets reviewers inspect specific failures without re-running the evaluation. The drift detector determines whether the problem is a transient spike or a sustained trend.
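A minimal sketch of the threshold step is shown below, assuming a hand-maintained rule list. The limits and the dashboard URL are placeholders, and the function only builds the JSON body (a single text field, the shape Slack incoming webhooks accept) without sending it.

```python
import json
import operator

# Hypothetical thresholds; tune them against your own baselines.
RULES = [
    {"metric": "eval/quality_p25", "op": "<", "limit": 0.60},
    {"metric": "eval/latency_p99", "op": ">", "limit": 2000.0},
    {"metric": "eval/safety_violation_rate", "op": ">", "limit": 0.01},
]
COMPARE = {"<": operator.lt, ">": operator.gt}

def breached_rules(metrics, rules=RULES):
    """Return the rules whose threshold the latest metrics cross."""
    return [
        r for r in rules
        if r["metric"] in metrics
        and COMPARE[r["op"]](metrics[r["metric"]], r["limit"])
    ]

def alert_payload(breaches, metrics, dashboard_url):
    """Build a webhook body that names each breach and links to the dashboard."""
    lines = [
        f"{r['metric']} = {metrics[r['metric']]} (alerts when {r['op']} {r['limit']})"
        for r in breaches
    ]
    text = "LLM dashboard alert\n" + "\n".join(lines) + "\n" + dashboard_url
    return json.dumps({"text": text})

healthy = {"eval/quality_p25": 0.640, "eval/latency_p99": 1842.0,
           "eval/safety_violation_rate": 0.004}
regressed = {"eval/quality_p25": 0.52, "eval/latency_p99": 2310.0,
             "eval/safety_violation_rate": 0.004}

for name, metrics in [("healthy", healthy), ("regressed", regressed)]:
    hits = breached_rules(metrics)
    if hits:
        print(alert_payload(hits, metrics, "https://example.com/dashboard"))
    else:
        print(f"{name}: no alert")
```

Including the dashboard link in the message body is the part that closes the loop: whoever receives the alert lands directly on the panels that explain it.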

44.2.8 Putting It All Together

The evaluation and observability workflow forms a continuous cycle. During development, you run evaluations against curated datasets and log results to your experiment tracker. When a model passes validation gates (see Section 66.2 (Model Registry and Deployment Workflows)), it enters production with full tracing enabled. Production traces feed back into evaluation datasets as interesting edge cases from live traffic become new test examples. Drift detection monitors for quality regressions and triggers re-evaluation when thresholds are breached. Teams that invest in this cycle catch regressions in hours instead of weeks, optimize costs with data instead of guesswork, and build a growing library of evaluation examples drawn from real-world usage.
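The feedback step (production traces becoming test examples) might look like the sketch below. The trace fields (trace_id, prompt, quality, user_flagged), the 0.5 cutoff, and the simple lowercase de-duplication are all assumptions for illustration; a real pipeline would also need review and reference answers before a trace becomes an evaluation example.

```python
def harvest_edge_cases(traces, eval_prompts, quality_cutoff=0.5, limit=50):
    """Pick low-quality or user-flagged traces that the eval set lacks."""
    seen = {p.strip().lower() for p in eval_prompts}
    picked = []
    # Worst-scoring traces first, so the limit keeps the most severe cases.
    for trace in sorted(traces, key=lambda t: t["quality"]):
        key = trace["prompt"].strip().lower()
        interesting = trace["quality"] < quality_cutoff or trace["user_flagged"]
        if interesting and key not in seen:
            seen.add(key)
            picked.append({
                "prompt": trace["prompt"],
                "source": "production",
                "trace_id": trace["trace_id"],
            })
        if len(picked) == limit:
            break
    return picked

# Hypothetical production traces.
traces = [
    {"trace_id": "t1", "prompt": "Summarize this contract", "quality": 0.9, "user_flagged": False},
    {"trace_id": "t2", "prompt": "Translate to Klingon", "quality": 0.2, "user_flagged": False},
    {"trace_id": "t3", "prompt": "translate to klingon ", "quality": 0.3, "user_flagged": True},
    {"trace_id": "t4", "prompt": "What is our refund policy?", "quality": 0.8, "user_flagged": True},
    {"trace_id": "t5", "prompt": "Write a haiku about machine learning.", "quality": 0.1, "user_flagged": False},
]
existing = ["Write a haiku about machine learning."]

new_examples = harvest_edge_cases(traces, existing)
print([e["trace_id"] for e in new_examples])
# ['t2', 't4']
```

Here t5 is skipped because its prompt is already in the evaluation set, t3 is skipped as a near-duplicate of t2, and t1 is skipped because nothing marks it as interesting.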

Fun Note: The Bedrock Pickaxe of 2024

W&B Weave's @op decorator was originally meant for plain Python functions in Jupyter notebooks. In a 2024 AI Engineer Summit lightning talk, a Shopify engineer joked that Weave had become "the pickaxe of the LLM gold rush" because the same decorator that traced a textbook function call ended up wrapping every retrieve / rerank / generate step in their production Magic stack. The internal motto on the observability team was reportedly "if it's a function and it costs us tokens, slap an @op on it"; by year-end their dashboard alone counted 14,000 traced ops per minute, which is several orders of magnitude more activity than the tool's designers had in mind.

What's Next

Dashboards tell you what is happening, but they do not, by themselves, decide whether a change is significant or whether the underlying distribution has shifted. Continue to Section 44.3: Observability, Monitoring, and Drift Detection to add the statistical machinery that turns charts into alerts.

Further Reading

Dashboard and Observability Foundations

Weights & Biases (2025). "W&B for LLMs." docs.wandb.ai/guides/integrations/openai. Reference for LLM-specific W&B logging including prompt and judge-score panels.

MLflow (2024). "MLflow LLM Evaluation." mlflow.org/docs/latest/llms/llm-evaluate. Reference for MLflow's LLM evaluation harness and dashboard widgets.

LLM-Specific Evaluation

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The reference paper on LLM-as-judge evaluation; the metric source for many production dashboards.

Liu, Y., et al. (2023). "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP 2023. arXiv:2303.16634. Reference for chain-of-thought judge-prompting; informs production evaluation pipelines.