Section 62.2: LLMOps & Continuous Improvement

The best LLM system is the one that gets better every day without anyone noticing it changed.
Deploy, Silently Evolving AI Agent

Big Picture

LLMOps extends MLOps with practices specific to language model applications. Prompts are code that must be versioned. Model behavior must be tested in production through A/B experiments with statistical rigor. User feedback must flow back into evaluation datasets, fine-tuning data, and prompt improvements to create a continuously improving system. Building on the evaluation practices from Chapter 42 and the observability tools from Section 42.6, this section covers the operational practices that separate prototype LLM apps from production-grade systems that improve over time.

Prerequisites

Before starting, make sure you are familiar with the production fundamentals from Section 62.1. The evaluation metrics covered earlier in the book provide the measurement framework that LLMOps builds upon. Application-architecture and deployment patterns are revisited in detail later in the book.

62.2.1 Prompt Versioning

Production Pattern: One-Command Rollback

When: always. If you cannot roll back, you cannot ship safely.

How: treat (model_id, model_version, prompt_id, prompt_version, retriever_index_version) as a single deployable unit, versioned in a manifest. A deploy publishes the manifest to a feature flag or config store; rollback flips the pointer to the previous manifest in under 30 seconds.

Watch for: stateful side effects that survive rollback (a retriever index migrated forward cannot be rolled back without dual-write, see Pattern P2 elsewhere in this chapter).

Also: bake "rollback rehearsal" into your runbook quarterly. The first time you try to roll back at 3 a.m. is the worst possible time to discover that the previous manifest is missing a now-required field.
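The manifest-pointer idea can be sketched in a few lines. This is a minimal illustration, not a production design: DeployManifest and ManifestStore are hypothetical names, the store is held in memory, and a real system would keep the pointer in a feature flag or config service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeployManifest:
    """One deployable unit: all five fields ship and roll back together."""
    model_id: str
    model_version: str
    prompt_id: str
    prompt_version: str
    retriever_index_version: str


class ManifestStore:
    """Hypothetical in-memory stand-in for a feature-flag or config store."""

    def __init__(self) -> None:
        self._history: List[DeployManifest] = []
        self._active: int = -1  # index of the live manifest, -1 if none

    def deploy(self, manifest: DeployManifest) -> None:
        # Discard manifests that were rolled back past, then publish the new one.
        self._history = self._history[: self._active + 1]
        self._history.append(manifest)
        self._active = len(self._history) - 1

    def rollback(self) -> DeployManifest:
        # Rollback only moves the pointer. It does not undo side effects such
        # as a retriever index that has already been migrated forward.
        if self._active <= 0:
            raise RuntimeError("no previous manifest to roll back to")
        self._active -= 1
        return self._history[self._active]

    @property
    def active(self) -> DeployManifest:
        if self._active < 0:
            raise RuntimeError("nothing deployed yet")
        return self._history[self._active]


store = ManifestStore()
store.deploy(DeployManifest("summarizer", "2024-07", "summarize", "v1", "idx-7"))
store.deploy(DeployManifest("summarizer", "2024-07", "summarize", "v2", "idx-7"))
store.rollback()
print(store.active.prompt_version)  # v1
```

Because the manifest is immutable and rollback is a pointer move, a rollback rehearsal amounts to calling rollback() in staging and checking that the previous manifest still validates.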

Production Pattern P10: Shadow-Eval Canary Deploys

What it is: After each prompt or model change, route 5% of live traffic to BOTH old and new versions for 24h. Run an LLM-as-judge on the paired outputs. Alert if the new version loses on more than 15% of comparisons. Auto-promote if the new version wins by >10% with statistical significance.

When not to use it: Pre-launch or single-developer projects with no users yet. Canary infrastructure has setup cost; pay it once you have enough traffic that 5% provides statistical signal in a day.
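The alert and auto-promote rules for Pattern P10 can be expressed as a small decision function. The sketch below rests on two assumptions that the pattern leaves open: "wins by >10%" is read as win rate minus loss rate over all comparisons, and significance is checked with a two-sided exact sign test on the non-tied pairs. The names sign_test_p_value and canary_decision are hypothetical.

```python
from math import comb
from typing import List


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test: is the win/loss split consistent with 50/50?"""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def canary_decision(verdicts: List[str], loss_alert: float = 0.15,
                    win_margin: float = 0.10, alpha: float = 0.05) -> str:
    """Each verdict is the judge's pick for one paired comparison: 'new', 'old', or 'tie'."""
    total = len(verdicts)
    if total == 0:
        return "hold"
    wins = verdicts.count("new")
    losses = verdicts.count("old")
    if losses / total > loss_alert:
        return "alert"  # new version loses too often
    margin = (wins - losses) / total
    if margin > win_margin and sign_test_p_value(wins, losses) < alpha:
        return "promote"
    return "hold"  # not enough evidence either way
```

Ties are excluded from the sign test but still count in the denominators, so a new version that mostly ties with the old one is held rather than promoted.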

Production Pattern P5: Soft-Failure Guard

What it is: Track outcomes that look like success at the protocol level (HTTP 200, parseable JSON, non-empty result) but are still business-failures (wrong answer, empty list, hallucinated content). Each path through the system has a "what counts as soft failure here" definition that gets monitored independently of the hard-failure rate.

When not to use it: Pure infrastructure services with no semantic notion of success/failure (a load balancer, a CDN). Soft-failure tracking requires a domain model.
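A minimal sketch of the soft-failure guard, assuming JSON responses and per-path check functions. SoftFailureGuard, empty_results, and the "results" field are hypothetical names chosen for illustration; real checks would encode your own domain model.

```python
import json
from typing import Callable, Dict, List, Optional

# A soft-failure check returns a reason string, or None if the payload looks fine.
SoftCheck = Callable[[dict], Optional[str]]


def empty_results(payload: dict) -> Optional[str]:
    """Example check: a 200 response whose result list is empty."""
    return "empty_list" if not payload.get("results") else None


class SoftFailureGuard:
    """Counts hard failures, soft failures, and successes separately per path."""

    def __init__(self, checks: Dict[str, List[SoftCheck]]) -> None:
        self.checks = checks
        self.counts: Dict[str, Dict[str, int]] = {}

    def record(self, path: str, status: int, body: str) -> str:
        outcome = self._classify(path, status, body)
        bucket = self.counts.setdefault(path, {})
        bucket[outcome] = bucket.get(outcome, 0) + 1
        return outcome

    def _classify(self, path: str, status: int, body: str) -> str:
        # Protocol-level problems are hard failures.
        if status != 200:
            return "hard_failure"
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return "hard_failure"
        if not isinstance(payload, dict):
            return "hard_failure"
        # The response is well-formed; now apply this path's business checks.
        for check in self.checks.get(path, []):
            reason = check(payload)
            if reason:
                return f"soft_failure:{reason}"
        return "ok"
```

Keeping the counters keyed by path lets each route carry its own definition of soft failure and its own alert threshold, independent of the hard-failure rate.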

Catalogue: See the full discussion + variants in this chapter's LLMOps coverage (Pattern P5).

62.2.2 A/B Testing Framework

Figure 62.2.2 illustrates the end-to-end A/B testing pipeline, from hash-based traffic splitting through to online metric collection and statistical evaluation.

Figure 62.2.2: A/B testing pipeline for LLM prompt variants with hash-based traffic splitting and online metric collection.

import hashlib
from dataclasses import dataclass


@dataclass
class ABExperiment:
    """Simple A/B test for prompt variants."""
    name: str
    variant_a: str
    variant_b: str
    traffic_split: float = 0.5  # fraction of users going to variant B

    def assign(self, user_id: str) -> str:
        """Deterministic assignment based on user ID hash."""
        # Salting with the experiment name keeps assignments independent across experiments.
        h = hashlib.md5(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) / 0xFFFFFFFF  # map the first 32 bits to [0, 1]
        if bucket < self.traffic_split:
            return "B"
        return "A"

    def get_prompt(self, user_id: str) -> str:
        variant = self.assign(user_id)
        return self.variant_a if variant == "A" else self.variant_b


exp = ABExperiment(
    name="summarizer_prompt",
    variant_a="Summarize the following text:\n{text}",
    variant_b="Write a 2-sentence summary:\n{text}",
)

for uid in ["user_101", "user_202", "user_303"]:
    print(f"{uid} -> variant {exp.assign(uid)}")

Output:
user_101 -> variant A
user_202 -> variant B
user_303 -> variant A

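Code Fragment 62.2.2a: ABExperiment hashes the experiment name and user ID to assign each user deterministically to a variant (assign), then returns the matching prompt template (get_prompt).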
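Before trusting an experiment, it is worth checking that the hash buckets really produce the intended split and that assignment is stable. The sketch below uses a standalone copy of the same hash-bucket logic; assign_variant and observed_split are hypothetical helper names, and the synthetic user IDs are only for illustration.

```python
import hashlib


def assign_variant(experiment: str, user_id: str, traffic_split: float = 0.5) -> str:
    """Standalone copy of the hash-bucket logic used by ABExperiment.assign."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "B" if bucket < traffic_split else "A"


def observed_split(experiment: str, n_users: int, traffic_split: float) -> float:
    """Fraction of synthetic users that land in variant B."""
    hits = sum(
        assign_variant(experiment, f"user_{i}", traffic_split) == "B"
        for i in range(n_users)
    )
    return hits / n_users


# Same inputs always give the same variant, across processes and sessions.
assert assign_variant("summarizer_prompt", "user_101") == assign_variant(
    "summarizer_prompt", "user_101"
)

# With a 20% split, the observed share of variant B should be close to 0.2.
share_b = observed_split("summarizer_prompt", 20_000, traffic_split=0.2)
print(f"observed share of B: {share_b:.3f}")
```

A large gap between the configured and observed split in production logs usually points to a bug in assignment or logging, and should be investigated before any metric comparison is read.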

62.2.3 Online Evaluation and Feedback Loops

This snippet implements an online evaluation pipeline that collects user feedback and logs quality metrics in production.

from dataclasses import dataclass, field
from datetime import datetime, timezone
import statistics


@dataclass
class FeedbackCollector:
    """Collect and aggregate user feedback for LLM outputs."""
    records: list = field(default_factory=list)

    def log(self, request_id: str, variant: str, rating: int,
            feedback_text: str = "", latency_ms: float = 0):
        self.records.append({
            "request_id": request_id, "variant": variant,
            "rating": rating, "feedback": feedback_text,
            "latency_ms": latency_ms,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def summary(self):
        # Group ratings by variant first, then aggregate once all records are seen.
        by_variant = {}
        for r in self.records:
            v = r["variant"]
            by_variant.setdefault(v, []).append(r["rating"])
        return {
            v: {"mean": statistics.mean(ratings), "n": len(ratings)}
            for v, ratings in by_variant.items()
        }

Code Fragment 62.2.3: This FeedbackCollector class captures structured user feedback (a rating, optional free-text feedback, and latency) alongside the variant that produced each response. The summary method aggregates ratings by variant, enabling data-driven comparison of prompt versions. This feedback data also feeds the data flywheel for continuous improvement.

Tip

Set a minimum sample size before drawing conclusions from A/B test results. A common mistake is stopping a test after 50 queries because variant B "looks 8% better." With small samples, random fluctuation easily produces 8% swings. As rough floors, use at least 200 observations per variant for rating-based metrics, and 500 or more per variant for binary metrics like thumbs-up rate. These floors only give about 80% statistical power for fairly large effects; the sample size you actually need depends on the smallest effect you want to detect and on the variance of the metric.
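The required sample size can be estimated before the test starts. The sketch below uses the standard normal-approximation formulas for comparing two proportions and two means with a two-sided test; the function names are hypothetical, and the formulas assume independent observations and equal group sizes.

```python
from math import ceil
from statistics import NormalDist


def n_per_variant_proportion(p_a: float, p_b: float,
                             alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples per variant to detect p_a vs p_b on a binary metric."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


def n_per_variant_mean(sd: float, min_diff: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples per variant to detect a difference in means of min_diff."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / min_diff) ** 2)


# Thumbs-up rate: detect a lift from 60% to 65%.
print(n_per_variant_proportion(0.60, 0.65))

# 1-5 rating with an assumed standard deviation of 1.0: detect a 0.3-point shift.
print(n_per_variant_mean(sd=1.0, min_diff=0.3))
```

Under these assumptions, detecting a 5-point lift on a 60% baseline takes on the order of 1,500 users per variant, which is why a test stopped at 50 queries says very little.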

Over time, the feedback from A/B tests feeds into a continuous improvement loop. Figure 62.2.3b depicts this data flywheel, where user interactions generate feedback that becomes curated evaluation data for model improvement.

Data flywheel cycle: user interactions generate feedback and logs that become curated eval data, which trains an improved model, which feeds back into user interactions, creating a self-improving production loop.

Figure 62.2.3b: The data flywheel turns production usage into training data, creating a self-improving cycle.
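The curation step of the flywheel can be sketched as a routing function over feedback records. This is an illustration under stated assumptions: each record carries a 1-5 "rating" and an optional "correction" field holding a human-edited output, and the thresholds are arbitrary defaults. The function name, field names, and dataset names are all hypothetical.

```python
from typing import Dict, List


def curate_feedback(records: List[dict], high: int = 4,
                    low: int = 2) -> Dict[str, List[dict]]:
    """Route feedback records into the datasets that the flywheel feeds."""
    curated: Dict[str, List[dict]] = {
        "eval_gold": [],      # human-corrected outputs, usable as references
        "fine_tuning": [],    # highly rated outputs
        "review_queue": [],   # low-rated outputs that need a human look
    }
    for rec in records:
        if rec.get("correction"):
            # A human correction is the strongest signal: keep it as gold data.
            curated["eval_gold"].append(rec)
        elif rec["rating"] >= high:
            curated["fine_tuning"].append(rec)
        elif rec["rating"] <= low:
            curated["review_queue"].append(rec)
        # Mid-range ratings without a correction carry little signal and are skipped.
    return curated


batch = [
    {"request_id": "r1", "rating": 5},
    {"request_id": "r2", "rating": 1},
    {"request_id": "r3", "rating": 2, "correction": "Edited summary text"},
    {"request_id": "r4", "rating": 3},
]
result = curate_feedback(batch)
print({name: len(items) for name, items in result.items()})
# {'eval_gold': 1, 'fine_tuning': 1, 'review_queue': 1}
```

Routing corrected outputs to the evaluation set first, and only later to fine-tuning, keeps the gold data from leaking into training before it has been used to measure anything.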

62.2.4 Model Registry

Table 62.2.1c: Model Registry Comparison (as of 2026).

Registry Feature	MLflow	W&B	Hugging Face Hub
Model versioning	Yes (stages)	Yes (aliases)	Yes (revisions)
Prompt versioning	Via artifacts	Via artifacts	Via model card
A/B experiment tracking	Native	Native	Limited
Deployment integration	SageMaker, Azure ML	Launch	Inference Endpoints
Self-hosted option	Yes (open source)	Enterprise	Yes (enterprise)

# Experiment tracking setup
import mlflow

# Log a prompt experiment to MLflow
with mlflow.start_run(run_name="prompt_v2.1_test"):
    # Configuration that defines this prompt version
    mlflow.log_param("prompt_version", "v2.1")
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)

    # Log evaluation metrics
    mlflow.log_metric("mean_rating", 4.2)
    mlflow.log_metric("hallucination_rate", 0.03)
    mlflow.log_metric("p50_latency_ms", 820)
    mlflow.log_metric("cost_per_request", 0.0023)

    # Log the prompt template as an artifact
    mlflow.log_text(
        "Write a 2-sentence summary of:\n{text}",
        "prompt_template.txt",
    )

Code Fragment 62.2.4: Logging prompt experiments to MLflow with full parameter tracking, metric recording, and artifact storage. Notice how the prompt template, model name, temperature, and evaluation scores are all captured together in a single experiment run. This ensures that any result can be traced back to the exact configuration that produced it (sampling at a nonzero temperature means individual outputs will still vary) and that comparisons between prompt versions use consistent methodology.

Note

Prompt versioning should capture not just the template text but also the model name, temperature, max tokens, system prompt, and any few-shot examples. A prompt that works well with GPT-4o may fail with Claude or Llama, so the model is part of the prompt's identity.
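One way to make the model part of the prompt's identity is to derive the version ID from the whole configuration. The sketch below hashes a canonical JSON serialization with SHA-256; the function name, the field names in the example config, and the 12-character truncation are assumptions made for illustration.

```python
import hashlib
import json


def prompt_version_id(config: dict) -> str:
    """Derive a content-addressed version ID from the full prompt configuration."""
    # Sorted keys and fixed separators give the same bytes for the same content,
    # regardless of the order in which the dict was built.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


config = {
    "template": "Write a 2-sentence summary of:\n{text}",
    "system_prompt": "You are a concise assistant.",
    "model": "gpt-4o-mini",
    "temperature": 0.7,
    "max_tokens": 256,
    "few_shot_examples": [],
}

version = prompt_version_id(config)
changed = prompt_version_id({**config, "temperature": 0.2})
print(version != changed)  # True: any change to the config yields a new ID
```

Identical configurations always map to the same ID, so two systems can never disagree about what a given version contains, and duplicate registrations are easy to detect.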

Warning

A/B tests on LLM outputs require larger sample sizes than traditional web experiments because LLM quality metrics (like human ratings or LLM-as-Judge scores) have high variance. Plan for at least 200 to 500 samples per variant before drawing conclusions, and always compute confidence intervals rather than relying on point estimates.
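A confidence interval for the difference in mean rating between two variants can be computed with the standard library. This sketch uses a normal approximation with unpooled variances, which is reasonable at the sample sizes recommended above but not for a handful of observations; mean_diff_ci is a hypothetical helper name and the ratings are synthetic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def mean_diff_ci(a: Sequence[float], b: Sequence[float],
                 confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (difference, lower, upper) for mean(b) - mean(a)."""
    diff = mean(b) - mean(a)
    # Standard error of the difference, using each variant's sample variance.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Synthetic 1-5 ratings for two variants, 200 observations each.
ratings_a = [3, 4, 3, 4] * 50
ratings_b = [4, 5, 4, 5] * 50

diff, lower, upper = mean_diff_ci(ratings_a, ratings_b)
print(f"difference = {diff:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

If the interval includes zero, the data do not yet separate the variants, however large the point estimate looks.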

Key Insight

The data flywheel is the most powerful long-term advantage of a production LLM system. Every user interaction generates data that can improve evaluation sets, fine-tuning corpora, and retrieval indices. Teams that invest in feedback collection infrastructure early will compound improvements over time, while teams that skip it remain stuck with static prompts and models.

Real-World Scenario

Building a Data Flywheel for Customer Support Summarization

Who: An AI product team at a SaaS company with 2 million support tickets per year

Situation: The team deployed an LLM to auto-summarize support conversations for agent handoffs. Initial quality was acceptable but inconsistent.

Problem: Without structured feedback, the team had no way to identify which summaries were helpful and which were misleading. A/B testing prompt variants took weeks because they lacked infrastructure to track variant assignments.

Dilemma: Investing in feedback infrastructure would delay the next feature launch by a month. Skipping it meant continuing to iterate blindly on prompt quality.

Decision: They built a lightweight feedback loop: thumbs up/down on each summary, plus a "needs correction" option that captured the agent's edited version.

How: Each summary was tagged with the prompt version hash, model name, and temperature. Weekly reports aggregated ratings by prompt version. Edited summaries became evaluation gold data and eventually fine-tuning examples.

Result: After three months, the feedback dataset contained 12,000 rated summaries and 800 corrected versions. A fine-tuned model trained on corrections scored 23% higher on faithfulness than the original prompted approach.

Lesson: The data flywheel is not theoretical; it produces measurable quality gains within months, but only if feedback collection infrastructure is built before optimization begins.

Research Frontier

Open Questions:

What does a mature LLMOps practice look like, and how does it differ from traditional MLOps? LLMs challenge existing MLOps assumptions about model training, versioning, and A/B testing.
How should organizations manage prompt versioning and migration when prompts are a critical part of the application logic?

Recent Developments (2024-2025):

Prompt management platforms like PromptLayer, Humanloop, and Langfuse added version control, A/B testing, and rollback capabilities for prompts, treating them as first-class deployment artifacts.

Explore Further: Set up a prompt versioning workflow for an LLM application using an open-source tool. Create three prompt versions, A/B test them against your evaluation set, and practice a rollback when one underperforms.

Key Takeaways

Version prompts with content-addressable hashing and store the complete configuration (model, temperature, system prompt, few-shot examples) alongside the template.
Use hash-based traffic splitting for deterministic A/B assignment that remains consistent across user sessions.
Collect structured feedback (thumbs up/down, ratings, corrections) on every production response to fuel the data flywheel.
Plan for large sample sizes (200 to 500 per variant) and compute confidence intervals for LLM A/B tests due to high output variance.
Track experiments in a model registry (MLflow, W&B) that captures prompts, metrics, and model configurations together.
The data flywheel is a production LLM system's most valuable long-term asset; invest in feedback infrastructure early.

Self-Check

1. Why should prompt versioning use content-addressable hashing rather than sequential version numbers?

Content-addressable hashing ensures that the version ID is derived from the prompt content itself, making it impossible to accidentally assign the same version number to different content or to have two different systems disagree on what "v3" means. It also makes deduplication trivial: identical prompts always produce the same hash.

2. Why is hash-based traffic splitting preferred over random assignment in A/B tests?

Hash-based splitting is deterministic: the same user always sees the same variant across sessions. Random assignment could show different variants to the same user on different requests, contaminating the experiment and making it impossible to measure the effect of a variant on user behavior over time.

3. What is a data flywheel and why is it important for LLM applications?

A data flywheel is a virtuous cycle where production usage generates feedback data, which is curated into evaluation and training sets, which improves the model, which generates better interactions, producing more valuable data. It is important because LLM applications that leverage this cycle compound their quality improvements over time, creating a durable competitive advantage.

4. What metadata should be stored alongside a prompt version for full reproducibility?

Full reproducibility requires storing the prompt template, model name and version, temperature, max tokens, top-p, system prompt, few-shot examples, stop sequences, and any post-processing logic. The model is part of the prompt's identity because the same template can produce very different results with different models.

5. Why do LLM A/B tests require larger sample sizes than traditional web experiments?

LLM quality metrics (human ratings, LLM-as-Judge scores, task success rates) have much higher variance than binary click/conversion metrics. The probabilistic nature of LLM outputs means even identical inputs produce different results across runs. This high variance requires more samples to achieve statistical significance and reliable effect size estimates.

Exercises

Exercise 28.4.1: Prompt Versioning (Conceptual)

Explain why prompts should be treated as versioned artifacts, similar to source code. Describe the minimum metadata that should be stored alongside each prompt version.

Answer Sketch

Prompts directly control model behavior, so untracked changes can cause regressions. Minimum metadata: content hash (for identity), author, timestamp, target model, evaluation scores on a standard test set, deployment status (draft/staging/production), and a description of the change. This enables rollback, A/B testing, and audit trails, just as Git provides for code.

Exercise 28.4.2: A/B Testing for LLMs (Conceptual)

Describe how to set up an A/B test comparing two system prompt variants for a customer support chatbot. Include: traffic splitting, metrics to track, statistical test to use, and minimum sample size calculation.

Answer Sketch

Traffic splitting: assign users (not requests) to variant A or B with a consistent hash on user_id, so each user always sees the same variant. Metrics: task completion rate (primary), user satisfaction score, average handle time, escalation rate. Statistical test: chi-squared test for completion rate, Mann-Whitney U for satisfaction scores. Sample size: to detect a 5-percentage-point change (60% to 65%) in completion rate at 80% power and a 5% significance level, you need approximately 1,500 users per variant. Run for at least one full business cycle (7 days) to account for daily patterns.

Exercise 28.4.3: Feedback Loop Design (Coding)

Design a feedback collection system that captures thumbs-up/thumbs-down ratings, optional text feedback, and automatic quality signals (response length, latency, tool call success rate). Describe the data pipeline from collection to actionable improvements.

Answer Sketch

Collection: UI sends feedback events with {trace_id, rating, comment, timestamp}. Pipeline: (1) Store in an analytics database joined with trace data. (2) Aggregate daily quality metrics. (3) Flag low-rated responses for human review. (4) Export highly-rated (prompt, response) pairs to a fine-tuning dataset. (5) Export low-rated responses to a "hard examples" evaluation set. (6) Surface patterns in negative feedback (e.g., topic clustering) to guide prompt improvements. The key is closing the loop: feedback must flow back into evaluation data and system improvements.

Exercise 28.4.4: LLMOps Pipeline (Analysis)

Sketch the end-to-end LLMOps pipeline for a production chatbot, from prompt change to deployment. Include: prompt version control, automated evaluation, A/B testing, monitoring, and feedback-driven improvement. Identify the manual vs. automated steps.

Answer Sketch

Pipeline: (1) Engineer edits prompt in version control (manual). (2) CI runs automated evaluation suite (automated). (3) If scores pass threshold, deploy to staging (automated). (4) Run canary tests on staging (automated). (5) If canary passes, start A/B test with 10% traffic (automated). (6) After reaching sample size, analyze results (semi-automated). (7) If A/B test wins, promote to 100% (manual approval, automated execution). (8) Monitor quality metrics (automated). (9) Collect user feedback (automated). (10) Review feedback and plan next iteration (manual). Steps 2-5 and 8-9 should be fully automated; steps 1, 6-7, and 10 benefit from human judgment.

Exercise 28.4.5: Incident Response (Discussion)

Your LLM chatbot starts generating offensive responses after a provider model update. Describe the incident response process, from detection to resolution, including immediate containment, investigation, and post-mortem steps.

Answer Sketch

Detection: output guardrail alerts or user reports trigger the incident. Containment (minutes): roll back to the previous model version or activate a safe-mode prompt with stricter instructions. Investigation (hours): analyze flagged responses, identify the root cause (model update, prompt interaction, new attack vector). Resolution: implement additional guardrails, update the canary test suite to cover the failure mode, and coordinate with the provider. Post-mortem: document the timeline, root cause, impact, and preventive measures. Update the incident playbook and add regression tests.

What Comes Next

In the next chapter, Chapter 47: Safety, Ethics, and Regulation, we shift to security threats, hallucination detection in Section 49.5, regulation, and governance for LLM systems.

Further Reading

Core References

Zaharia, M. et al. (2024). The Shift from Models to Compound AI Systems. Berkeley AI Research. Influential blog post arguing that production AI has shifted from single models to compound systems combining retrieval, tools, and multiple models. Frames LLMOps as system-level optimization rather than model-level tuning. Essential context for understanding modern AI operations.

MLflow. (2024). MLflow: An Open Source Platform for the Machine Learning Lifecycle. Documentation for MLflow's experiment tracking, model registry, and deployment tools, now with LLM-specific features like prompt tracking. Covers the complete model lifecycle from experimentation to production. Recommended for teams standardizing their ML operations workflow.

Weights & Biases. (2024). W&B Prompts: Prompt Engineering and LLMOps. Guide to W&B's prompt management features including versioning, evaluation, and A/B testing integration. Demonstrates practical prompt lifecycle management with dashboards. Useful for teams that need collaborative prompt development with version control.

Shankar, S. et al. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Examines the reliability of using LLMs to evaluate other LLMs, revealing systematic biases in automated evaluation. Proposes calibration methods for LLM-as-judge approaches. Critical reading for teams relying on automated evaluation in production.

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The definitive guide to running statistically rigorous A/B tests in production systems, written by practitioners with experience at Microsoft, Google, and LinkedIn. Its methodology directly applies to testing prompt variants and model versions. Essential for data-driven LLMOps teams.

Hugging Face. (2024). Hugging Face Hub Documentation. Documentation for the Hugging Face Hub's model hosting, dataset sharing, and model versioning features. Covers model cards, Spaces deployment, and inference endpoints for production use. Essential reference for teams using Hugging Face as their model registry and deployment platform.