Part VIII: Evaluation & Production
Chapter 31: Production Engineering & Operations

LLMOps & Continuous Improvement

The best LLM system is the one that gets better every day without anyone noticing it changed.

Big Picture

LLMOps extends MLOps with practices specific to language model applications. Prompts are code that must be versioned. Model behavior must be tested in production through A/B experiments with statistical rigor. User feedback must flow back into evaluation datasets, fine-tuning data, and prompt improvements to create a continuously improving system. Building on the evaluation practices from Chapter 29 and the observability tools from Section 29.5, this section covers the operational practices that separate prototype LLM apps from production-grade systems that improve over time.

Prerequisites

Before starting, make sure you are familiar with production fundamentals as covered in Section 31.1: Application Architecture and Deployment. Section 29.1 provides the measurement framework that LLMOps builds upon.

1. Prompt Versioning

A teammate "just tweaks the system prompt real quick" in production. Response quality drops 15%. Nobody knows what changed because prompts live as string literals scattered across the codebase with no version history, no diff, and no rollback mechanism. In traditional software, this is the equivalent of editing production code without source control. It sounds absurd, yet most LLM teams operate exactly this way.

Prompt versioning solves this by treating every prompt as a versioned artifact with a content hash, metadata, and a deployment history.

Fun Fact

The unofficial motto of LLMOps is "git for prompts, but also for the model, the data, the config, and your sanity." Most teams discover they need prompt versioning the same way most people discover they need backups: right after losing something important.

By the end of this section, you will have a complete LLMOps toolkit: content-addressable prompt registries, A/B testing frameworks with statistical rigor, and feedback loops that continuously improve your system. We start by building a prompt registry from scratch.

Mental Model: The Recipe Binder

LLMOps is like running a professional kitchen where every recipe (prompt) lives in a version-controlled binder. When a chef modifies a recipe, the old version is preserved, the new one gets a date stamp, and you can always roll back to Tuesday's sauce if Wednesday's experiment flops. A/B testing is running two dishes as daily specials and tracking which one customers reorder. The feedback loop is reading the comment cards. Unlike a kitchen, though, your "recipes" are serving thousands of customers simultaneously, so a bad tweak scales instantly.

[Figure: The LLMOps lifecycle, from prompt development through production monitoring to continuous improvement. 1. Develop: write prompt template, version with hash, offline eval on a test set, choose model + params. 2. A/B Test: hash-based splitting, run both variants, collect online metrics, statistical evaluation. 3. Deploy: promote winner, model registry update, gradual rollout, rollback ready. 4. Monitor: latency tracking, quality scores, cost per request, drift detection. A feedback loop (data flywheel) carries user corrections, thumbs up/down, and failure patterns back into development, while a shared model + prompt registry (MLflow / W&B / HuggingFace Hub) tracks all artifacts.]
Figure 31.4.1: The LLMOps lifecycle connects four phases (develop, test, deploy, monitor) through a continuous feedback loop, with a shared registry tracking all artifacts. The code below puts this into practice.


# A content-addressable prompt registry: each version is identified by a
# SHA-256 hash of its template and stored as a JSON file with metadata.
import json, hashlib
from datetime import datetime
from pathlib import Path

class PromptRegistry:
    """Version and manage prompts with content-addressable storage."""

    def __init__(self, store_path: str = "prompts/"):
        self.store = Path(store_path)
        self.store.mkdir(parents=True, exist_ok=True)

    def register(self, name: str, template: str, metadata: dict = None) -> str:
        content_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
        version = {
            "name": name,
            "hash": content_hash,
            "template": template,
            "metadata": metadata or {},
            "created_at": datetime.utcnow().isoformat(),
        }
        path = self.store / f"{name}_{content_hash}.json"
        path.write_text(json.dumps(version, indent=2))
        return content_hash

    def get(self, name: str, version_hash: str = None) -> dict:
        if version_hash:
            path = self.store / f"{name}_{version_hash}.json"
            return json.loads(path.read_text())
        # No hash given: return the most recently written version.
        # (Sorting by filename would order by hash, not by recency.)
        versions = sorted(self.store.glob(f"{name}_*.json"),
                          key=lambda p: p.stat().st_mtime)
        return json.loads(versions[-1].read_text()) if versions else None

registry = PromptRegistry()
v1 = registry.register("summarizer", "Summarize: {text}")
v2 = registry.register("summarizer", "Provide a concise summary of: {text}")
print(f"v1={v1}, v2={v2}")
v1=a3f8b2c1d4e5, v2=7b9c1d3e5f8a
Code Fragment 31.4.1: A content-addressable prompt registry. Each template is stored under a hash of its own content, so identical prompts deduplicate automatically and any prior version can be retrieved by hash for rollback.
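The content-addressable property is worth verifying directly: because the version ID is derived from the template text itself, identical prompts always map to the same hash (deduplication is free), while any edit, however small, produces a new identity. A minimal check, using the same hashing scheme as the registry above:

```python
import hashlib

def prompt_hash(template: str) -> str:
    """Same scheme as PromptRegistry: first 12 hex chars of SHA-256."""
    return hashlib.sha256(template.encode()).hexdigest()[:12]

# Identical content always yields the identical version ID
assert prompt_hash("Summarize: {text}") == prompt_hash("Summarize: {text}")
# Any edit, even an extra space, creates a new version identity
assert prompt_hash("Summarize: {text}") != prompt_hash("Summarize:  {text}")
```

This is what makes rollback trivial: redeploying an old version is just fetching its hash, with no ambiguity about what "v3" meant.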
Key Insight

A prompt's content hash is its identity. Because the hash is derived from the template itself, two systems can never disagree about what a given version contains, identical prompts deduplicate automatically, and rollback is as simple as fetching an earlier hash. That single invariant, if the hash changed, the behavior may have changed, is what makes production prompt changes traceable.

Fun Fact

Netflix once ran an A/B test on thumbnail images and discovered that showing a villain's face increased click-through rates more than showing the hero. LLMOps teams report a similar phenomenon: the prompt variant that "feels" best to engineers often loses to a shorter, blunter version when measured against real users.

The fundamental insight behind LLMOps is that prompts are code. They should be versioned, tested, reviewed, and deployed with the same discipline as application code. The difference is that prompts interact with a non-deterministic system, so testing must be statistical rather than deterministic (connecting back to the evaluation rigor from Section 29.2). A prompt change that improves average quality by 5% but introduces a 2% regression on edge cases requires the same cost-benefit analysis as a code change that speeds up the happy path but breaks an uncommon workflow.

2. A/B Testing Framework

Figure 31.4.2 illustrates the end-to-end A/B testing pipeline, from hash-based traffic splitting through to online metric collection and statistical evaluation. Code Fragment 31.4.2 below puts this into practice.

[Figure: An incoming request passes through a hash-based traffic splitter (50% / 50%) to either Variant A (Prompt v2.1) or Variant B (Prompt v2.2), and the responses feed an online evaluation stage that collects metrics and statistics.]
Figure 31.4.2: A/B testing pipeline for LLM prompt variants with hash-based traffic splitting and online metric collection.



# Deterministic A/B assignment: hashing the user ID means each user
# always sees the same variant across sessions.
import hashlib
from dataclasses import dataclass

@dataclass
class ABExperiment:
    """Simple A/B test for prompt variants."""
    name: str
    variant_a: str
    variant_b: str
    traffic_split: float = 0.5  # fraction of traffic going to variant B

    def assign(self, user_id: str) -> str:
        """Deterministic assignment based on a hash of the user ID."""
        h = hashlib.md5(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) / 0xFFFFFFFF
        return "B" if bucket < self.traffic_split else "A"

    def get_prompt(self, user_id: str) -> str:
        return self.variant_a if self.assign(user_id) == "A" else self.variant_b

exp = ABExperiment(
    name="summarizer_prompt",
    variant_a="Summarize the following text:\n{text}",
    variant_b="Write a 2-sentence summary:\n{text}",
)
for uid in ["user_101", "user_202", "user_303"]:
    print(f"{uid} -> variant {exp.assign(uid)}")
user_101 -> variant A
user_202 -> variant B
user_303 -> variant A
Code Fragment 31.4.2: Deterministic A/B assignment for prompt variants. Because assignment is a hash of the experiment name and user ID, the same user always receives the same variant across sessions, keeping the experiment uncontaminated.

3. Online Evaluation and Feedback Loops

This snippet implements an online evaluation pipeline that collects user feedback and logs quality metrics in production.


# Collect structured user feedback in production and aggregate it by variant.
from dataclasses import dataclass, field
from datetime import datetime
import statistics

@dataclass
class FeedbackCollector:
    """Collect and aggregate user feedback for LLM outputs."""
    records: list = field(default_factory=list)

    def log(self, request_id: str, variant: str, rating: int,
            feedback_text: str = "", latency_ms: float = 0):
        self.records.append({
            "request_id": request_id, "variant": variant,
            "rating": rating, "feedback": feedback_text,
            "latency_ms": latency_ms,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def summary(self) -> dict:
        by_variant = {}
        for r in self.records:
            by_variant.setdefault(r["variant"], []).append(r["rating"])
        return {
            v: {"mean": statistics.mean(ratings), "n": len(ratings)}
            for v, ratings in by_variant.items()
        }
Code Fragment 31.4.3: This FeedbackCollector class captures structured user feedback (ratings, corrections, metadata) alongside the prompt version that produced each response. The summary method aggregates feedback by variant, enabling data-driven comparison of prompt versions. This feedback data also feeds the data flywheel for continuous improvement.
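Once summary statistics are in hand, a simple promotion gate can turn them into a decision. The sketch below is illustrative only: the `pick_winner` helper and its thresholds are assumptions, not part of the FeedbackCollector API.

```python
def pick_winner(summary: dict, min_n: int = 200, min_lift: float = 0.2):
    """Promote a variant only if every arm has enough data and the gap is material."""
    if any(stats["n"] < min_n for stats in summary.values()):
        return None  # sample size too small to call; keep collecting
    best = max(summary, key=lambda v: summary[v]["mean"])
    rest = [summary[v]["mean"] for v in summary if v != best]
    if rest and summary[best]["mean"] - max(rest) < min_lift:
        return None  # difference too small to act on
    return best

# Hypothetical aggregates, shaped like FeedbackCollector.summary() output
summary = {"A": {"mean": 3.9, "n": 250}, "B": {"mean": 4.3, "n": 240}}
print(pick_winner(summary))  # B clears both the sample-size and lift gates
```

A gate like this keeps humans in the loop for borderline results while automating the easy calls, which matches the semi-automated promotion step in the LLMOps lifecycle.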
Tip

Set a minimum sample size before drawing conclusions from A/B test results. A common mistake is stopping a test after 50 queries because variant B "looks 8% better." With small samples, random fluctuation easily produces 8% swings. Use at least 200 observations per variant for rating-based metrics, and 500 or more per variant for binary metrics like thumbs-up rate, to achieve statistical power above 80%.
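The claim about small-sample swings is easy to demonstrate with a quick simulation (a sketch, not part of the chapter's framework): two variants with the *same* true thumbs-up rate routinely look several points apart at n = 50, and the apparent gap shrinks as samples grow.

```python
import random
import statistics

random.seed(7)

def apparent_gap(n: int, p: float = 0.6) -> float:
    """Thumbs-up-rate gap between two IDENTICAL variants, each observed n times."""
    rate_a = sum(random.random() < p for _ in range(n)) / n
    rate_b = sum(random.random() < p for _ in range(n)) / n
    return abs(rate_a - rate_b)

for n in (50, 500):
    gaps = [apparent_gap(n) for _ in range(1000)]
    print(f"n={n}: median apparent gap = {statistics.median(gaps):.1%}")
```

At n = 50 the typical gap between two identical variants is in the mid-single digits, exactly the kind of "8% better" mirage the tip warns about; at n = 500 it falls to a few points.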

Over time, the feedback from A/B tests feeds into a continuous improvement loop. Figure 31.4.3 depicts this data flywheel, where user interactions generate feedback that becomes curated evaluation data for model improvement.

[Figure: The data flywheel: user interactions produce feedback and logs, which are curated into evaluation data, which yields an improved model, which in turn generates better interactions.]
Figure 31.4.3: The data flywheel turns production usage into training data, creating a self-improving cycle.

4. Model Registry

Comparison of model registry features:

Registry Feature         | MLflow              | W&B           | HuggingFace Hub
-------------------------|---------------------|---------------|--------------------
Model versioning         | Yes (stages)        | Yes (aliases) | Yes (revisions)
Prompt versioning        | Via artifacts       | Via artifacts | Via model card
A/B experiment tracking  | Native              | Native        | Limited
Deployment integration   | SageMaker, Azure ML | Launch        | Inference Endpoints
Self-hosted option       | Yes (open source)   | Enterprise    | Yes (enterprise)

Code Fragment 31.4.4 demonstrates this approach in practice.


# Log a prompt experiment to MLflow: parameters, evaluation metrics,
# and the prompt template itself as an artifact.
import mlflow

with mlflow.start_run(run_name="prompt_v2.1_test"):
    mlflow.log_param("prompt_version", "v2.1")
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)

    # Log evaluation metrics
    mlflow.log_metric("mean_rating", 4.2)
    mlflow.log_metric("hallucination_rate", 0.03)
    mlflow.log_metric("p50_latency_ms", 820)
    mlflow.log_metric("cost_per_request", 0.0023)

    # Log the prompt template as an artifact
    mlflow.log_text(
        "Write a 2-sentence summary of:\n{text}",
        "prompt_template.txt"
    )
Code Fragment 31.4.4: Logging prompt experiments to MLflow with full parameter tracking, metric recording, and artifact storage. Notice how the prompt template, model name, temperature, and evaluation scores are all captured together in a single experiment run. This ensures that any result can be reproduced exactly and that comparisons between prompt versions use consistent methodology.
Note

Prompt versioning should capture not just the template text but also the model name, temperature, max tokens, system prompt, and any few-shot examples. A prompt that works well with GPT-4o may fail with Claude or Llama, so the model is part of the prompt's identity.
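The hashing idea from Code Fragment 31.4.1 extends naturally to that full configuration. A minimal sketch under stated assumptions: the model names are illustrative, and `config_hash` is not part of any registry API shown earlier.

```python
import hashlib
import json

def config_hash(template: str, model: str, temperature: float,
                system_prompt: str = "", few_shot: list = None) -> str:
    """Version identity covering the whole generation config, not just the template."""
    payload = json.dumps({
        "template": template, "model": model, "temperature": temperature,
        "system_prompt": system_prompt, "few_shot": few_shot or [],
    }, sort_keys=True)  # canonical key order so equal configs hash equally
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Same template, different model -> a different version identity
h1 = config_hash("Summarize: {text}", "gpt-4o-mini", 0.7)
h2 = config_hash("Summarize: {text}", "claude-sonnet", 0.7)
assert h1 != h2
```

Serializing with `sort_keys=True` is the important detail: without a canonical ordering, logically identical configs could hash differently.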

Warning

A/B tests on LLM outputs require larger sample sizes than traditional web experiments because LLM quality metrics (like human ratings or LLM-as-Judge scores) have high variance. Plan for at least 200 to 500 samples per variant before drawing conclusions, and always compute confidence intervals rather than relying on point estimates.
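Computing such a confidence interval needs nothing beyond the standard library. The sketch below uses a normal approximation for the difference in mean ratings (reasonable with 30 or more observations per arm); the rating lists are fabricated purely for illustration.

```python
import statistics
from statistics import NormalDist

def mean_diff_ci(a: list, b: list, conf: float = 0.95):
    """Normal-approximation CI for mean(b) - mean(a)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95%
    diff = statistics.mean(b) - statistics.mean(a)
    return diff - z * se, diff + z * se

# Hypothetical 1-5 ratings for two prompt variants, n=250 each
a = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4] * 25
b = [4, 4, 5, 4, 5, 3, 4, 5, 4, 4] * 25
lo, hi = mean_diff_ci(a, b)
print(f"mean diff 95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the variant difference is unlikely to be noise; if it straddles zero, the honest conclusion is "keep collecting data", not "B looks better".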

Key Insight

The data flywheel is the most powerful long-term advantage of a production LLM system. Every user interaction generates data that can improve evaluation sets, fine-tuning corpora, and retrieval indices. Teams that invest in feedback collection infrastructure early will compound improvements over time, while teams that skip it remain stuck with static prompts and models.

Self-Check

1. Why should prompt versioning use content-addressable hashing rather than sequential version numbers?

Show Answer
Content-addressable hashing ensures that the version ID is derived from the prompt content itself, making it impossible to accidentally assign the same version number to different content or to have two different systems disagree on what "v3" means. It also makes deduplication trivial: identical prompts always produce the same hash.

2. Why is hash-based traffic splitting preferred over random assignment in A/B tests?

Show Answer
Hash-based splitting is deterministic: the same user always sees the same variant across sessions. Random assignment could show different variants to the same user on different requests, contaminating the experiment and making it impossible to measure the effect of a variant on user behavior over time.

3. What is a data flywheel and why is it important for LLM applications?

Show Answer
A data flywheel is a virtuous cycle where production usage generates feedback data, which is curated into evaluation and training sets, which improves the model, which generates better interactions, producing more valuable data. It is important because LLM applications that leverage this cycle compound their quality improvements over time, creating a durable competitive advantage.

4. What metadata should be stored alongside a prompt version for full reproducibility?

Show Answer
Full reproducibility requires storing the prompt template, model name and version, temperature, max tokens, top-p, system prompt, few-shot examples, stop sequences, and any post-processing logic. The model is part of the prompt's identity because the same template can produce very different results with different models.

5. Why do LLM A/B tests require larger sample sizes than traditional web experiments?

Show Answer
LLM quality metrics (human ratings, LLM-as-Judge scores, task success rates) have much higher variance than binary click/conversion metrics. The probabilistic nature of LLM outputs means even identical inputs produce different results across runs. This high variance requires more samples to achieve statistical significance and reliable effect size estimates.
Real-World Scenario: Building a Data Flywheel for Customer Support Summarization

Who: An AI product team at a SaaS company with 2 million support tickets per year

Situation: The team deployed an LLM to auto-summarize support conversations for agent handoffs. Initial quality was acceptable but inconsistent.

Problem: Without structured feedback, the team had no way to identify which summaries were helpful and which were misleading. A/B testing prompt variants took weeks because they lacked infrastructure to track variant assignments.

Dilemma: Investing in feedback infrastructure would delay the next feature launch by a month. Skipping it meant continuing to iterate blindly on prompt quality.

Decision: They built a lightweight feedback loop: thumbs up/down on each summary, plus a "needs correction" option that captured the agent's edited version.

How: Each summary was tagged with the prompt version hash, model name, and temperature. Weekly reports aggregated ratings by prompt version. Edited summaries became evaluation gold data and eventually fine-tuning examples.

Result: After three months, the feedback dataset contained 12,000 rated summaries and 800 corrected versions. A fine-tuned model trained on corrections scored 23% higher on faithfulness than the original prompted approach.

Lesson: The data flywheel is not theoretical; it produces measurable quality gains within months, but only if feedback collection infrastructure is built before optimization begins.

Key Takeaways
  • Version prompts with content-addressable hashing and store the complete configuration (model, temperature, system prompt, few-shot examples) alongside the template.
  • Use hash-based traffic splitting for deterministic A/B assignment that remains consistent across user sessions.
  • Collect structured feedback (thumbs up/down, ratings, corrections) on every production response to fuel the data flywheel.
  • Plan for large sample sizes (200 to 500 per variant) and compute confidence intervals for LLM A/B tests due to high output variance.
  • Track experiments in a model registry (MLflow, W&B) that captures prompts, metrics, and model configurations together.
  • The data flywheel is a production LLM system's most valuable long-term asset; invest in feedback infrastructure early.
Research Frontier

Open Questions:

  • What does a mature LLMOps practice look like, and how does it differ from traditional MLOps? LLMs challenge existing MLOps assumptions about model training, versioning, and A/B testing.
  • How should organizations manage prompt versioning and migration when prompts are a critical part of the application logic?

Recent Developments (2024-2025):

  • Prompt management platforms (2024-2025) like PromptLayer, Humanloop, and Langfuse added version control, A/B testing, and rollback capabilities for prompts, treating them as first-class deployment artifacts.

Explore Further: Set up a prompt versioning workflow for an LLM application using an open-source tool. Create three prompt versions, A/B test them against your evaluation set, and practice a rollback when one underperforms.

Exercises

Exercise 31.4.1: Prompt Versioning Conceptual

Explain why prompts should be treated as versioned artifacts, similar to source code. Describe the minimum metadata that should be stored alongside each prompt version.

Answer Sketch

Prompts directly control model behavior, so untracked changes can cause regressions. Minimum metadata: content hash (for identity), author, timestamp, target model, evaluation scores on a standard test set, deployment status (draft/staging/production), and a description of the change. This enables rollback, A/B testing, and audit trails, just as Git provides for code.

Exercise 31.4.2: A/B Testing for LLMs Conceptual

Describe how to set up an A/B test comparing two system prompt variants for a customer support chatbot. Include: traffic splitting, metrics to track, statistical test to use, and minimum sample size calculation.

Answer Sketch

Traffic splitting: randomly assign users (not requests) to variant A or B using a consistent hash on user_id. Metrics: task completion rate (primary), user satisfaction score, average handle time, escalation rate. Statistical test: chi-squared test for completion rate, Mann-Whitney U for satisfaction scores. Sample size: for a 5% minimum detectable effect on a 60% baseline completion rate at 80% power and 5% significance, you need approximately 1,500 users per variant. Run for at least one full business cycle (7 days) to account for daily patterns.
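The "approximately 1,500 users per variant" figure can be checked with a standard-library power calculation. This is a sketch using the normal approximation for a two-proportion z-test, with the numbers from the answer above (60% baseline, 65% target, 5% significance, 80% power).

```python
import math
from statistics import NormalDist

def two_prop_sample_size(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant n to detect p1 vs p2 with a two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha=0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

n = two_prop_sample_size(0.60, 0.65)
print(n)  # ~1471 per variant, i.e. roughly the 1,500 cited above
```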

Exercise 31.4.3: Feedback Loop Design Coding

Design a feedback collection system that captures thumbs-up/thumbs-down ratings, optional text feedback, and automatic quality signals (response length, latency, tool call success rate). Describe the data pipeline from collection to actionable improvements.

Answer Sketch

Collection: UI sends feedback events with {trace_id, rating, comment, timestamp}. Pipeline: (1) Store in an analytics database joined with trace data. (2) Aggregate daily quality metrics. (3) Flag low-rated responses for human review. (4) Export highly-rated (prompt, response) pairs to a fine-tuning dataset. (5) Export low-rated responses to a "hard examples" evaluation set. (6) Surface patterns in negative feedback (e.g., topic clustering) to guide prompt improvements. The key is closing the loop: feedback must flow back into evaluation data and system improvements.
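Steps (4) and (5) of that pipeline reduce to a routing function over rated traces. A minimal sketch; the record shape and rating thresholds here are assumptions for illustration.

```python
def route_feedback(records: list) -> tuple:
    """Split rated traces into fine-tuning exports and a hard-examples eval set."""
    finetune, hard_examples = [], []
    for r in records:
        if r["rating"] >= 4:        # highly rated -> (prompt, response) pair
            finetune.append((r["prompt"], r["response"]))
        elif r["rating"] <= 2:      # low rated -> hard-examples eval set
            hard_examples.append(r)
    return finetune, hard_examples

records = [
    {"prompt": "p1", "response": "good summary", "rating": 5},
    {"prompt": "p2", "response": "hallucinated", "rating": 1},
    {"prompt": "p3", "response": "mediocre", "rating": 3},  # routed to neither set
]
pairs, hard = route_feedback(records)
print(len(pairs), len(hard))  # 1 1
```

Middling ratings are deliberately dropped: ambiguous examples dilute both the fine-tuning corpus and the hard-examples set.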

Exercise 31.4.4: LLMOps Pipeline Analysis

Sketch the end-to-end LLMOps pipeline for a production chatbot, from prompt change to deployment. Include: prompt version control, automated evaluation, A/B testing, monitoring, and feedback-driven improvement. Identify the manual vs. automated steps.

Answer Sketch

Pipeline: (1) Engineer edits prompt in version control (manual). (2) CI runs automated evaluation suite (automated). (3) If scores pass threshold, deploy to staging (automated). (4) Run canary tests on staging (automated). (5) If canary passes, start A/B test with 10% traffic (automated). (6) After reaching sample size, analyze results (semi-automated). (7) If A/B test wins, promote to 100% (manual approval, automated execution). (8) Monitor quality metrics (automated). (9) Collect user feedback (automated). (10) Review feedback and plan next iteration (manual). Steps 2-5 and 8-9 should be fully automated; steps 1, 6-7, and 10 benefit from human judgment.

Exercise 31.4.5: Incident Response Discussion

Your LLM chatbot starts generating offensive responses after a provider model update. Describe the incident response process, from detection to resolution, including immediate containment, investigation, and post-mortem steps.

Answer Sketch

Detection: output guardrail alerts or user reports trigger the incident. Containment (minutes): roll back to the previous model version or activate a safe-mode prompt with stricter instructions. Investigation (hours): analyze flagged responses, identify the root cause (model update, prompt interaction, new attack vector). Resolution: implement additional guardrails, update the canary test suite to cover the failure mode, and coordinate with the provider. Post-mortem: document the timeline, root cause, impact, and preventive measures. Update the incident playbook and add regression tests.

What Comes Next

In the next chapter, Chapter 32: Safety, Ethics, and Regulation, we shift to security threats, hallucination, bias, regulation, and governance for LLM systems.

Further Reading & References
Core References

Zaharia, M. et al. (2024). The Shift from Models to Compound AI Systems. Berkeley AI Research.

Influential blog post arguing that production AI has shifted from single models to compound systems combining retrieval, tools, and multiple models. Frames LLMOps as system-level optimization rather than model-level tuning. Essential context for understanding modern AI operations.

Foundational Essay

MLflow. (2024). MLflow: An Open Source Platform for the Machine Learning Lifecycle.

Documentation for MLflow's experiment tracking, model registry, and deployment tools, now with LLM-specific features like prompt tracking. Covers the complete model lifecycle from experimentation to production. Recommended for teams standardizing their ML operations workflow.

MLOps Platform

Weights & Biases. (2024). W&B Prompts: Prompt Engineering and LLMOps.

Guide to W&B's prompt management features including versioning, evaluation, and A/B testing integration. Demonstrates practical prompt lifecycle management with dashboards. Useful for teams that need collaborative prompt development with version control.

MLOps Platform

Shankar, S. et al. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.

Examines the reliability of using LLMs to evaluate other LLMs, revealing systematic biases in automated evaluation. Proposes calibration methods for LLM-as-judge approaches. Critical reading for teams relying on automated evaluation in production.

Evaluation Research

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.

The definitive guide to running statistically rigorous A/B tests in production systems, written by practitioners from Microsoft and Google. Its methodology directly applies to testing prompt variants and model versions. Essential for data-driven LLMOps teams.

Methodology

HuggingFace. (2024). Hugging Face Hub Documentation.

Documentation for the Hugging Face Hub's model hosting, dataset sharing, and model versioning features. Covers model cards, Spaces deployment, and inference endpoints for production use. Essential reference for teams using Hugging Face as their model registry and deployment platform.

Platform Docs