Part VIII: Evaluation & Production
Chapter 30: Observability, Monitoring & MLOps

LLM Experiment Reproducibility

"If you cannot reproduce it, you did not discover it. You just got lucky once."

Sentinel, Experimentally Rigorous AI Agent
Big Picture

Reproducibility in LLM experiments is harder than in traditional ML, and also more important. Traditional ML experiments depend on data, code, and hyperparameters. LLM experiments add several new dimensions: prompt templates, provider API versions, retrieval configurations, tool definitions, and external service behaviors. When you cannot reproduce an experiment, you cannot trust its results, compare it fairly against alternatives, or debug regressions. This section covers the tools and practices that make LLM experiments reproducible, from prompt versioning through containerized execution environments. The experimental design principles from Section 30.2 define what "reproducible" means in the LLM context.

Prerequisites

This section requires the evaluation and observability concepts from Section 30.1 through Section 30.2. Familiarity with agent systems from Section 22.1 and multi-agent patterns from Section 24.1 provides context for evaluating complex agentic applications.

A cartoon scientist robot performing the same experiment in two identical side-by-side laboratory setups and comparing results on a clipboard, with one lab having a slight variation producing different results, illustrating the challenge of reproducibility.
LLM experiments add several new reproducibility dimensions beyond traditional ML: prompt templates, API versions, retrieval configurations, and external service behaviors.

1. Why LLM Reproducibility Is Hard

LLM experiments face reproducibility challenges that do not exist in traditional machine learning. Even with identical code, data, and configuration, you may get different results because of factors outside your control: provider-side model updates, non-deterministic GPU computation, changing API behaviors, and evolving safety filters.

Fun Fact

A 2024 survey found that fewer than 15% of published LLM papers provided enough detail to reproduce their main results. The most common missing ingredient? The exact system prompt, which researchers often treat like a family recipe: shared reluctantly, if at all.

The LLM Reproducibility Stack

Tip

Version your prompts in a dedicated file (not inline in code) and track them in git alongside your application code. When a regression surfaces weeks later, git blame on your prompt file is often the fastest path to the root cause. Treat prompt changes with the same rigor as database migration scripts.
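As a minimal sketch of this practice (file layout and helper names are illustrative, not from any specific library), a prompt can be loaded from its versioned file and identified in run logs by a short content hash:

```python
import hashlib
from pathlib import Path

def hash_prompt(text: str) -> str:
    """Short content hash used to identify a prompt version in run logs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def load_prompt(path: str) -> tuple[str, str]:
    """Load a prompt template from its versioned file, returning (text, hash)."""
    text = Path(path).read_text(encoding="utf-8")
    return text, hash_prompt(text)

# Illustrative usage (path hypothetical):
# template, prompt_hash = load_prompt("prompts/support_summary_v3.txt")
# Log prompt_hash with every run; a changed hash pinpoints the prompt edit.
```

Logging the hash alongside metrics means that when two runs disagree, you can tell at a glance whether the prompt changed between them.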

To fully reproduce an LLM experiment, you must version and capture every layer of the stack. Figure 30.3.1 shows the five layers involved.

- Infrastructure (full control): Docker image, GPU type, API region
- Code + Dependencies (full control): git SHA, requirements.txt, library versions
- Data + Embeddings (full control): DVC hash, embedding model version, index snapshot
- Model Configuration (partial control): provider, model version, temperature, seed, system_fingerprint
- Prompts + Few-Shot Examples (full control): template version, variable bindings, tool schemas
Figure 30.3.1: The five layers that must be versioned for fully reproducible LLM experiments.

The practical implication is that reproducibility in LLM work is a spectrum, not a binary. At one end, you can reproduce the exact same evaluation run with the same cached model responses. At the other end, you can only reproduce the experimental methodology while accepting that outputs will differ. Most production teams operate somewhere in the middle: pinning model versions, versioning prompts and configs, and using statistical testing (from Section 29.2) to determine whether observed differences reflect real changes or noise.
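The cached-response end of that spectrum rests on a deterministic cache key: identical requests must map to identical keys so a past evaluation can be replayed byte for byte. A minimal sketch (function name illustrative):

```python
import hashlib
import json

def response_cache_key(model: str, prompt: str, params: dict) -> str:
    """Deterministic key for caching an LLM response: same inputs, same key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # canonical ordering so dict insertion order cannot vary the key
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical requests produce the same key, so a cached evaluation run replays
# the original responses regardless of later provider-side changes.
```

Any parameter that affects the output (model version, temperature, seed) belongs in the key; anything omitted becomes an invisible source of irreproducibility.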

2. Configuration Management with Hydra

Hydra is a configuration framework that enables composable, hierarchical configuration with command-line overrides. It is particularly useful for LLM experiments because it manages the many interacting parameters (model settings, prompt templates, retrieval parameters, evaluation settings) in a structured, versioned way. Code Fragments 30.3.1 and 30.3.2 below put this into practice.

# config/experiment.yaml
defaults:
  - model: gpt4o
  - prompt: rag_v2
  - retrieval: dense_rerank
  - eval: standard

experiment:
  name: rag_ablation_cot
  seed: 42
  num_eval_seeds: 5

# config/model/gpt4o.yaml
provider: openai
model_name: gpt-4o-2024-08-06
temperature: 0.0
max_tokens: 1024
seed: 42

# config/retrieval/dense_rerank.yaml
embedding_model: text-embedding-3-small
top_k: 10
reranker: cohere-rerank-v3
rerank_top_n: 3
chunk_size: 512
chunk_overlap: 50
Code Fragment 30.3.1: Composable Hydra configuration files: config/experiment.yaml plus the model and retrieval groups it references.

# experiment.py: run an LLM experiment with full configuration tracking
import json

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="config", config_name="experiment")
def run_experiment(cfg: DictConfig):
    """Run an LLM experiment with full configuration tracking."""
    # Hydra automatically saves the full resolved config
    print(OmegaConf.to_yaml(cfg))

    # Access nested config values
    model_name = cfg.model.model_name
    temperature = cfg.model.temperature
    top_k = cfg.retrieval.top_k

    # Run evaluation with tracked configuration
    results = evaluate_pipeline(cfg)
    save_results(results, cfg)

def evaluate_pipeline(cfg: DictConfig) -> dict:
    """Run the evaluation pipeline with the given config."""
    # Pipeline implementation using config values...
    return {"accuracy": 0.847, "faithfulness": 0.91}

def save_results(results: dict, cfg: DictConfig):
    """Save results alongside the full config for reproducibility."""
    output = {
        "config": OmegaConf.to_container(cfg, resolve=True),
        "results": results,
    }
    with open("experiment_results.json", "w") as f:
        json.dump(output, f, indent=2)

if __name__ == "__main__":
    run_experiment()

# Run with overrides from command line:
# python experiment.py model=gpt4o_mini retrieval.top_k=20
Code Fragment 30.3.2: Running an LLM experiment through Hydra so the fully resolved configuration is archived with every run.
Hydra Output Directory

Hydra automatically creates a timestamped output directory for each experiment run, containing the fully resolved configuration, any output files, and logs. This makes every run self-documenting. Combined with git commit tracking, this gives you everything needed to reproduce any past experiment: the code (git SHA), the configuration (Hydra output), and the results.
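The resulting layout looks roughly like this (timestamps and file names illustrative):

```text
outputs/
└── 2024-08-15/
    └── 10-32-01/
        ├── .hydra/
        │   ├── config.yaml      # fully resolved configuration for this run
        │   ├── hydra.yaml       # Hydra's own settings
        │   └── overrides.yaml   # command-line overrides applied to this run
        ├── experiment.log
        └── experiment_results.json
```

To replay a run, load .hydra/config.yaml from the archived directory and pass it back through the same entry point.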

3. Dataset Versioning with DVC

DVC (Data Version Control) extends git to handle large files and datasets. For LLM experiments, DVC tracks evaluation datasets, knowledge base snapshots, and embedding indexes. By storing lightweight pointer files in git and the actual data in cloud storage (S3, GCS, Azure Blob), DVC provides versioning without bloating the git repository. Code Fragment 30.3.3 below puts this into practice.

# Initialize DVC in your project
$ dvc init
$ dvc remote add -d storage s3://my-bucket/llm-experiments

# Track evaluation dataset
$ dvc add data/eval_dataset_v3.jsonl
$ git add data/eval_dataset_v3.jsonl.dvc data/.gitignore
$ git commit -m "Track eval dataset v3"

# Track knowledge base snapshot
$ dvc add data/knowledge_base/
$ git add data/knowledge_base.dvc
$ git commit -m "Track knowledge base snapshot 2024-Q3"

# Push data to remote storage
$ dvc push

# Reproduce exact data state from any git commit
$ git checkout experiment-2024-08-15
$ dvc checkout # restores data files to match that commit
Code Fragment 30.3.3: Version-controlling datasets and model artifacts with DVC so every experiment is fully reproducible.
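For reference, the pointer file that DVC commits to git is a small YAML stub along these lines (hash and size values illustrative):

```yaml
# data/eval_dataset_v3.jsonl.dvc -- the lightweight pointer tracked by git
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 4183210
  path: eval_dataset_v3.jsonl
```

Committing a few lines like these is what lets dvc checkout restore the multi-gigabyte original from remote storage at any past commit.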

4. Experiment Tracking with MLflow and W&B

Experiment tracking platforms record every run with its configuration, metrics, artifacts, and metadata. They provide dashboards for comparing runs, visualizing trends, and identifying the best configurations. Both MLflow (open source, self-hostable) and Weights & Biases (cloud-based, with a free tier) are widely used in the LLM community. These platforms complement Section 30.1 by focusing on offline experiment comparison rather than real-time production monitoring. Code Fragment 30.3.4 below puts this into practice.


# Track an LLM experiment run with MLflow.
import subprocess

import mlflow
from omegaconf import DictConfig, OmegaConf

def get_git_sha() -> str:
    """Return the current git commit SHA for traceability."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def track_experiment_mlflow(cfg: DictConfig, results: dict):
    """Track an LLM experiment run with MLflow."""
    mlflow.set_experiment(cfg.experiment.name)

    with mlflow.start_run():
        # Log the key configuration parameters
        mlflow.log_params({
            "model": cfg.model.model_name,
            "temperature": cfg.model.temperature,
            "top_k": cfg.retrieval.top_k,
            "reranker": cfg.retrieval.reranker,
            "seed": cfg.experiment.seed,
        })

        # Log evaluation metrics
        mlflow.log_metrics({
            "accuracy": results["accuracy"],
            "faithfulness": results["faithfulness"],
            "latency_p50_ms": results["latency_p50"],
            "cost_per_query_usd": results["cost_per_query"],
        })

        # Log prompt template as artifact
        mlflow.log_text(cfg.prompt.template, "prompt_template.txt")

        # Log full config
        mlflow.log_text(OmegaConf.to_yaml(cfg), "full_config.yaml")

        # Tag the run for easy filtering
        mlflow.set_tags({
            "experiment_type": "ablation",
            "git_sha": get_git_sha(),
        })
Code Fragment 30.3.4: Tracking LLM experiment parameters, metrics, and artifacts with MLflow for comparison across runs and model registry integration.

Experiment Tracking Platform Comparison

Feature          | MLflow              | Weights & Biases   | DVC (with Studio)
Open source      | Yes (fully)         | No (cloud service) | Yes (core), paid UI
Self-hosting     | Yes                 | Enterprise only    | Yes
Prompt tracking  | As artifacts        | Tables + artifacts | Via params/files
Comparison UI    | Good                | Excellent          | Good (Studio)
Data versioning  | Artifacts (limited) | Artifacts + tables | Native (core strength)
Cost tracking    | Custom metrics      | Custom metrics     | Custom metrics
Library Shortcut: Weights and Biases in Practice

Track LLM experiment parameters, prompts, and evaluation scores with W&B for visual comparison.

# pip install wandb
import wandb

wandb.init(project="llm-experiments", name="prompt_v3_gpt4o")
wandb.config.update({
    "model": "gpt-4o",
    "temperature": 0.2,
    "prompt_version": "v3",
    "eval_dataset": "support_tickets_v2",
})

# Log evaluation metrics
wandb.log({
    "faithfulness": 0.89,
    "answer_relevancy": 0.92,
    "latency_p95_ms": 1420,
    "cost_per_request_usd": 0.018,
})

# Log the prompt template as an artifact
artifact = wandb.Artifact("prompt_template", type="prompt")
artifact.add_file("prompts/support_summary_v3.txt")
wandb.log_artifact(artifact)
wandb.finish()
Code Fragment 30.3.5: Logging experiment configuration, evaluation metrics, and the prompt template artifact with Weights & Biases.
Key Insight

The choice between MLflow and W&B often comes down to infrastructure preferences. Choose MLflow when self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already using the MLflow ecosystem. Choose W&B when you want the best visualization and collaboration UI, when cloud hosting is acceptable, or when you value features like report generation and team dashboards. Both integrate well with Hydra and DVC.

5. Containerized Reproducibility with Docker

Docker containers provide the ultimate reproducibility guarantee for the code and infrastructure layers. A Dockerfile that pins every dependency version, combined with versioned data (DVC) and configuration (Hydra), enables exact reproduction of any experiment on any machine. Figure 30.3.2 combines all elements into a complete reproducibility workflow.

# Dockerfile for reproducible LLM experiments
FROM python:3.11-slim

# Pin system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
 git curl && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install pinned Python dependencies first (cache-friendly)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Record build metadata for traceability
ARG GIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_DATE=${BUILD_DATE}

# Default command: run experiment with Hydra
ENTRYPOINT ["python", "experiment.py"]
Code Fragment 30.3.6: Containerizing LLM experiments with Docker to ensure consistent environments across development and production.

Code Fragment 30.3.7 below shows how to build and run the container with configuration overrides and metadata tags.

# Build with metadata
$ docker build \
 --build-arg GIT_SHA=$(git rev-parse HEAD) \
 --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
 -t llm-experiment:v1.2.0 .

# Run experiment with config overrides
$ docker run \
 -e OPENAI_API_KEY=$OPENAI_API_KEY \
 -v $(pwd)/data:/app/data \
 -v $(pwd)/outputs:/app/outputs \
 llm-experiment:v1.2.0 \
 model=gpt4o retrieval.top_k=5 experiment.seed=42
Code Fragment 30.3.7: Building the Docker image with Git SHA and date metadata baked in, then launching an experiment run with Hydra config overrides passed at the command line.
Hydra Config (versioned YAML) → Git + DVC (code + data) → Docker Build (pinned deps) → Run Experiment (deterministic) → MLflow / W&B (track results). Each step captures a reproducibility dimension: Hydra = parameters, Git = code, DVC = data, Docker = environment, MLflow = results.
Figure 30.3.2: Complete reproducibility workflow combining configuration, code, data, environment, and result tracking.
API-Based Models Break Full Reproducibility

When using API-based models (OpenAI, Anthropic, Google), you can never achieve full bit-level reproducibility because the provider controls the inference hardware and may change it between requests. The best you can do is pin the model version, set temperature to 0, provide a seed, and log the system_fingerprint. For experiments requiring guaranteed reproducibility, use locally hosted open-weight models where you control the entire stack.

Self-Check

1. List five layers of the LLM reproducibility stack and explain what must be versioned at each layer.

Show Answer
(1) Prompt layer: system prompts, user templates, few-shot examples, tool descriptions must be versioned in files tracked by git. (2) Model layer: provider name, model version (with date suffix), temperature, seed, max_tokens must be captured in configuration files. (3) Data layer: evaluation datasets and knowledge base contents must be versioned with DVC or similar tools. (4) Code layer: application code (git SHA), library versions (requirements.txt) must be committed and pinned. (5) Infrastructure layer: Docker image with pinned base image, API endpoint configuration, and hardware specification for local models.

2. How does Hydra improve reproducibility compared to using command-line arguments or environment variables?

Show Answer
Hydra automatically saves the complete, fully resolved configuration for every experiment run to a timestamped output directory. This means every parameter value is recorded without any manual effort. Hydra also supports composition (combining multiple config files), type validation, and structured overrides. Command-line arguments and environment variables, by contrast, are ephemeral and not automatically recorded. You would need to manually log them, which is error-prone and often forgotten. Hydra makes configuration explicit, versioned, and automatically archived.

3. Why does DVC store pointer files in git rather than the actual data?

Show Answer
Large datasets and embedding indexes can be gigabytes or more, which would bloat the git repository, slow down cloning, and exceed GitHub's file size limits. DVC stores small pointer files (containing a hash of the data) in git and the actual data in remote storage (S3, GCS, etc.). When you check out a specific git commit and run dvc checkout, DVC uses the pointer file to download the exact data version from remote storage. This gives you git-like versioning semantics for large files without the storage overhead.

4. When would you choose MLflow over Weights & Biases for experiment tracking?

Show Answer
Choose MLflow when: (1) you require self-hosted infrastructure for data privacy or compliance reasons, (2) you want a fully open-source solution with no vendor lock-in, (3) your team already uses MLflow for traditional ML experiments, or (4) you need to integrate with tools in the MLflow ecosystem (model registry, deployment). W&B is preferred when you want the best visualization UI, team collaboration features, or managed cloud infrastructure.

5. Why can you not achieve full bit-level reproducibility with API-based LLM providers?

Show Answer
API providers control the inference infrastructure, including GPU type, parallelization strategy, and floating-point computation order. Even with temperature=0 and a fixed seed, floating-point arithmetic on GPUs is not perfectly deterministic because operations like matrix multiplication can be parallelized in different orders that produce slightly different rounding results. Additionally, providers may silently change the hardware or software stack between requests. The system_fingerprint field helps detect such changes, but cannot prevent them. For guaranteed reproducibility, you must host the model locally and control the entire inference stack.
Real-World Scenario: Reproducing a Winning Experiment Configuration Three Months Later

Who: Research engineering team at an NLP startup that had found a promising prompt and model configuration during a week of rapid experimentation

Situation: Three months after the initial experiments, the team needed to reproduce their best-performing configuration as a baseline for a new project. The original researcher had left the company.

Problem: Experiment notes were scattered across Slack messages, Jupyter notebooks, and a shared spreadsheet. The team could not determine the exact prompt version, model parameters, evaluation dataset version, or Python dependencies used in the winning run.

Dilemma: Investing in reproducibility infrastructure felt like overhead during rapid experimentation, but the cost of not having it was now weeks of wasted effort trying to recreate results.

Decision: After rebuilding the baseline (at significant cost), the team implemented a full reproducibility stack: Hydra for configuration, DVC for data versioning, MLflow for experiment tracking, and Docker for environment pinning.

How: Every experiment run was defined by a Hydra YAML config that captured the complete parameter set (prompt template, model name, temperature, seed, evaluation dataset path). DVC tracked the evaluation dataset and knowledge base versions alongside git commits. MLflow logged all configs, metrics, and output artifacts. A Docker image pinned all Python dependencies.

Result: Six months after implementation, a similar situation arose: a team member needed to reproduce a 4-month-old experiment. They found the MLflow run, checked out the associated git commit (which included Hydra config and DVC data pointers), built the Docker image, and reproduced the results within 0.3% of the original metrics in under 30 minutes.

Lesson: The upfront cost of reproducibility infrastructure (Hydra, DVC, MLflow, Docker) is a fraction of the cost of even one failed attempt to reproduce important results months later.

Tip: Monitor Latency by Percentile, Not Average

Report p50, p90, p95, and p99 latencies separately. A healthy p50 can hide a terrible p99, and it is the slowest requests that generate the most user complaints and support tickets.
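A quick nearest-rank percentile sketch (sample values illustrative) shows how dramatically p50 and p99 can diverge:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value at or above p percent of samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [120, 130, 125, 140, 135, 128, 3900, 132, 127, 131]
for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# p50 is 130 ms, but p95 and p99 are both 3900 ms: the mean (~507 ms) hides the tail.
```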

Research Frontier


Explore Further: Set up a Hydra + DVC + MLflow stack for a simple LLM evaluation experiment, then have a colleague reproduce your results using only the logged artifacts. Measure how close their results come.

Exercises

Exercise 30.3.1: Reproducibility Challenges Conceptual

List five factors that make LLM experiments harder to reproduce than traditional ML experiments. For each factor, explain whether it is under the experimenter's control.

Answer Sketch

(1) Provider-side model updates (not under control). (2) Non-deterministic GPU computation (partially controllable with seeds). (3) System prompt versioning (under control if you version prompts). (4) API behavior changes and safety filters (not under control). (5) Evaluation dataset contamination in training data (not under control for API models). The key insight is that many irreproducibility sources are outside the experimenter's control, making documentation and logging even more critical.

Exercise 30.3.2: Experiment Logging Coding

Write a Python class ExperimentLogger that captures all parameters needed to reproduce an LLM experiment: model name, version, temperature, seed, system prompt hash, evaluation dataset hash, library versions, and timestamp. It should save to a JSON file and support loading for reproduction.

Answer Sketch

The class stores parameters in a dictionary. Use hashlib.sha256 to hash prompt text and dataset contents. Use pkg_resources or importlib.metadata to capture library versions. The save() method writes JSON with json.dump. The load() classmethod reads it back. Include a verify() method that checks whether the current environment matches the logged configuration and warns about discrepancies.
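A compressed sketch along these lines (field set abbreviated; a complete solution would also capture library versions and the dataset hash, and add the verify() method):

```python
import datetime
import hashlib
import json
import sys

class ExperimentLogger:
    """Minimal reproducibility record for an LLM experiment (fields illustrative)."""

    def __init__(self, model: str, temperature: float, seed: int, prompt: str):
        self.record = {
            "model": model,
            "temperature": temperature,
            "seed": seed,
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "python_version": sys.version.split()[0],
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    def save(self, path: str) -> None:
        """Write the record to a JSON file."""
        with open(path, "w") as f:
            json.dump(self.record, f, indent=2)

    @classmethod
    def load(cls, path: str) -> dict:
        """Read a saved record back for reproduction."""
        with open(path) as f:
            return json.load(f)
```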

Exercise 30.3.3: Version Pinning Strategy Analysis

Your team uses GPT-4o through the API. Should you pin to a specific dated version (e.g., "gpt-4o-2024-08-06") or use the latest alias ("gpt-4o")? Analyze the tradeoffs for a production system vs. a research project.

Answer Sketch

Production: always pin to a dated version. Tradeoffs: you get stability and reproducibility, but miss automatic improvements and must manually update when the version is deprecated. Research: using the latest alias is acceptable for initial exploration but pin for final experiments. Tradeoffs: you get the newest capabilities automatically, but results are not reproducible across time. Best practice: pin in production, test new versions in staging, and promote after evaluation passes.

Exercise 30.3.4: Containerized Experiments Conceptual

Explain how Docker containers help with LLM experiment reproducibility. What aspects of reproducibility do containers address, and what aspects do they not address?

Answer Sketch

Containers address: OS environment, library versions, Python version, configuration files, and local model weights. They guarantee the code layer and infrastructure layer are identical. Containers do NOT address: API-side model changes (the container calls an external API that may change), non-deterministic GPU behavior, evaluation dataset updates stored outside the container, or rate limiting differences. For full reproducibility, combine containers with model version pinning, dataset versioning, and response caching.

Exercise 30.3.5: Response Caching for Reproducibility Coding

Design a response caching system that stores LLM API responses keyed by (model, prompt_hash, parameters_hash). Explain how this enables exact reproducibility and what the storage and invalidation tradeoffs are.

Answer Sketch

Key: SHA-256 hash of (model_name + model_version + prompt_text + json(parameters)). Value: full API response including tokens, finish_reason, and metadata. Storage: SQLite for local development, Redis or S3 for shared environments. Invalidation: never invalidate for reproducibility; instead, include the cache version in the key. Tradeoffs: storage grows linearly with unique requests (manageable for evaluations, problematic for production traffic). Provides exact reproducibility for any cached request regardless of provider changes.

What Comes Next

In the next section, Section 30.4: Arena-Style and Crowdsourced Evaluation, we explore arena-style evaluation methods that use real user preferences and pairwise comparisons to produce more trustworthy model rankings than static benchmarks.

Bibliography

Configuration Management

Yadan, O. (2019). "Hydra: A Framework for Elegantly Configuring Complex Applications." https://hydra.cc/

Documentation for Meta's configuration management framework that enables hierarchical configs, command-line overrides, and experiment sweeps. The standard tool for managing LLM experiment configurations across models, prompts, and hyperparameters. Essential for any team running systematic LLM experiments.
Data Versioning

Iterative. (2024). "DVC: Data Version Control." https://dvc.org/

Documentation for the Git-based data versioning tool that tracks datasets, models, and pipeline stages. Enables reproducible data pipelines where every experiment links to specific data versions. Recommended for teams managing evaluation datasets and model artifacts.
Experiment Tracking

Zaharia, M., Chen, A., Davidson, A., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." https://mlflow.org/

Introduces the MLflow platform for experiment tracking, model packaging, and deployment. Covers the tracking API, model registry, and integration with popular ML frameworks. Important for teams needing a self-hosted experiment tracking solution for LLM projects.

Biewald, L. (2020). "Experiment Tracking with Weights and Biases." https://wandb.ai/

Documentation for the hosted experiment tracking platform with rich visualization, artifact management, and team collaboration features. Covers the integration patterns for logging LLM experiments, prompt versions, and evaluation results. Useful for teams wanting managed experiment tracking with minimal setup.
Reproducibility Standards

Pineau, J., Vincent-Lamarre, P., Sinha, K., et al. (2021). "Improving Reproducibility in Machine Learning Research." arXiv:2003.12206

The NeurIPS reproducibility checklist paper that established community standards for reporting ML experiments. Covers what information must be included for reproducibility: code, data, hyperparameters, compute resources, and statistical significance. Critical reading for researchers publishing LLM evaluation results.