"If you cannot reproduce it, you did not discover it. You just got lucky once."
Sentinel, Experimentally Rigorous AI Agent
Reproducibility in LLM experiments is harder than in traditional ML, and also more important. Traditional ML experiments depend on data, code, and hyperparameters. LLM experiments add several new dimensions: prompt templates, provider API versions, retrieval configurations, tool definitions, and external service behaviors. When you cannot reproduce an experiment, you cannot trust its results, compare it fairly against alternatives, or debug regressions. This section covers the tools and practices that make LLM experiments reproducible, from prompt versioning through containerized execution environments. The experimental design principles from Section 30.2 define what "reproducible" means in the LLM context.
Prerequisites
This section requires the evaluation and observability concepts from Sections 30.1 and 30.2. Familiarity with agent systems from Section 22.1 and multi-agent patterns from Section 24.1 provides context for evaluating complex agentic applications.
1. Why LLM Reproducibility Is Hard
LLM experiments face reproducibility challenges that do not exist in traditional machine learning. Even with identical code, data, and configuration, you may get different results because of factors outside your control: provider-side model updates, non-deterministic GPU computation, changing API behaviors, and evolving safety filters.
A 2024 survey found that fewer than 15% of published LLM papers provided enough detail to reproduce their main results. The most common missing ingredient? The exact system prompt, which researchers often treat like a family recipe: shared reluctantly, if at all.
The LLM Reproducibility Stack
Version your prompts in a dedicated file (not inline in code) and track them in git alongside your application code. When a regression surfaces weeks later, git blame on your prompt file is often the fastest path to the root cause. Treat prompt changes with the same rigor as database migration scripts.
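One lightweight way to follow this practice is to load prompts from versioned files and log a content hash with every run. The sketch below assumes a `prompts/` directory in your repository; the helper name and file path are illustrative:

```python
import hashlib
from pathlib import Path


def load_prompt(path: str) -> tuple[str, str]:
    """Load a prompt template from a versioned file and return (text, sha256).

    Logging the hash alongside each run lets you confirm later exactly
    which prompt revision produced a given result, even if the file
    was edited afterward.
    """
    text = Path(path).read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


# Example usage:
# prompt, prompt_sha = load_prompt("prompts/rag_system_v2.txt")
# run_metadata["prompt_sha"] = prompt_sha
```

Because the hash is derived from the file contents rather than the filename, it also catches the common failure mode where two runs reference the "same" prompt file that was silently edited in between.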
To fully reproduce an LLM experiment, you must version and capture every layer of the stack. Figure 30.3.1 shows the five layers that must be versioned:
- Prompt layer: System prompts, user prompt templates, few-shot examples, tool descriptions
- Model layer: Provider, model name, version (date suffix), temperature, seed, max tokens
- Data layer: Evaluation dataset, knowledge base contents, embedding model version
- Code layer: Application code, library versions, configuration files
- Infrastructure layer: API endpoint, region, hardware (for local models), container image
The practical implication is that reproducibility in LLM work is a spectrum, not a binary. At one end, you can reproduce the exact same evaluation run with the same cached model responses. At the other end, you can only reproduce the experimental methodology while accepting that outputs will differ. Most production teams operate somewhere in the middle: pinning model versions, versioning prompts and configs, and using statistical testing (from Section 29.2) to determine whether observed differences reflect real changes or noise.
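The five layers above can be captured concretely in a small run manifest. The sketch below is one possible shape, not a standard; the field names and example values are illustrative:

```python
from dataclasses import asdict, dataclass
import json


@dataclass
class RunManifest:
    """One identifier per layer of the LLM reproducibility stack."""
    prompt_version: str    # prompt layer: e.g. file path + content hash
    model_id: str          # model layer: provider model with date suffix
    dataset_version: str   # data layer: eval set / knowledge base snapshot
    code_sha: str          # code layer: git commit of the application
    image_tag: str         # infrastructure layer: container image tag

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


manifest = RunManifest(
    prompt_version="prompts/rag_v2.txt@9f2c1ab",
    model_id="gpt-4o-2024-08-06",
    dataset_version="eval_dataset_v3",
    code_sha="a1b2c3d",
    image_tag="llm-experiment:v1.2.0",
)
```

Storing one such record per run, next to the results, is the minimum needed to answer "what exactly produced this number?" months later.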
2. Configuration Management with Hydra
Hydra is a configuration framework that enables composable, hierarchical configuration with overrides from the command line. It is particularly useful for LLM experiments because it can manage the many interacting parameters (model settings, prompt templates, retrieval parameters, evaluation settings) in a structured, versioned way. Code Fragment 30.3.3 below puts this into practice.
```yaml
# config/experiment.yaml
defaults:
  - model: gpt4o
  - prompt: rag_v2
  - retrieval: dense_rerank
  - eval: standard

experiment:
  name: rag_ablation_cot
  seed: 42
  num_eval_seeds: 5
```

```yaml
# config/model/gpt4o.yaml
provider: openai
model_name: gpt-4o-2024-08-06
temperature: 0.0
max_tokens: 1024
seed: 42
```

```yaml
# config/retrieval/dense_rerank.yaml
embedding_model: text-embedding-3-small
top_k: 10
reranker: cohere-rerank-v3
rerank_top_n: 3
chunk_size: 512
chunk_overlap: 50
```
```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="experiment")
def run_experiment(cfg: DictConfig):
    """Run an LLM experiment with full configuration tracking."""
    # Hydra automatically saves the full resolved config
    print(OmegaConf.to_yaml(cfg))

    # Access nested config values
    model_name = cfg.model.model_name
    temperature = cfg.model.temperature
    top_k = cfg.retrieval.top_k

    # Run evaluation with tracked configuration
    results = evaluate_pipeline(cfg)
    save_results(results, cfg)


def evaluate_pipeline(cfg: DictConfig) -> dict:
    """Run the evaluation pipeline with the given config."""
    # Pipeline implementation using config values...
    return {"accuracy": 0.847, "faithfulness": 0.91}


def save_results(results: dict, cfg: DictConfig):
    """Save results alongside the full config for reproducibility."""
    import json

    output = {
        "config": OmegaConf.to_container(cfg, resolve=True),
        "results": results,
    }
    with open("experiment_results.json", "w") as f:
        json.dump(output, f, indent=2)


if __name__ == "__main__":
    # Run with overrides from the command line:
    #   python experiment.py model=gpt4o_mini retrieval.top_k=20
    run_experiment()
```
Hydra automatically creates a timestamped output directory for each experiment run, containing the fully resolved configuration, any output files, and logs. This makes every run self-documenting. Combined with git commit tracking, this gives you everything needed to reproduce any past experiment: the code (git SHA), the configuration (Hydra output), and the results.
3. Dataset Versioning with DVC
DVC (Data Version Control) extends git to handle large files and datasets. For LLM experiments, DVC tracks evaluation datasets, knowledge base snapshots, and embedding indexes. By storing lightweight pointer files in git and the actual data in cloud storage (S3, GCS, Azure Blob), DVC provides versioning without bloating the git repository. The shell session below puts this into practice.
```bash
# Initialize DVC in your project
$ dvc init
$ dvc remote add -d storage s3://my-bucket/llm-experiments

# Track evaluation dataset
$ dvc add data/eval_dataset_v3.jsonl
$ git add data/eval_dataset_v3.jsonl.dvc data/.gitignore
$ git commit -m "Track eval dataset v3"

# Track knowledge base snapshot
$ dvc add data/knowledge_base/
$ git add data/knowledge_base.dvc
$ git commit -m "Track knowledge base snapshot 2024-Q3"

# Push data to remote storage
$ dvc push

# Reproduce exact data state from any git commit
$ git checkout experiment-2024-08-15
$ dvc checkout   # restores data files to match that commit
```
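The pointer file that `dvc add` commits to git is a small YAML stub containing a content hash, size, and path. A representative example (the hash and size values here are illustrative):

```yaml
# data/eval_dataset_v3.jsonl.dvc  (committed to git)
outs:
- md5: 3f8e2a1b9c7d6e5f4a3b2c1d0e9f8a7b
  size: 1048576
  path: eval_dataset_v3.jsonl
```

Because the pointer is content-addressed, checking out any old commit and running `dvc checkout` retrieves exactly the bytes that commit referenced.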
4. Experiment Tracking with MLflow and W&B
Experiment tracking platforms record every run with its configuration, metrics, artifacts, and metadata. They provide dashboards for comparing runs, visualizing trends, and identifying the best configurations. Both MLflow (open source, self-hostable) and Weights & Biases (cloud-based, with a free tier) are widely used in the LLM community. These platforms complement Section 30.1 by focusing on offline experiment comparison rather than real-time production monitoring. Code Fragment 30.3.6 below puts this into practice.
```python
import subprocess

import mlflow
from omegaconf import DictConfig, OmegaConf


def get_git_sha() -> str:
    """Return the current git commit SHA for run tagging."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def track_experiment_mlflow(cfg: DictConfig, results: dict):
    """Track an LLM experiment run with MLflow."""
    mlflow.set_experiment(cfg.experiment.name)

    with mlflow.start_run():
        # Log the key configuration parameters
        mlflow.log_params({
            "model": cfg.model.model_name,
            "temperature": cfg.model.temperature,
            "top_k": cfg.retrieval.top_k,
            "reranker": cfg.retrieval.reranker,
            "seed": cfg.experiment.seed,
        })

        # Log evaluation metrics
        mlflow.log_metrics({
            "accuracy": results["accuracy"],
            "faithfulness": results["faithfulness"],
            "latency_p50_ms": results["latency_p50"],
            "cost_per_query_usd": results["cost_per_query"],
        })

        # Log prompt template as an artifact
        mlflow.log_text(cfg.prompt.template, "prompt_template.txt")

        # Log the full resolved config
        mlflow.log_text(OmegaConf.to_yaml(cfg), "full_config.yaml")

        # Tag the run for easy filtering
        mlflow.set_tags({
            "experiment_type": "ablation",
            "git_sha": get_git_sha(),
        })
```
Experiment Tracking Platform Comparison
| Feature | MLflow | Weights & Biases | DVC (with Studio) |
|---|---|---|---|
| Open source | Yes (fully) | No (cloud service) | Yes (core), paid UI |
| Self-hosting | Yes | Enterprise only | Yes |
| Prompt tracking | As artifacts | Tables + artifacts | Via params/files |
| Comparison UI | Good | Excellent | Good (Studio) |
| Data versioning | Artifacts (limited) | Artifacts + tables | Native (core strength) |
| Cost tracking | Custom metrics | Custom metrics | Custom metrics |
Track LLM experiment parameters, prompts, and evaluation scores with W&B for visual comparison.
```python
# pip install wandb
import wandb

wandb.init(project="llm-experiments", name="prompt_v3_gpt4o")
wandb.config.update({
    "model": "gpt-4o",
    "temperature": 0.2,
    "prompt_version": "v3",
    "eval_dataset": "support_tickets_v2",
})

# Log evaluation metrics
wandb.log({
    "faithfulness": 0.89,
    "answer_relevancy": 0.92,
    "latency_p95_ms": 1420,
    "cost_per_request_usd": 0.018,
})

# Log the prompt template as an artifact
artifact = wandb.Artifact("prompt_template", type="prompt")
artifact.add_file("prompts/support_summary_v3.txt")
wandb.log_artifact(artifact)

wandb.finish()
```
The choice between MLflow and W&B often comes down to infrastructure preferences. Choose MLflow when self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already using the MLflow ecosystem. Choose W&B when you want the best visualization and collaboration UI, when cloud hosting is acceptable, or when you value features like report generation and team dashboards. Both integrate well with Hydra and DVC.
5. Containerized Reproducibility with Docker
Docker containers provide the ultimate reproducibility guarantee for the code and infrastructure layers. A Dockerfile that pins every dependency version, combined with versioned data (DVC) and configuration (Hydra), enables exact reproduction of any experiment on any machine. Figure 30.3.2 combines all elements into a complete reproducibility workflow.
```dockerfile
# Dockerfile for reproducible LLM experiments
FROM python:3.11-slim

# Pin system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install pinned Python dependencies first (cache-friendly)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Record build metadata for traceability
ARG GIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_DATE=${BUILD_DATE}

# Default command: run experiment with Hydra
ENTRYPOINT ["python", "experiment.py"]
```
Code Fragment 30.3.7 below shows how to build and run the container with configuration overrides and metadata tags.
```bash
# Build with metadata
$ docker build \
    --build-arg GIT_SHA=$(git rev-parse HEAD) \
    --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
    -t llm-experiment:v1.2.0 .

# Run experiment with config overrides
$ docker run \
    -e OPENAI_API_KEY=$OPENAI_API_KEY \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/outputs:/app/outputs \
    llm-experiment:v1.2.0 \
    model=gpt4o retrieval.top_k=5 experiment.seed=42
```
When using API-based models (OpenAI, Anthropic, Google), you can never achieve full bit-level reproducibility because the provider controls the inference hardware and may change it between requests. The best you can do is pin the model version, set temperature to 0, provide a seed, and log the system_fingerprint. For experiments requiring guaranteed reproducibility, use locally hosted open-weight models where you control the entire stack.
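A common mitigation is to cache responses keyed by everything that influences the output, so an evaluation can be replayed exactly from the cache regardless of later provider changes. A minimal key-construction sketch (the field set is an assumption; extend it with whatever your pipeline varies):

```python
import hashlib
import json


def response_cache_key(model: str, prompt: str, params: dict) -> str:
    """Build a deterministic cache key for an LLM request.

    Hashing a canonical JSON serialization means the same request
    always maps to the same key, so a cached response can be replayed
    exactly in later evaluation runs.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,       # key order must not affect the hash
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = response_cache_key(
    "gpt-4o-2024-08-06",
    "Summarize the ticket below.",
    {"temperature": 0.0, "seed": 42, "max_tokens": 1024},
)
```

Note that the model string must include the dated version; caching against a floating alias like `gpt-4o` silently mixes responses from different underlying models.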
1. List five layers of the LLM reproducibility stack and explain what must be versioned at each layer.

Answer: The prompt layer (system prompts, user prompt templates, few-shot examples, tool descriptions), the model layer (provider, model name with dated version, temperature, seed, max tokens), the data layer (evaluation dataset, knowledge base contents, embedding model version), the code layer (application code, library versions, configuration files), and the infrastructure layer (API endpoint, region, hardware for local models, container image).

2. How does Hydra improve reproducibility compared to using command-line arguments or environment variables?

Answer: Hydra composes hierarchical configuration from versioned YAML files and automatically archives the fully resolved configuration of every run in a timestamped output directory, so the complete parameter set is recorded without manual effort. Ad hoc flags and environment variables leave no such record.

3. Why does DVC store pointer files in git rather than the actual data?

Answer: Large data files would bloat the git repository and slow every clone. DVC commits only a small pointer file while the data lives in remote storage. When you run dvc checkout, DVC uses the pointer file to download the exact data version from remote storage. This gives you git-like versioning semantics for large files without the storage overhead.

4. When would you choose MLflow over Weights & Biases for experiment tracking?

Answer: When self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already invested in the MLflow ecosystem.

5. Why can you not achieve full bit-level reproducibility with API-based LLM providers?

Answer: The provider controls the inference hardware and serving stack and may change either between requests, and GPU computation is itself non-deterministic. The system_fingerprint field helps detect such changes, but cannot prevent them. For guaranteed reproducibility, you must host the model locally and control the entire inference stack.

Case Study

Who: Research engineering team at an NLP startup that had found a promising prompt and model configuration during a week of rapid experimentation
Situation: Three months after the initial experiments, the team needed to reproduce their best-performing configuration as a baseline for a new project. The original researcher had left the company.
Problem: Experiment notes were scattered across Slack messages, Jupyter notebooks, and a shared spreadsheet. The team could not determine the exact prompt version, model parameters, evaluation dataset version, or Python dependencies used in the winning run.
Dilemma: Investing in reproducibility infrastructure felt like overhead during rapid experimentation, but the cost of not having it was now weeks of wasted effort trying to recreate results.
Decision: After rebuilding the baseline (at significant cost), the team implemented a full reproducibility stack: Hydra for configuration, DVC for data versioning, MLflow for experiment tracking, and Docker for environment pinning.
How: Every experiment run was defined by a Hydra YAML config that captured the complete parameter set (prompt template, model name, temperature, seed, evaluation dataset path). DVC tracked the evaluation dataset and knowledge base versions alongside git commits. MLflow logged all configs, metrics, and output artifacts. A Docker image pinned all Python dependencies.
Result: Six months after implementation, a similar situation arose: a team member needed to reproduce a 4-month-old experiment. They found the MLflow run, checked out the associated git commit (which included Hydra config and DVC data pointers), built the Docker image, and reproduced the results within 0.3% of the original metrics in under 30 minutes.
Lesson: The upfront cost of reproducibility infrastructure (Hydra, DVC, MLflow, Docker) is a fraction of the cost of even one failed attempt to reproduce important results months later.
- Version every layer of the stack. Prompts, model configuration, data, code, and infrastructure all need explicit versioning. Missing any one layer makes the experiment only partially reproducible.
- Use Hydra for configuration management. Hydra provides composable, hierarchical configuration with automatic archiving of every experiment's complete parameter set. This eliminates the most common source of "what settings did I use?" confusion.
- Version large data with DVC. Evaluation datasets, knowledge bases, and embedding indexes should be tracked with DVC to maintain full data lineage alongside code changes.
- Track experiments with MLflow or W&B. Record every run's configuration, metrics, and artifacts in a central platform. This enables comparison across runs and makes it easy to identify the best configuration.
- Containerize for environment reproducibility. Docker containers pin all dependencies and eliminate "works on my machine" problems. Combined with Hydra configs and DVC data, containers complete the reproducibility chain.
- Accept partial reproducibility with API models. When using hosted APIs, pin versions, set seeds, and log system fingerprints. For full control, use locally hosted open-weight models.
Open Questions in LLM Reproducibility (2024-2026):
- Reproducibility across providers: Can experiments run on one provider's infrastructure (e.g., Azure OpenAI) be meaningfully reproduced on another (e.g., direct OpenAI API)? Provider-specific optimizations create subtle behavioral differences even for the same model weights.
- Prompt version migration: When model versions change, prompts often need updating. Automated prompt adaptation tools that preserve behavioral equivalence across model versions are an active research area.
- Reproducibility reporting standards: The NeurIPS reproducibility checklist is being extended for LLM-specific experiments, requiring disclosure of prompt templates, API versions, system fingerprints, and evaluation seeds alongside traditional ML reproducibility requirements.
Explore Further: Set up a Hydra + DVC + MLflow stack for a simple LLM evaluation experiment, then have a colleague reproduce your results using only the logged artifacts. Measure how close their results come.
Exercises
List five factors that make LLM experiments harder to reproduce than traditional ML experiments. For each factor, explain whether it is under the experimenter's control.
Answer Sketch
(1) Provider-side model updates (not under control). (2) Non-deterministic GPU computation (partially controllable with seeds). (3) System prompt versioning (under control if you version prompts). (4) API behavior changes and safety filters (not under control). (5) Evaluation dataset contamination in training data (not under control for API models). The key insight is that many irreproducibility sources are outside the experimenter's control, making documentation and logging even more critical.
Write a Python class ExperimentLogger that captures all parameters needed to reproduce an LLM experiment: model name, version, temperature, seed, system prompt hash, evaluation dataset hash, library versions, and timestamp. It should save to a JSON file and support loading for reproduction.
Answer Sketch
The class stores parameters in a dictionary. Use hashlib.sha256 to hash prompt text and dataset contents. Use pkg_resources or importlib.metadata to capture library versions. The save() method writes JSON with json.dump. The load() classmethod reads it back. Include a verify() method that checks whether the current environment matches the logged configuration and warns about discrepancies.
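A minimal sketch of such a class follows. The field set matches the exercise; library-version capture via `importlib.metadata` is omitted for brevity, and `verify()` here only compares stored values rather than re-probing the environment:

```python
import hashlib
import json
import sys
from datetime import datetime, timezone


class ExperimentLogger:
    """Capture and restore the parameters needed to rerun an experiment."""

    def __init__(self, model: str, version: str, temperature: float,
                 seed: int, system_prompt: str, dataset_bytes: bytes):
        self.record = {
            "model": model,
            "version": version,
            "temperature": temperature,
            "seed": seed,
            "system_prompt_sha256": hashlib.sha256(
                system_prompt.encode("utf-8")).hexdigest(),
            "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
            "python_version": sys.version.split()[0],
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self.record, f, indent=2)

    @classmethod
    def load(cls, path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    def verify(self, other: dict) -> list[str]:
        """Return the names of fields that differ from a loaded record."""
        skip = {"timestamp"}  # expected to differ between runs
        return [k for k, v in self.record.items()
                if k not in skip and other.get(k) != v]
```

Hashing the prompt and dataset rather than storing them keeps the log small while still detecting any drift from the original run.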
Your team uses GPT-4o through the API. Should you pin to a specific dated version (e.g., "gpt-4o-2024-08-06") or use the latest alias ("gpt-4o")? Analyze the tradeoffs for a production system vs. a research project.
Answer Sketch
Production: always pin to a dated version. Tradeoffs: you get stability and reproducibility, but miss automatic improvements and must manually update when the version is deprecated. Research: using the latest alias is acceptable for initial exploration but pin for final experiments. Tradeoffs: you get the newest capabilities automatically, but results are not reproducible across time. Best practice: pin in production, test new versions in staging, and promote after evaluation passes.
Explain how Docker containers help with LLM experiment reproducibility. What aspects of reproducibility do containers address, and what aspects do they not address?
Answer Sketch
Containers address: OS environment, library versions, Python version, configuration files, and local model weights. They guarantee the code layer and infrastructure layer are identical. Containers do NOT address: API-side model changes (the container calls an external API that may change), non-deterministic GPU behavior, evaluation dataset updates stored outside the container, or rate limiting differences. For full reproducibility, combine containers with model version pinning, dataset versioning, and response caching.
Design a response caching system that stores LLM API responses keyed by (model, prompt_hash, parameters_hash). Explain how this enables exact reproducibility and what the storage and invalidation tradeoffs are.
Answer Sketch
Key: SHA-256 hash of (model_name + model_version + prompt_text + json(parameters)). Value: full API response including tokens, finish_reason, and metadata. Storage: SQLite for local development, Redis or S3 for shared environments. Invalidation: never invalidate for reproducibility; instead, include the cache version in the key. Tradeoffs: storage grows linearly with unique requests (manageable for evaluations, problematic for production traffic). Provides exact reproducibility for any cached request regardless of provider changes.
What Comes Next
In the next section, Section 30.4: Arena-Style and Crowdsourced Evaluation, we explore arena-style evaluation methods that use real user preferences and pairwise comparisons to produce more trustworthy model rankings than static benchmarks.
Bibliography
Yadan, O. (2019). "Hydra: A Framework for Elegantly Configuring Complex Applications." https://hydra.cc/
Iterative. (2024). "DVC: Data Version Control." https://dvc.org/
Zaharia, M., Chen, A., Davidson, A., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." https://mlflow.org/
Biewald, L. (2020). "Experiment Tracking with Weights and Biases." https://wandb.ai/
Pineau, J., Vincent-Lamarre, P., Sinha, K., et al. (2021). "Improving Reproducibility in Machine Learning Research." arXiv:2003.12206
