LLM Experiment Reproducibility

Section 42.7

"If you cannot reproduce it, you did not discover it. You just got lucky once."

SentinelSentinel, Experimentally Rigorous AI Agent
Big Picture

Reproducibility in LLM experiments is harder than in traditional ML, and also more important. Traditional ML experiments depend on data, code, and hyperparameters. LLM experiments add several new dimensions: prompt templates, provider API versions, retrieval configurations, tool definitions, and external service behaviors. When you cannot reproduce an experiment, you cannot trust its results, compare it fairly against alternatives, or debug regressions. This section covers the tools and practices that make LLM experiments reproducible, from prompt versioning through containerized execution environments. The experimental design principles from Section 44.6 define what "reproducible" means in the LLM context.

Prerequisites

This section requires the evaluation and observability concepts from Section 44.4 through Section 44.6. Familiarity with agent systems from Section 26.1 and multi-agent patterns from Section 28.1 provides context for evaluating complex agentic applications.

42.7.1 Why LLM Reproducibility Is Hard

LLM experiments face reproducibility challenges traditional machine learning never had to deal with. Identical code, data, and configuration can produce different results because of factors outside your control: provider-side model updates that ship without changelogs, non-deterministic GPU computation, silently changing API behaviors, and safety filters that tighten between runs.

Fun Fact

A 2024 survey found that fewer than 15% of published LLM papers provided enough detail to reproduce their main results. The most common missing ingredient? The exact system prompt, which researchers often treat like a family recipe: shared reluctantly, if at all.

The LLM Reproducibility Stack

Tip

Version your prompts in a dedicated file (not inline in code) and track them in git alongside your application code. When a regression surfaces weeks later, git blame on your prompt file is often the fastest path to the root cause. Treat prompt changes with the same rigor as database migration scripts.

To fully reproduce an LLM experiment, you must version and capture every layer of the stack: Figure 42.7.1 shows the five layers that must be versioned.

Diagram: The LLM Reproducibility Stack
Figure 42.7.1a: The five layers that must be versioned for fully reproducible LLM experiments.

The practical implication is that reproducibility in LLM work is a spectrum, not a binary. At one end, you can reproduce the exact same evaluation run with the same cached model responses. At the other end, you can only reproduce the experimental methodology while accepting that outputs will differ. Most production teams operate somewhere in the middle: pinning model versions, versioning prompts and configs, and using statistical testing (from Section 42.2) to determine whether observed differences reflect real changes or noise.

42.7.2 Configuration Management with Hydra

Hydra is a configuration framework that enables composable, hierarchical configuration with overrides from the command line. It suits LLM experiments well because it manages the many interacting parameters (model settings, prompt templates, retrieval parameters, evaluation settings) in a structured, versioned way.

Table 42.7.1b: A composed Hydra configuration assembled from three files. The top-level experiment.yaml selects defaults that pull in model/gpt4o.yaml and retrieval/dense_rerank.yaml; any leaf value can be overridden from the command line (for example retrieval.top_k=20).
FileKeyValueNotes
config/experiment.yamldefaults.modelgpt4oPulls in config/model/gpt4o.yaml
defaults.promptrag_v2Pulls in config/prompt/rag_v2.yaml
defaults.retrievaldense_rerankPulls in config/retrieval/dense_rerank.yaml
defaults.evalstandardPulls in config/eval/standard.yaml
experiment.namerag_ablation_cotTag for run grouping
experiment.seed42Base random seed
experiment.num_eval_seeds5Replicates per arm for variance estimation
config/model/gpt4o.yamlprovideropenaiSDK / endpoint to use
model_namegpt-4o-2024-08-06Pinned snapshot, not the moving gpt-4o alias
temperature / max_tokens0.0 / 1024Greedy decoding, hard cap on output length
seed42Server-side seed (OpenAI honours it best-effort)
config/retrieval/dense_rerank.yamlembedding_modeltext-embedding-3-smallFirst-stage dense retrieval
top_k10Candidates before reranking
reranker / rerank_top_ncohere-rerank-v3 / 3Cross-encoder reranker, final survivors
chunk_size512Tokens per indexed chunk
chunk_overlap50Sliding-window overlap between chunks
# implement run_experiment, evaluate_pipeline, save_results
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(version_base=None, config_path="config", config_name="experiment")
def run_experiment(cfg: DictConfig):
    """Run an LLM experiment with full configuration tracking."""
    # Hydra automatically saves the full resolved config
    print(OmegaConf.to_yaml(cfg))
    # Access nested config values
    model_name = cfg.model.model_name
    temperature = cfg.model.temperature
    top_k = cfg.retrieval.top_k
    # Run evaluation with tracked configuration
    results = evaluate_pipeline(cfg)
    save_results(results, cfg)
def evaluate_pipeline(cfg: DictConfig) -> dict:
    """Run the evaluation pipeline with the given config."""
    # Pipeline implementation using config values...
    return {"accuracy": 0.847, "faithfulness": 0.91}
def save_results(results: dict, cfg: DictConfig):
    """Save results alongside the full config for reproducibility."""
    import json
    output = {
        "config": OmegaConf.to_container(cfg, resolve=True),
        "results": results,
        }
    with open("experiment_results.json", "w") as f:
        json.dump(output, f, indent=2)
        # Run with overrides from command line:
            # python experiment.py model=gpt4o_mini retrieval.top_k=20
Code Fragment 42.7.1c: Defines run_experiment and evaluate_pipeline
Note: Hydra Output Directory

Hydra automatically creates a timestamped output directory for each experiment run, containing the fully resolved configuration, any output files, and logs. This makes every run self-documenting. Combined with git commit tracking, this gives you everything needed to reproduce any past experiment: the code (git SHA), the configuration (Hydra output), and the results.

42.7.3 Dataset Versioning with DVC

DVC (Data Version Control) extends git to handle large files and datasets. For LLM experiments, DVC tracks evaluation datasets, knowledge base snapshots, and embedding indexes. By storing lightweight pointer files in git and the actual data in cloud storage (S3, GCS, Azure Blob), DVC provides versioning without bloating the git repository.

# Initialize DVC in your project
$ dvc init
$ dvc remote add -d storage s3://my-bucket/llm-experiments
# Track evaluation dataset
$ dvc add data/eval_{dataset}_v3.jsonl
$ git add data/eval_dataset_v3.jsonl.dvc data/.gitignore
$ git commit -m "Track eval dataset v3"
# Track knowledge base snapshot
$ dvc add data/knowledge_base/
$ git add data/knowledge_{base}.dvc
$ git commit -m "Track knowledge base snapshot 2024-Q3"
# Push data to remote storage
$ dvc push
# Reproduce exact data state from any git commit
$ git checkout experiment-2024-08-15
$ dvc checkout # restores data files to match that commit
Code Fragment 42.7.2: Version-controlling datasets and model artifacts with DVC so every experiment is fully reproducible.

42.7.4 Experiment Tracking with MLflow and W&B

Experiment tracking platforms record every run with its configuration, metrics, artifacts, and metadata. They provide dashboards for comparing runs, visualizing trends, and identifying the best configurations. Both MLflow (open source, self-hostable) and Weights & Biases (cloud-based, with a free tier) are widely used in the LLM community. These platforms complement observability by focusing on offline experiment comparison rather than real-time production monitoring.

# implement track_experiment_mlflow
import mlflow
from omegaconf import DictConfig, OmegaConf
def track_experiment_mlflow(cfg: DictConfig, results: dict):
    """Track an LLM experiment run with MLflow."""
    mlflow.set_experiment(cfg.experiment.name)
    with mlflow.start_run():
        # Log all configuration parameters
        flat_cfg = {
            k: str(v) for k, v
            in OmegaConf.to_container(cfg, resolve=True).items()
            }
        mlflow.log_params({
            "model": cfg.model.model_name,
            "temperature": cfg.model.temperature,
            "top_k": cfg.retrieval.top_k,
            "reranker": cfg.retrieval.reranker,
            "seed": cfg.experiment.seed,
            })
        # Log evaluation metrics
        mlflow.log_metrics({
            "accuracy": results["accuracy"],
            "faithfulness": results["faithfulness"],
            "latency_p50_ms": results["latency_p50"],
            "cost_per_query_usd": results["cost_per_query"],
            })
        # Log prompt template as artifact
        mlflow.log_text(cfg.prompt.template, "prompt_template.txt")
        # Log full config
        mlflow.log_text(OmegaConf.to_yaml(cfg), "full_config.yaml")
        # Tag the run for easy filtering
        mlflow.set_tags({
            "experiment_type": "ablation",
            "git_sha": get_git_sha(),
            })
Code Fragment 42.7.3: Tracking LLM experiment parameters, metrics, and artifacts with MLflow for comparison across runs and model registry integration.

Experiment Tracking Platform Comparison

Table 42.7.2a: Experiment Tracking Platform Comparison (as of 2026).
Feature MLflow Weights & Biases DVC (with Studio)
Open source Yes (fully) No (cloud service) Yes (core), paid UI
Self-hosting Yes Enterprise only Yes
Prompt tracking As artifacts Tables + artifacts Via params/files
Comparison UI Good Excellent Good (Studio)
Data versioning Artifacts (limited) Artifacts + tables Native (core strength)
Cost tracking Custom metrics Custom metrics Custom metrics
Library Shortcut: Weights and Biases in Practice

Track LLM experiment parameters, prompts, and evaluation scores with W&B for visual comparison.

Show code
# pip install wandb
import wandb
wandb.init(project="llm-experiments", name="prompt_v3_gpt4o")
wandb.config.update({
    "model": "gpt-4o",
    "temperature": 0.2,
    "prompt_version": "v3",
    "eval_dataset": "support_tickets_v2",
})
# Log evaluation metrics
wandb.log({
    "faithfulness": 0.89,
    "answer_relevancy": 0.92,
    "latency_p95_ms": 1420,
    "cost_per_request_usd": 0.018,
})
# Log the prompt template as an artifact
artifact = wandb.Artifact("prompt_template", type="prompt")
artifact.add_file("prompts/support_summary_v3.txt")
wandb.log_artifact(artifact)
wandb.finish()
Code Fragment 42.7.6: Pip install wandb.
Key Insight

The choice between MLflow and W&B often comes down to infrastructure preferences. Choose MLflow when self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already using the MLflow ecosystem. Choose W&B when you want the best visualization and collaboration UI, when cloud hosting is acceptable, or when you value features like report generation and team dashboards. Both integrate well with Hydra and DVC.

42.7.5 Containerized Reproducibility with Docker

Docker containers provide the ultimate reproducibility guarantee for the code and infrastructure layers. A Dockerfile that pins every dependency version, combined with versioned data (DVC) and configuration (Hydra), enables exact reproduction of any experiment on any machine. Figure 42.7.2b combines all elements into a complete reproducibility workflow.

# Dockerfile for reproducible LLM experiments
FROM python:3.11-slim
# Pin system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
 git curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy and install pinned Python dependencies first (cache-friendly)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Record build metadata for traceability
ARG GIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_DATE=${BUILD_DATE}
# Default command: run experiment with Hydra
ENTRYPOINT ["python", "experiment.py"]
Code Fragment 42.7.4: Containerizing LLM experiments with Docker to ensure consistent environments across development and production.

Code Fragment 42.7.5 below shows how to build and run the container with configuration overrides and metadata tags.

# Build with metadata
$ docker build \
 --build-arg GIT_SHA=$(git rev-parse HEAD) \
 --build-arg BUILD_{DATE}=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
 -t llm-experiment:v1.2.0 .
# Run experiment with config overrides
$ docker run \
 -e OPENAI_{API}_{KEY}=$OPENAI_API_KEY \
 -v $(pwd)/data:/app/data \
 -v $(pwd)/outputs:/app/outputs \
 llm-experiment:v1.2.0 \
 model=gpt4o retrieval.top_k=5 experiment.seed=42
Code Fragment 42.7.5a: Building the Docker image with Git SHA and date metadata baked in, then launching an experiment run with Hydra config overrides passed at the command line.
Reproducibility workflow chain: Hydra config (parameters) feeds Git plus DVC (code and data), then Docker build (environment), then deterministic experiment run, then MLflow or W and B (results tracking).
Figure 42.7.2c: Complete reproducibility workflow combining configuration, code, data, environment, and result tracking.
Warning: API-Based Models Break Full Reproducibility

When using API-based models (OpenAI, Anthropic, Google), you can never achieve full bit-level reproducibility because the provider controls the inference hardware and may change it between requests. The best you can do is pin the model version, set temperature to 0, provide a seed, and log the system_fingerprint. For experiments requiring guaranteed reproducibility, use locally hosted open-weight models where you control the entire stack.

Real-World Scenario
Reproducing a Winning Experiment Configuration Three Months Later

Who: Research engineering team at an NLP startup that had found a promising prompt and model configuration during a week of rapid experimentation

Situation: Three months after the initial experiments, the team needed to reproduce their best-performing configuration as a baseline for a new project. The original researcher had left the company.

Problem: Experiment notes were scattered across Slack messages, Jupyter notebooks, and a shared spreadsheet. The team could not determine the exact prompt version, model parameters, evaluation dataset version, or Python dependencies used in the winning run.

Dilemma: Investing in reproducibility infrastructure felt like overhead during rapid experimentation, but the cost of not having it was now weeks of wasted effort trying to recreate results.

Decision: After rebuilding the baseline (at significant cost), the team implemented a full reproducibility stack: Hydra for configuration, DVC for data versioning, MLflow for experiment tracking, and Docker for environment pinning.

How: Every experiment run was defined by a Hydra YAML config that captured the complete parameter set (prompt template, model name, temperature, seed, evaluation dataset path). DVC tracked the evaluation dataset and knowledge base versions alongside git commits. MLflow logged all configs, metrics, and output artifacts. A Docker image pinned all Python dependencies.

Result: Six months after implementation, a similar situation arose: a team member needed to reproduce a 4-month-old experiment. They found the MLflow run, checked out the associated git commit (which included Hydra config and DVC data pointers), built the Docker image, and reproduced the results within 0.3% of the original metrics in under 30 minutes.

Lesson: The upfront cost of reproducibility infrastructure (Hydra, DVC, MLflow, Docker) is a fraction of the cost of even one failed attempt to reproduce important results months later.

Tip: Monitor Latency by Percentile, Not Average

Report p50, p90, p95, and p99 latencies separately. A healthy p50 can hide a terrible p99, and it is the slowest requests that generate the most user complaints and support tickets.

Research Frontier

Open Questions in LLM Reproducibility (2024-2026):

Explore Further: Set up a Hydra + DVC + MLflow stack for a simple LLM evaluation experiment, then have a colleague reproduce your results using only the logged artifacts. Measure how close their results come.

Key Takeaways
Self-Check

1. List five layers of the LLM reproducibility stack and explain what must be versioned at each layer.

Show Answer
(1) Prompt layer: system prompts, user templates, few-shot examples, tool descriptions must be versioned in files tracked by git. (2) Model layer: provider name, model version (with date suffix), temperature, seed, max_tokens must be captured in configuration files. (3) Data layer: evaluation datasets and knowledge base contents must be versioned with DVC or similar tools. (4) Code layer: application code (git SHA), library versions (requirements.txt) must be committed and pinned. (5) Infrastructure layer: Docker image with pinned base image, API endpoint configuration, and hardware specification for local models.

2. How does Hydra improve reproducibility compared to using command-line arguments or environment variables?

Show Answer
Hydra automatically saves the complete, fully resolved configuration for every experiment run to a timestamped output directory. This means every parameter value is recorded without any manual effort. Hydra also supports composition (combining multiple config files), type validation, and structured overrides. Command-line arguments and environment variables, by contrast, are ephemeral and not automatically recorded. You would need to manually log them, which is error-prone and often forgotten. Hydra makes configuration explicit, versioned, and automatically archived.

3. Why does DVC store pointer files in git rather than the actual data?

Show Answer
Large datasets and embedding indexes can be gigabytes or more, which would bloat the git repository, slow down cloning, and exceed GitHub's file size limits. DVC stores small pointer files (containing a hash of the data) in git and the actual data in remote storage (S3, GCS, etc.). When you check out a specific git commit and run dvc checkout, DVC uses the pointer file to download the exact data version from remote storage. This gives you git-like versioning semantics for large files without the storage overhead.

4. When would you choose MLflow over Weights & Biases for experiment tracking?

Show Answer
Choose MLflow when: (1) you require self-hosted infrastructure for data privacy or compliance reasons, (2) you want a fully open-source solution with no vendor lock-in, (3) your team already uses MLflow for traditional ML experiments, or (4) you need to integrate with tools in the MLflow ecosystem (model registry, deployment). W&B is preferred when you want the best visualization UI, team collaboration features, or managed cloud infrastructure.

5. Why can you not achieve full bit-level reproducibility with API-based LLM providers?

Show Answer
API providers control the inference infrastructure, including GPU type, parallelization strategy, and floating-point computation order. Even with temperature=0 and a fixed seed, floating-point arithmetic on GPUs is not perfectly deterministic because operations like matrix multiplication can be parallelized in different orders that produce slightly different rounding results. Additionally, providers may silently change the hardware or software stack between requests. The system_fingerprint field helps detect such changes, but cannot prevent them. For guaranteed reproducibility, you must host the model locally and control the entire inference stack.

Exercises

Exercise 30.4.1: Reproducibility Challenges Conceptual

List five factors that make LLM experiments harder to reproduce than traditional ML experiments. For each factor, explain whether it is under the experimenter's control.

Answer Sketch

(1) Provider-side model updates (not under control). (2) Non-deterministic GPU computation (partially controllable with seeds). (3) System prompt versioning (under control if you version prompts). (4) API behavior changes and safety filters (not under control). (5) Evaluation dataset contamination in training data (not under control for API models). The key insight is that many irreproducibility sources are outside the experimenter's control, making documentation and logging even more critical.

Exercise 30.4.2: Experiment Logging Coding

Write a Python class ExperimentLogger that captures all parameters needed to reproduce an LLM experiment: model name, version, temperature, seed, system prompt hash, evaluation dataset hash, library versions, and timestamp. It should save to a JSON file and support loading for reproduction.

Answer Sketch

The class stores parameters in a dictionary. Use hashlib.sha256 to hash prompt text and dataset contents. Use pkg_resources or importlib.metadata to capture library versions. The save() method writes JSON with json.dump. The load() classmethod reads it back. Include a verify() method that checks whether the current environment matches the logged configuration and warns about discrepancies.

Exercise 30.4.3: Version Pinning Strategy Analysis

Your team uses GPT-4o through the API. Should you pin to a specific dated version (e.g., "gpt-4o-2024-08-06") or use the latest alias ("gpt-4o")? Analyze the tradeoffs for a production system vs. a research project.

Answer Sketch

Production: always pin to a dated version. Tradeoffs: you get stability and reproducibility, but miss automatic improvements and must manually update when the version is deprecated. Research: using the latest alias is acceptable for initial exploration but pin for final experiments. Tradeoffs: you get the newest capabilities automatically, but results are not reproducible across time. Best practice: pin in production, test new versions in staging, and promote after evaluation passes.

Exercise 30.4.4: Containerized Experiments Conceptual

Explain how Docker containers help with LLM experiment reproducibility. What aspects of reproducibility do containers address, and what aspects do they not address?

Answer Sketch

Containers address: OS environment, library versions, Python version, configuration files, and local model weights. They guarantee the code layer and infrastructure layer are identical. Containers do NOT address: API-side model changes (the container calls an external API that may change), non-deterministic GPU behavior, evaluation dataset updates stored outside the container, or rate limiting differences. For full reproducibility, combine containers with model version pinning, dataset versioning, and response caching.

Exercise 30.4.5: Response Caching for Reproducibility Coding

Design a response caching system that stores LLM API responses keyed by (model, prompt_hash, parameters_hash). Explain how this enables exact reproducibility and what the storage and invalidation tradeoffs are.

Answer Sketch

Key: SHA-256 hash of (model_name + model_version + prompt_text + json(parameters)). Value: full API response including tokens, finish_reason, and metadata. Storage: SQLite for local development, Redis or S3 for shared environments. Invalidation: never invalidate for reproducibility; instead, include the cache version in the key. Tradeoffs: storage grows linearly with unique requests (manageable for evaluations, problematic for production traffic). Provides exact reproducibility for any cached request regardless of provider changes.

What Comes Next

In the next section, Section 42.8: LLM-as-Judge: Reliability, Debiasing, and Training Judge Models, we explore arena-style evaluation methods that use real user preferences and pairwise comparisons to produce more trustworthy model rankings than static benchmarks.

Further Reading

Configuration Management

Yadan, O. (2019). "Hydra: A Framework for Elegantly Configuring Complex Applications." https://hydra.cc/

Data Versioning

Iterative. (2024). "DVC: Data Version Control." https://dvc.org/

Experiment Tracking

Zaharia, M., Chen, A., Davidson, A., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." https://mlflow.org/
Biewald, L. (2020). "Experiment Tracking with Weights and Biases." https://wandb.ai/

Reproducibility Standards

Pineau, J., Vincent-Lamarre, P., Sinha, K., et al. (2021). "Improving Reproducibility in Machine Learning Research." arXiv:2003.12206