Reproducibility means that someone else (or your future self) can re-run your experiment and get the same results. In ML, perfect reproducibility is difficult to achieve because of floating-point non-determinism on GPUs, but you can get very close. Code Fragment E.4.1 below puts this into practice.
The Reproducibility Checklist
- Pin all dependencies. Use `pip freeze > requirements.txt` or `conda env export > environment.yml`. Record the exact PyTorch, Transformers, and CUDA versions.
- Set random seeds. Fix seeds for Python, NumPy, and PyTorch at the start of every script:

```python
# Fix seeds for every source of randomness: Python, NumPy, and PyTorch (CPU and CUDA).
import random

import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
```

Code Fragment E.4.1: Fixing random seeds for Python, NumPy, and PyTorch (including CUDA) to ensure reproducible training runs across executions.

- Version your data. Use DVC or record dataset checksums (SHA-256 hashes) in your experiment logs.
- Log everything. Hyperparameters, hardware info, git commit hash, and the exact command used to launch training.
- Use configuration files. Store hyperparameters in YAML or JSON files rather than command-line arguments. Version these files with Git.
- Record hardware info. GPU model, driver version, and CUDA version can all affect results. Log them automatically:
```python
# Capture hardware and framework metadata for the experiment log.
import torch

config = {
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
    "cuda_version": torch.version.cuda,
    "pytorch_version": torch.__version__,
}
```

Code Fragment E.4.2: Capturing hardware metadata (GPU model, CUDA version, PyTorch version) in a dictionary for inclusion in experiment logs.
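The data-versioning step above can be sketched with a small standard-library helper. This is a minimal sketch; `file_sha256` is an illustrative name, not part of any library:

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in chunks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() calls f.read(chunk_size) repeatedly until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Logging the returned hex digest next to the dataset name in your experiment record lets anyone verify later that they are training on byte-identical data.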
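The "log everything" step can likewise be sketched in a few lines. This is a minimal sketch, assuming the script runs inside a Git checkout; `run_metadata` is a hypothetical helper name, and the code falls back gracefully when `git` is unavailable:

```python
import subprocess
import sys

def run_metadata():
    """Capture the current git commit hash and the exact launch command."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not inside a git repo, or git not installed
    return {
        "git_commit": commit,
        "command": " ".join(sys.argv),  # the exact command used to launch training
    }
```

Merging this dictionary into the same record as your hyperparameters and hardware info gives one self-describing log entry per run.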
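For the configuration-file step, here is a minimal sketch using JSON from the standard library (the checklist also mentions YAML, which requires a third-party parser); the hyperparameter names are illustrative:

```python
import json
from pathlib import Path

# Hypothetical hyperparameters; store them in a versioned file, not CLI flags.
config = {"learning_rate": 3e-4, "batch_size": 32, "epochs": 10, "seed": 42}

def save_config(cfg, path):
    # sort_keys gives stable output, so Git diffs show real changes only.
    Path(path).write_text(json.dumps(cfg, indent=2, sort_keys=True))

def load_config(path):
    return json.loads(Path(path).read_text())
```

Committing the JSON file alongside the code means the git commit hash alone pins both the implementation and the hyperparameters of a run.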
A 2022 survey found that only 30% of ML papers included enough information to reproduce their results. The ML community is gradually improving through tools like DVC, W&B, and Hugging Face Model Cards, but the gap between "it works on my machine" and "anyone can verify this" remains wide. Being rigorous about versioning and logging puts you ahead of most practitioners.
Git, DVC, and experiment tracking are not overhead; they are infrastructure. A team that invests in these tools early spends less time debugging "what changed?" and more time making progress. For solo practitioners, these same tools serve as a conversation with your future self: three months from now, you will not remember which learning rate produced that promising result unless you logged it.
Collaboration Toolkit Summary
- Start every ML repo with a thorough `.gitignore` that excludes model weights, datasets, and cache directories.
- Use Git LFS for the rare large files that must be versioned, and DVC for datasets and model checkpoints.
- Track experiments with W&B (cloud, great UI) or MLflow (self-hosted, open source).
- Pin dependencies, set random seeds, use config files, and log hardware info for reproducibility.
- Treat collaboration tooling as an investment, not a cost. It compounds over every experiment you run.
What Comes Next
Continue to Appendix F: Glossary for the next reference appendix in this collection.