Appendices
Appendix E: Git, DVC, and Reproducibility

E.1 Git Basics for ML Projects

Git is the foundation of code versioning, but ML projects require special attention to what goes into the repository and what stays out. The single most important configuration for any ML repo is the .gitignore file. Code Fragment E.1.1 below puts this into practice.

A .gitignore for LLM Work

This snippet provides a .gitignore template tailored for LLM projects, covering model weights, datasets, and environment files.

# Model weights and checkpoints
*.bin
*.safetensors
*.pt
*.pth
*.ckpt
*.gguf
checkpoints/
models/

# Datasets
data/raw/
data/processed/
*.parquet
*.arrow

# Hugging Face cache
.cache/

# Training outputs
runs/
wandb/
mlruns/
outputs/
*.log

# Environment
*.env
.env
__pycache__/
*.pyc
*.egg-info/
venv/
llm-env/

# Jupyter notebook checkpoints
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db
Code Fragment E.1.1: A .gitignore tailored for ML projects, excluding large model checkpoints, dataset files, virtual environments, and the Hugging Face cache directory.

Warning: Accidentally Committing Large Files

If you accidentally commit a large model file to Git, deleting it in a later commit does not fix the problem: the file remains in Git's history and inflates every clone of the repository. To remove large files from history, use git filter-repo (which the Git documentation recommends over the older git filter-branch) or the BFG Repo-Cleaner. Better yet, set up your .gitignore before your first commit.

Git LFS for Large Files

When you do need to version large files (specific model checkpoints that are part of your deliverable, for instance), Git Large File Storage (LFS) replaces them with lightweight pointers in the repository while storing the actual file contents on a remote server. Code Fragment E.1.2 below puts this into practice.

# Set up Git LFS hooks (the git-lfs extension itself must already be installed)
git lfs install

# Track specific large file types
git lfs track "*.safetensors"
git lfs track "*.bin"

# git lfs track records these patterns in .gitattributes; commit that file
git add .gitattributes
git commit -m "Configure Git LFS for model files"

# Now large files are handled transparently.
# If *.safetensors is still listed in .gitignore (as in Code Fragment E.1.1),
# remove that pattern or use "git add -f", otherwise Git refuses to add the file.
git add model.safetensors
git commit -m "Add fine-tuned model checkpoint"
Code Fragment E.1.2: Configuring Git LFS to track large model files (.safetensors, .bin). LFS replaces these files with lightweight pointers in the repository and stores the actual data on a remote server.
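
To make the "lightweight pointer" idea concrete: a Git LFS pointer is a small text file containing a version line, an oid line with the SHA-256 hash of the real content, and a size line in bytes. The sketch below parses such a pointer and checks whether a given byte string matches it. The function names (parse_lfs_pointer, matches_pointer, make_pointer_text) and the sample payload are illustrative assumptions, not part of Git LFS itself.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "sha256:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(pointer: dict, content: bytes) -> bool:
    """Return True if content has the size and SHA-256 the pointer records."""
    return (
        pointer["oid_algo"] == "sha256"
        and len(content) == pointer["size"]
        and hashlib.sha256(content).hexdigest() == pointer["oid"]
    )


def make_pointer_text(content: bytes) -> str:
    """Build pointer text for content (hypothetical helper for this demo)."""
    digest = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    payload = b"pretend these bytes are model weights"
    pointer = parse_lfs_pointer(make_pointer_text(payload))
    print(pointer["size"], matches_pointer(pointer, payload))
```

Whatever the size of the tracked file, the pointer stays a few lines long, which is why the Git repository itself remains small.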

Hugging Face Hub uses Git LFS under the hood for all model repositories. When you push a model with model.push_to_hub(), it is automatically stored via LFS.
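
Because a large file is costly to remove once it has been committed, it can help to scan the working tree before the first commit and decide which files belong in .gitignore and which should be tracked with LFS. The following is a minimal sketch, not a Git feature: the function name find_large_files and the 50 MiB default threshold are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def find_large_files(root, limit_bytes=50 * 1024 * 1024):
    """Return (relative_path, size) pairs for files at or above limit_bytes.

    The .git directory is skipped; results are sorted largest first.
    """
    root = Path(root)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # skip links rather than follow them
            size = path.stat().st_size
            if size >= limit_bytes:
                hits.append((path.relative_to(root).as_posix(), size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Scan the current directory with the default (assumed) threshold.
    for rel_path, size in find_large_files("."):
        print(f"{size / 1e6:8.1f} MB  {rel_path}")
```

Each path it reports is a candidate for a .gitignore pattern or a git lfs track rule, depending on whether the file is part of your deliverable.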

Branching Strategy for ML

A practical branching strategy for ML projects: