DVC extends Git to handle large datasets and model files. It works alongside Git: code is tracked by Git, and data and models are tracked by DVC. The two stay in sync through lightweight .dvc pointer files that Git does track. Code Fragment E.2.1 below puts this into practice.
# Install DVC
pip install dvc dvc-s3 # or dvc-gs for Google Cloud, dvc-azure for Azure
# Initialize DVC in an existing Git repo
dvc init
# Track a large dataset
dvc add data/training_corpus.parquet
# This creates data/training_corpus.parquet.dvc (a small pointer file)
git add data/training_corpus.parquet.dvc data/.gitignore
git commit -m "Track training corpus with DVC"
# Push data to remote storage (e.g., S3 bucket)
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push
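# For reference, the pointer file created by `dvc add` is itself a small
# YAML file; its contents look roughly like this (hash and size values
# are illustrative):
#
#   outs:
#   - md5: <content hash>
#     size: <size in bytes>
#     path: training_corpus.parquet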
# Example dvc.yaml pipeline
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw/corpus.jsonl
      - scripts/preprocess.py
    outs:
      - data/processed/train.jsonl
      - data/processed/eval.jsonl
  train:
    cmd: python scripts/train.py --config configs/lora.yaml
    deps:
      - data/processed/train.jsonl
      - scripts/train.py
      - configs/lora.yaml
    outs:
      - models/finetuned/
    metrics:
      - metrics/train_results.json:
          cache: false
Code Fragment E.2.1: Initializing DVC, tracking a large dataset file, pushing it to remote storage (S3), and defining a reproducible two-stage pipeline in dvc.yaml with explicit dependencies and outputs.

Key Insight: DVC Pipelines
DVC can also define reproducible pipelines with dvc.yaml. Each stage specifies its dependencies (input files, scripts) and outputs. Running dvc repro re-executes only the stages whose dependencies have changed. This is invaluable for ML workflows where you need to reprocess data, retrain, and re-evaluate in a consistent order. Code Fragment E.2.2 below puts this into practice.
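A minimal sketch of these commands, assuming the two-stage dvc.yaml above (the revision argument to dvc metrics diff is illustrative):

```shell
# Re-run only the stages whose dependencies have changed
dvc repro
# Display the metrics tracked in dvc.yaml (here, metrics/train_results.json)
dvc metrics show
# Compare metrics in the workspace against a previous revision
dvc metrics diff HEAD~1
```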
Code Fragment E.2.2: Running dvc repro to re-execute only the pipeline stages whose dependencies have changed, and using dvc metrics show to compare results across experiments.