Appendices
Appendix E: Git, DVC, and Reproducibility

E.3 Experiment Tracking

When you run dozens of training experiments with different hyperparameters, keeping track of what you tried, what worked, and what failed becomes critical. Experiment tracking tools log metrics, hyperparameters, and artifacts automatically.
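The core of what these tools automate can be sketched with the standard library alone: append each run's hyperparameters and metrics as one JSON record to a log file, then read the records back to compare runs. This is an illustrative baseline, not part of any tracker's API; the file name runs.jsonl and the helper names are arbitrary.

```python
import json
from pathlib import Path

def log_run(path, name, params, metrics):
    """Append one experiment record as a single JSON line."""
    record = {"name": name, "params": params, "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    """Read all logged runs back for comparison."""
    return [json.loads(line) for line in Path(path).read_text().splitlines()]

# Log one run, then find the best run so far by eval loss.
log_run("runs.jsonl", "lora-r16-lr2e4",
        params={"lora_r": 16, "learning_rate": 2e-4},
        metrics={"eval_loss": 0.42})
best = min(load_runs("runs.jsonl"), key=lambda r: r["metrics"]["eval_loss"])
```

Dedicated trackers add exactly what this baseline lacks: automatic capture from the training loop, dashboards, and artifact storage.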

Weights & Biases (W&B)

W&B is the most popular experiment tracker in the LLM community. It integrates directly with the Hugging Face Trainer. Code Fragment E.3.1 below puts this into practice.


# W&B experiment tracking with the Hugging Face Trainer
import wandb
from transformers import TrainingArguments, Trainer

# Initialize a W&B run
wandb.init(project="llm-finetuning", name="lora-r16-lr2e4")

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",         # enables automatic logging
    logging_steps=10,
    num_train_epochs=3,
    learning_rate=2e-4,
)

# The Trainer automatically logs loss, learning rate, etc. to W&B.
# (model, train_data, and eval_data are assumed to be defined earlier.)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()

wandb.finish()
Code Fragment E.3.1: Automatic logging of training metrics from the Hugging Face Trainer to W&B. Setting report_to="wandb" is the only change the Trainer needs; loss and learning rate stream to the dashboard as training runs.

MLflow

MLflow is an open-source alternative that can be self-hosted. It is popular in enterprise settings where data must stay on-premise. Code Fragment E.3.2 below puts this into practice.


# MLflow experiment tracking: parameters, metrics, and artifacts
import mlflow

mlflow.set_experiment("llm-finetuning")

with mlflow.start_run(run_name="lora-r16-lr2e4"):
    mlflow.log_params({
        "model": "llama-3.1-8b",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    })

    # ... training code ...

    mlflow.log_metrics({
        "eval_loss": 0.42,
        "eval_accuracy": 0.87,
    })

    # Log the model as an artifact
    mlflow.log_artifact("./output/adapter_model.safetensors")
Code Fragment E.3.2: Logging hyperparameters, evaluation metrics, and model artifacts to MLflow. Each run captures a complete snapshot of the experiment configuration and results for later comparison.
Feature Comparison
Feature                  W&B                                   MLflow
Hosting                  Cloud (free tier) or self-hosted      Self-hosted or Databricks
HF Trainer integration   Built-in                              Via callback
Visualization            Excellent web dashboard               Good local UI
Team collaboration       Strong (shared workspaces)            Basic (model registry)
Cost                     Free for individuals, paid for teams  Free and open source
Comparing Trackers
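
The "via callback" row is worth making concrete. The sketch below assumes the transformers library's built-in MLflow integration, which is enabled the same way as W&B: through the report_to argument, behind which the Trainer attaches its MLflow callback. This is a configuration fragment, not a complete training script.

```python
from transformers import TrainingArguments

# Switching the tracker from W&B to MLflow is a one-line change.
training_args = TrainingArguments(
    output_dir="./output",
    report_to="mlflow",    # Trainer routes its logs through the MLflow callback
    logging_steps=10,
)
```

Everything else in the training code from Code Fragment E.3.1 stays the same, which is the main practical benefit of the Trainer's pluggable reporting.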

If your team already runs MLflow in production, adding W&B for personal experiments can still be worthwhile. Most trackers can export runs as CSV or JSON, so you are not locked in: migrating between platforms later is straightforward.
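
As a sketch of that export path: the records below are stand-ins for what a tracker's API would return (for W&B, each run's config and summary metrics via wandb.Api); the flattening helper itself is plain standard library code.

```python
import csv

def runs_to_csv(runs, path):
    """Flatten run records (name + config + summary metrics) into one CSV."""
    rows = [{"name": r["name"], **r["config"], **r["summary"]} for r in runs]
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Stand-in records; a real export would fetch these from the tracker's API.
runs = [
    {"name": "lora-r16-lr2e4", "config": {"lora_r": 16}, "summary": {"eval_loss": 0.42}},
    {"name": "lora-r8-lr1e4",  "config": {"lora_r": 8},  "summary": {"eval_loss": 0.47}},
]
runs_to_csv(runs, "runs_export.csv")
```

The resulting CSV is tool-agnostic, which is what makes moving run histories between platforms feasible.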