Appendices
Appendix E: Git, DVC, and Reproducibility

E.3 Experiment Tracking

When you run dozens of training experiments with different hyperparameters, keeping track of what you tried, what worked, and what failed becomes critical. Experiment tracking tools log metrics, hyperparameters, and artifacts automatically.
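The core of what these tools automate can be sketched with the standard library alone: append each run's hyperparameters and metrics as one JSON record to a log file, then read the records back to compare runs. This is an illustrative baseline, not part of any tracker's API; the file name runs.jsonl and the helper names are arbitrary.

```python
import json
from pathlib import Path

def log_run(path, name, params, metrics):
    """Append one experiment record as a single JSON line."""
    record = {"name": name, "params": params, "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    """Read all logged runs back for comparison."""
    return [json.loads(line) for line in Path(path).read_text().splitlines()]

# Log one run, then find the best run so far by eval loss.
log_run("runs.jsonl", "lora-r16-lr2e4",
        params={"lora_r": 16, "learning_rate": 2e-4},
        metrics={"eval_loss": 0.42})
best = min(load_runs("runs.jsonl"), key=lambda r: r["metrics"]["eval_loss"])
```

Dedicated trackers add exactly what this baseline lacks: automatic capture from the training loop, dashboards, and artifact storage.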

Weights & Biases (W&B)

W&B is the most popular experiment tracker in the LLM community. It integrates directly with the Hugging Face Trainer. Code Fragment E.3.1 below puts this into practice.


# W&B experiment tracking with the Hugging Face Trainer
import wandb
from transformers import TrainingArguments, Trainer

# Initialize a W&B run
wandb.init(project="llm-finetuning", name="lora-r16-lr2e4")

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",         # enables automatic logging
    logging_steps=10,
    num_train_epochs=3,
    learning_rate=2e-4,
)

# The Trainer automatically logs loss, learning rate, etc. to W&B.
# (model, train_data, and eval_data are assumed to be defined earlier.)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()

wandb.finish()
Code Fragment E.3.1: Automatic logging of training metrics from the Hugging Face Trainer to W&B. Setting report_to="wandb" is the only change the Trainer needs; loss and learning rate stream to the dashboard as training runs.

MLflow

MLflow is an open-source alternative that can be self-hosted. It is popular in enterprise settings where data must stay on-premise. Code Fragment E.3.2 below puts this into practice.


# MLflow experiment tracking: parameters, metrics, and artifacts
import mlflow

mlflow.set_experiment("llm-finetuning")

with mlflow.start_run(run_name="lora-r16-lr2e4"):
    mlflow.log_params({
        "model": "llama-3.1-8b",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    })

    # ... training code ...

    mlflow.log_metrics({
        "eval_loss": 0.42,
        "eval_accuracy": 0.87,
    })

    # Log the model as an artifact
    mlflow.log_artifact("./output/adapter_model.safetensors")
Code Fragment E.3.2: Logging hyperparameters, evaluation metrics, and model artifacts to MLflow. Each run captures a complete snapshot of the experiment configuration and results for later comparison.
Feature Comparison
Feature                  W&B                                   MLflow
Hosting                  Cloud (free tier) or self-hosted      Self-hosted or Databricks
HF Trainer integration   Built-in                              Via callback
Visualization            Excellent web dashboard               Good local UI
Team collaboration       Strong (shared workspaces)            Basic (model registry)
Cost                     Free for individuals, paid for teams  Free and open source
Comparing Trackers
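
The "via callback" row is worth making concrete. The sketch below assumes the transformers library's built-in MLflow integration, which is enabled the same way as W&B: through the report_to argument, behind which the Trainer attaches its MLflow callback. This is a configuration fragment, not a complete training script.

```python
from transformers import TrainingArguments

# Switching the tracker from W&B to MLflow is a one-line change.
training_args = TrainingArguments(
    output_dir="./output",
    report_to="mlflow",    # Trainer routes its logs through the MLflow callback
    logging_steps=10,
)
```

Everything else in the training code from Code Fragment E.3.1 stays the same, which is the main practical benefit of the Trainer's pluggable reporting.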

If your team already runs MLflow in production, adding W&B for personal experiments can still be worthwhile. Most trackers can export runs as CSV or JSON, so you are not locked in: migrating between platforms later is straightforward.
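
As a sketch of that export path: the records below are stand-ins for what a tracker's API would return (for W&B, each run's config and summary metrics via wandb.Api); the flattening helper itself is plain standard library code.

```python
import csv

def runs_to_csv(runs, path):
    """Flatten run records (name + config + summary metrics) into one CSV."""
    rows = [{"name": r["name"], **r["config"], **r["summary"]} for r in runs]
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Stand-in records; a real export would fetch these from the tracker's API.
runs = [
    {"name": "lora-r16-lr2e4", "config": {"lora_r": 16}, "summary": {"eval_loss": 0.42}},
    {"name": "lora-r8-lr1e4",  "config": {"lora_r": 8},  "summary": {"eval_loss": 0.47}},
]
runs_to_csv(runs, "runs_export.csv")
```

The resulting CSV is tool-agnostic, which is what makes moving run histories between platforms feasible.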