Building Conversational AI with LLMs and Agents
Appendix R: Experiment Tracking: W&B and MLflow

Weights and Biases: Runs, Logging, and Sweeps

Big Picture

Weights & Biases (W&B) is a platform for tracking machine learning experiments, visualizing results, and collaborating with teams. Every training run, hyperparameter sweep, and model evaluation is logged automatically, giving you a complete audit trail of your ML work. This section covers the core workflow: initializing runs, logging metrics and artifacts, configuring sweeps for hyperparameter optimization, and managing datasets with W&B Artifacts.

1. Installation and Authentication

W&B provides a Python client that integrates with any ML framework. After installation, you authenticate once and the credentials are cached locally.

# Install the W&B client
pip install wandb

# Authenticate (one-time setup)
import wandb
wandb.login()  # Opens a browser to get your API key
# Or set the API key directly:
# wandb.login(key="your-api-key")

All logged data is sent to the W&B cloud platform (or a self-hosted server for enterprise deployments). Each user belongs to one or more teams, and each team contains projects that organize related experiments.

2. Initializing a Run

A run is a single execution of your training script. Every metric, hyperparameter, and artifact logged during that execution is associated with the run. Initialize a run at the start of your script.

import wandb

# Start a new run
run = wandb.init(
    project="llm-fine-tuning",        # Project name (created if new)
    name="gpt2-lora-experiment-03",    # Human-readable run name
    config={                           # Hyperparameters
        "model": "gpt2",
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": 3,
        "lora_rank": 8,
        "lora_alpha": 16,
        "dataset": "alpaca-52k",
    },
    tags=["lora", "gpt2", "baseline"],
    notes="Baseline LoRA fine-tune with default hyperparameters.",
)

# Access config values throughout your script
lr = wandb.config.learning_rate
bs = wandb.config.batch_size

The config dictionary is critical for reproducibility. It records every decision that affects the experiment: model choice, hyperparameters, dataset version, and preprocessing steps. W&B displays these configs in a filterable table, making it easy to compare runs.

Key Insight

Always log your complete configuration, not just the hyperparameters you are tuning. Include the dataset version, random seed, library versions, and hardware type. When you revisit an experiment months later, the config is your only reliable record of what you actually ran.
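
One lightweight way to follow this advice is to assemble the config from the environment at startup, then pass it to wandb.init. A minimal sketch (build_full_config is a hypothetical helper, not part of the W&B API; the wandb.init call is shown commented out):

```python
import platform
import random
import sys

def build_full_config(hparams: dict, seed: int = 42) -> dict:
    # Merge the tuned hyperparameters with environment details so the
    # run config is a complete record of what was executed.
    random.seed(seed)  # set the seed you are about to record
    return {
        **hparams,
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }

config = build_full_config({"learning_rate": 2e-5, "batch_size": 16})
# wandb.init(project="llm-fine-tuning", config=config)
```

Library versions (e.g. via importlib.metadata.version) and GPU type can be added the same way; the point is that everything lands in one dict that W&B stores with the run.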

3. Logging Metrics

During training, log metrics at each step or epoch. W&B automatically generates interactive charts for every logged metric.

# Training loop with W&B logging
for epoch in range(wandb.config.epochs):
    for step, batch in enumerate(dataloader):
        loss = train_step(model, batch)

        # Log step-level metrics
        wandb.log({
            "train/loss": loss,
            "train/learning_rate": scheduler.get_last_lr()[0],
            "train/step": step + epoch * len(dataloader),
        })

    # Log epoch-level metrics
    val_loss, val_accuracy = evaluate(model, val_dataloader)
    wandb.log({
        "val/loss": val_loss,
        "val/accuracy": val_accuracy,
        "epoch": epoch,
    })

    print(f"Epoch {epoch}: val_loss={val_loss:.4f}, val_acc={val_accuracy:.3f}")

Use a consistent naming convention with slashes to organize metrics into groups: train/loss, val/loss, test/accuracy. W&B uses the prefix to group charts automatically in the dashboard.
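
A small helper can enforce the prefix convention so metric names never drift between call sites. A sketch (log_prefixed is a hypothetical name; the wandb.log call is commented out):

```python
def log_prefixed(metrics: dict, prefix: str) -> dict:
    # Prepend a group prefix so all charts land in the same dashboard section
    return {f"{prefix}/{k}": v for k, v in metrics.items()}

payload = log_prefixed({"loss": 0.31, "accuracy": 0.88}, "val")
# wandb.log(payload)  # logs val/loss and val/accuracy
```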

4. Logging Rich Media

Beyond scalar metrics, W&B supports images, audio, tables, histograms, and custom HTML. This is especially useful for LLM projects where you want to track generated text quality.

# Log a table of model predictions
columns = ["prompt", "generated", "reference", "score"]
table = wandb.Table(columns=columns)

for example in eval_examples:
    generated = model.generate(example.prompt)
    score = evaluate_quality(generated, example.reference)
    table.add_data(example.prompt, generated, example.reference, score)

wandb.log({"predictions": table})

# Log a histogram of token probabilities
import numpy as np
probs = get_token_probabilities(model, sample_text)
wandb.log({
    "token_probs": wandb.Histogram(probs),
})

# Log generated text samples
wandb.log({
    "sample_output": wandb.Html(f"<pre>{generated_text}</pre>"),
})

Tip

For LLM projects, log a prediction table at the end of each evaluation run. Include the prompt, generated text, reference text, and quality score. This creates a browsable history of how your model's outputs evolve across experiments, which is far more informative than aggregate metrics alone.

5. Artifacts: Dataset and Model Versioning

W&B Artifacts track datasets, model checkpoints, and other files with automatic versioning and lineage tracking.

# Log a dataset as an artifact
dataset_artifact = wandb.Artifact(
    name="alpaca-training-data",
    type="dataset",
    description="Cleaned Alpaca instruction dataset, 52k examples",
    metadata={"num_examples": 52002, "format": "jsonl"},
)
dataset_artifact.add_file("data/alpaca_cleaned.jsonl")
wandb.log_artifact(dataset_artifact)

# Log a model checkpoint as an artifact
model_artifact = wandb.Artifact(
    name="gpt2-lora-finetuned",
    type="model",
    metadata={
        "base_model": "gpt2",
        "lora_rank": 8,
        "val_loss": val_loss,
    },
)
model_artifact.add_dir("checkpoints/best/")
wandb.log_artifact(model_artifact)

# Later, download and use an artifact
artifact = wandb.use_artifact("gpt2-lora-finetuned:v3")
artifact_dir = artifact.download()

Artifacts create a dependency graph: each run consumes input artifacts and produces output artifacts. This lineage tracking lets you answer questions like "which training data produced this model?" and "which model checkpoint gave us the best evaluation results?"

6. Hyperparameter Sweeps

Sweeps automate hyperparameter search by launching multiple runs with different configurations. W&B supports grid search, random search, and Bayesian optimization.

# Define a sweep configuration
sweep_config = {
    "method": "bayes",  # "grid", "random", or "bayes"
    "metric": {
        "name": "val/loss",
        "goal": "minimize",
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-3,
        },
        "batch_size": {"values": [8, 16, 32]},
        "lora_rank": {"values": [4, 8, 16, 32]},
        "epochs": {"value": 3},  # Fixed value
    },
}

# Create the sweep
sweep_id = wandb.sweep(sweep_config, project="llm-fine-tuning")

# Define the training function
def train():
    run = wandb.init()
    config = wandb.config

    # Your training code here, using config values
    model = setup_model(lora_rank=config.lora_rank)
    optimizer = setup_optimizer(lr=config.learning_rate)
    train_loop(model, optimizer, epochs=config.epochs, batch_size=config.batch_size)

# Launch sweep agents
wandb.agent(sweep_id, function=train, count=20)  # Run 20 trials

Warning

Bayesian sweeps become more effective with more data points. If you can only afford a few runs, random search is often more efficient because Bayesian optimization needs an initial exploration phase. Switch to Bayesian optimization when you plan to run 20 or more trials.
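
Switching strategies is a one-field change in the sweep config. For example, a random-search version of the same search space looks like this (a sketch; only the "method" field differs from the Bayesian config):

```python
# Random search over the same space; a better fit for small trial budgets
random_sweep_config = {
    "method": "random",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-3,
        },
        "batch_size": {"values": [8, 16, 32]},
        "lora_rank": {"values": [4, 8, 16, 32]},
    },
}
# sweep_id = wandb.sweep(random_sweep_config, project="llm-fine-tuning")
```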

7. Finishing Runs and Best Practices

Always finish your run cleanly to ensure all data is uploaded. Before finishing, set summary metrics so experiments are easy to compare in the runs table.

# Log summary metrics (shown in the runs table)
wandb.run.summary["best_val_loss"] = best_val_loss
wandb.run.summary["best_val_accuracy"] = best_val_accuracy
wandb.run.summary["total_training_time"] = training_time_seconds

# Finish the run
wandb.finish()

# In notebooks or scripts that might crash, use a context manager:
with wandb.init(project="llm-fine-tuning") as run:
    # Training code here
    wandb.log({"train/loss": loss})
    # run.finish() is called automatically

Summary metrics appear in the main runs table and are the primary values used for filtering and sorting experiments. Always set the most important metrics as summary values: best validation loss, final accuracy, total training time, and peak memory usage.
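
Summary values can also be queried programmatically through the public API (wandb.Api().runs(...) supports ordering by summary fields). As a plain illustration of the ranking logic, using made-up run data in the shape the API returns per run:

```python
# Hypothetical run summaries, one dict per run
runs = [
    {"name": "run-01", "summary": {"best_val_loss": 0.31}},
    {"name": "run-02", "summary": {"best_val_loss": 0.27}},
    {"name": "run-03", "summary": {"best_val_loss": 0.29}},
]

# Sort ascending on the summary metric to find the best experiment
ranked = sorted(runs, key=lambda r: r["summary"]["best_val_loss"])
best = ranked[0]["name"]
```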