Building Conversational AI with LLMs and Agents
Appendix R: Experiment Tracking: W&B and MLflow

Experiment Comparison and Hyperparameter Optimization

Big Picture

Running experiments is only half the battle; the other half is making sense of the results. This section covers techniques for comparing experiments across W&B and MLflow, automating hyperparameter optimization with sweep agents, and integrating with Optuna for advanced search strategies. The goal is to move from ad-hoc experimentation to systematic, data-driven model improvement.

1. Comparing Runs in W&B

W&B's web UI provides built-in comparison tools: parallel coordinates plots, scatter plots, and grouped metric charts. You can also build comparisons programmatically using the W&B API.

import wandb

api = wandb.Api()

# Fetch runs from a project
runs = api.runs(
    path="my-team/llm-fine-tuning",
    filters={"tags": {"$in": ["lora"]}, "state": "finished"},
    order="+summary_metrics.best_val_loss",  # ascending: best (lowest) loss first
)

# Build a comparison table
print(f"{'Run Name':<30} {'LR':<12} {'LoRA Rank':<10} {'Val Loss':<10}")
print("-" * 65)
for run in runs[:10]:
    val_loss = run.summary.get("best_val_loss")
    loss_str = f"{val_loss:.4f}" if val_loss is not None else "N/A"
    print(
        f"{run.name:<30} "
        f"{run.config.get('learning_rate', 'N/A'):<12} "
        f"{run.config.get('lora_rank', 'N/A'):<10} "
        f"{loss_str:<10}"
    )

The API returns run objects with full access to configs, metrics, summary values, and artifacts. This programmatic access lets you build custom dashboards, generate reports, or feed results into downstream automation.
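To feed W&B results into Pandas-based analysis like the MLflow examples below, you can flatten run objects yourself. The following is a minimal sketch; runs_to_dataframe is our own helper name, not part of the W&B API, and it works on anything exposing .name, .config, and .summary.

```python
import pandas as pd

def runs_to_dataframe(runs, config_keys, metric_keys):
    """Flatten run objects (anything exposing .name, .config, .summary)
    into a tidy DataFrame for custom analysis."""
    rows = []
    for run in runs:
        row = {"name": run.name}
        for key in config_keys:
            row[key] = run.config.get(key)       # None if the run lacks this param
        for key in metric_keys:
            row[key] = run.summary.get(key)      # None if the metric was never logged
        rows.append(row)
    return pd.DataFrame(rows)

# e.g. runs_to_dataframe(runs, ["learning_rate", "lora_rank"], ["best_val_loss"])
```

From here, the same groupby and pivot tooling applies to both platforms.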

2. Comparing Runs in MLflow

MLflow provides a similar comparison capability through its search API and UI. The search_runs function returns a Pandas DataFrame, making it easy to analyze results with standard data science tools.

import mlflow

# Search runs and get a DataFrame
df = mlflow.search_runs(
    experiment_names=["llm-fine-tuning"],
    filter_string="status = 'FINISHED'",
    order_by=["metrics.val_loss ASC"],
)

# Compare key metrics
comparison = df[
    ["run_id", "params.model", "params.learning_rate",
     "params.lora_rank", "metrics.val_loss", "metrics.val_accuracy"]
].head(10)

print(comparison.to_string(index=False))

# Find the best configuration
best = df.iloc[0]
print(f"\nBest run: {best['run_id'][:8]}")
print(f"  Model: {best['params.model']}")
print(f"  LR: {best['params.learning_rate']}")
print(f"  Val loss: {best['metrics.val_loss']:.4f}")
Key Insight

MLflow returns run data as a flat DataFrame with dot-separated column names (e.g., params.learning_rate, metrics.val_loss). This makes it trivial to use Pandas for grouping, pivoting, and statistical analysis. W&B's API returns nested objects that require more manual extraction but offer richer metadata.
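As a quick illustration of that flat layout (with made-up numbers, not results from any run above), a per-hyperparameter aggregation is a one-liner in Pandas:

```python
import pandas as pd

# Columns mimic MLflow's dot-separated naming convention
df = pd.DataFrame({
    "params.lora_rank": ["8", "8", "16", "16"],
    "metrics.val_loss": [0.92, 0.90, 0.85, 0.87],
})

# Mean and spread of validation loss per LoRA rank, best first
summary = (
    df.groupby("params.lora_rank")["metrics.val_loss"]
      .agg(["mean", "std"])
      .sort_values("mean")
)
print(summary)
```

Note that MLflow stores params as strings, so numeric params may need casting before sorting or plotting.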

3. W&B Sweeps: Automated Search

Sweeps automate the process of trying many hyperparameter combinations. The sweep controller runs on W&B's servers and dispatches configurations to your local agents.

import wandb

# Define sweep configuration
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-3,
        },
        "batch_size": {"values": [8, 16, 32, 64]},
        "warmup_steps": {"min": 0, "max": 500},
        "weight_decay": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.3,
        },
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,
        "eta": 3,
    },
}

# Create and run the sweep
sweep_id = wandb.sweep(sweep_config, project="llm-fine-tuning")

def train_fn():
    with wandb.init() as run:
        config = wandb.config
        val_loss = train_model(  # placeholder: trains and returns validation loss
            lr=config.learning_rate,
            bs=config.batch_size,
            warmup=config.warmup_steps,
            wd=config.weight_decay,
        )
        run.log({"val/loss": val_loss})  # must log the metric the sweep optimizes

wandb.agent(sweep_id, function=train_fn, count=50)

The early_terminate option uses Hyperband to stop unpromising runs early, saving compute. This is especially valuable for LLM fine-tuning where each run can take hours.
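To build intuition for the settings above: as we understand the Hyperband scheme, with min_iter=3 and eta=3 the controller checkpoints runs at iterations min_iter * eta**k and culls the weakest performers at each checkpoint. A tiny sketch of those milestones:

```python
# Checkpoint iterations for Hyperband-style early termination.
# With min_iter=3 and eta=3, runs are compared at 3, 9, 27, ... iterations,
# so a bad configuration costs at most a few cheap checkpoints.
min_iter, eta, max_iter = 3, 3, 100
brackets = []
k = 0
while min_iter * eta**k <= max_iter:
    brackets.append(min_iter * eta**k)
    k += 1
print(brackets)  # [3, 9, 27, 81]
```

A larger eta prunes more aggressively; a larger min_iter gives noisy early metrics more time to stabilize before any run is killed.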

4. Optuna Integration

Optuna is a dedicated hyperparameter optimization framework with sophisticated search algorithms. Both W&B and MLflow integrate with Optuna, giving you the best of both worlds: Optuna's search intelligence with your tracking platform's visualization and logging.

import optuna
import mlflow

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    lora_rank = trial.suggest_categorical("lora_rank", [4, 8, 16, 32])
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.1)

    with mlflow.start_run(nested=True):
        mlflow.log_params({
            "learning_rate": lr,
            "batch_size": batch_size,
            "lora_rank": lora_rank,
            "warmup_ratio": warmup_ratio,
        })

        # Train and evaluate
        val_loss = train_and_evaluate(lr, batch_size, lora_rank, warmup_ratio)
        mlflow.log_metric("val_loss", val_loss)

    return val_loss

# Run optimization
with mlflow.start_run(run_name="optuna-sweep"):
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=30)

    # Log best parameters
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_val_loss", study.best_value)

print(f"Best params: {study.best_params}")
Tip

Use Optuna's pruning feature (trial.report() and trial.should_prune()) to stop bad trials early. For LLM fine-tuning, report validation loss after each epoch and let Optuna prune trials that are performing significantly worse than the best so far. This can reduce total compute by 50% or more.

5. Statistical Comparison

When comparing two model configurations, a single run per configuration is insufficient because results vary due to random initialization, data shuffling, and GPU non-determinism. Run multiple seeds and use statistical tests to determine if differences are significant.

import numpy as np
from scipy import stats

# Run each configuration multiple times with different seeds
config_a_scores = [0.87, 0.89, 0.86, 0.88, 0.90]  # 5 seeds
config_b_scores = [0.91, 0.90, 0.92, 0.89, 0.91]  # 5 seeds

# Paired t-test (same data splits, different configs)
t_stat, p_value = stats.ttest_rel(config_a_scores, config_b_scores)
print(f"Mean A: {np.mean(config_a_scores):.3f} +/- {np.std(config_a_scores):.3f}")
print(f"Mean B: {np.mean(config_b_scores):.3f} +/- {np.std(config_b_scores):.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Difference is NOT statistically significant")
Mean A: 0.880 +/- 0.014
Mean B: 0.906 +/- 0.010
p-value: 0.0032
Difference is statistically significant

6. Visualization: Loss Curves and Parameter Importance

Both platforms offer rich visualization. For custom analysis, export data and use matplotlib or plotly.

import matplotlib.pyplot as plt
import mlflow

# Fetch training curves from multiple runs
client = mlflow.MlflowClient()
run_ids = ["abc123", "def456", "ghi789"]  # placeholder IDs; substitute real run IDs

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for run_id in run_ids:
    run = client.get_run(run_id)
    name = run.data.params.get("model", run_id[:8])

    # Get metric history
    train_history = client.get_metric_history(run_id, "train_loss")
    val_history = client.get_metric_history(run_id, "val_loss")

    steps = [m.step for m in train_history]
    train_vals = [m.value for m in train_history]
    val_vals = [m.value for m in val_history]

    axes[0].plot(steps, train_vals, label=name)
    axes[1].plot(steps, val_vals, label=name)

axes[0].set_title("Training Loss")
axes[1].set_title("Validation Loss")
for ax in axes:
    ax.legend()
    ax.set_xlabel("Step")
    ax.set_ylabel("Loss")

plt.tight_layout()
plt.savefig("loss_comparison.png")
Warning

When comparing loss curves across runs, ensure the x-axis (step count) is comparable. Runs with different batch sizes process different amounts of data per step. Normalize by total tokens processed or wall-clock time for fair comparisons. A lower per-step loss with a larger batch size does not necessarily mean a better configuration.
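The normalization is simple arithmetic; a small sketch (sequence length and batch sizes here are illustrative):

```python
# Convert a step count to tokens processed so curves from runs with
# different batch sizes share a comparable x-axis.
def steps_to_tokens(step, batch_size, seq_len):
    return step * batch_size * seq_len

# At step 1000, a batch-32 run has seen 4x the data of a batch-8 run.
print(steps_to_tokens(1000, 8, 2048))    # 16,384,000 tokens
print(steps_to_tokens(1000, 32, 2048))   # 65,536,000 tokens
```

When plotting, pass the converted values as the x-axis instead of the raw step numbers.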

7. Building a Comparison Workflow

A systematic comparison workflow has four stages: define the comparison, run the experiments, analyze the results, and document the decision.

import mlflow
from scipy import stats

# Automated comparison workflow
def compare_configs(configs: list[dict], num_seeds: int = 3):
    """Run multiple configs with multiple seeds and compare."""
    results = {}

    for config in configs:
        config = dict(config)  # copy so the caller's dicts are not mutated
        config_name = config.pop("name")
        scores = []

        for seed in range(num_seeds):
            config["seed"] = seed
            with mlflow.start_run(run_name=f"{config_name}_seed{seed}"):
                mlflow.log_params(config)
                score = train_and_evaluate(**config)
                mlflow.log_metric("val_accuracy", score)
                scores.append(score)

        results[config_name] = scores

    # Pairwise statistical comparison
    names = list(results.keys())
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, p = stats.ttest_rel(results[names[i]], results[names[j]])
            print(f"{names[i]} vs {names[j]}: p={p:.4f}")

    return results
baseline vs lora_r16: p=0.0218
baseline vs lora_r32: p=0.0089
lora_r16 vs lora_r32: p=0.3410

Documenting the decision (which configuration was chosen and why) is the most overlooked step. Use MLflow tags or W&B notes to record the rationale alongside the experiment data. Your future self will thank you.