Building Conversational AI with LLMs and Agents
Appendix R: Experiment Tracking: W&B and MLflow

MLflow: Tracking, Projects, and Model Registry

Big Picture

MLflow is an open-source platform for managing the complete ML lifecycle: experiment tracking, reproducible packaging, model versioning, and deployment. Unlike W&B (a hosted SaaS platform), MLflow can run entirely on your own infrastructure, making it popular in regulated industries and organizations with strict data governance requirements. This section covers MLflow's tracking API, project packaging, and the model registry that manages model versions from development through production.

1. Installation and Setup

MLflow installs as a Python package and includes a local tracking server with a web UI. For team collaboration, you can deploy a remote tracking server backed by a database and artifact store.

# Install MLflow
pip install mlflow

# Start the local tracking UI (runs on port 5000)
# In a terminal:
# mlflow ui --host 0.0.0.0 --port 5000

# In Python, set the tracking URI
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")

# Or use a remote server
mlflow.set_tracking_uri("https://mlflow.mycompany.com")

The tracking URI tells the MLflow client where to send data. For local development, the default file-based store (mlruns/ directory) works well. For team use, deploy a tracking server backed by PostgreSQL or MySQL for metadata and S3, Azure Blob, or GCS for artifacts.
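As a sketch of what such a deployment might look like, the server command below uses placeholder connection strings and a hypothetical bucket name; substitute your own database credentials and artifact location:

```
# Hypothetical team tracking server: PostgreSQL for metadata, S3 for artifacts
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db.internal:5432/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000
```

Clients then point at this server with mlflow.set_tracking_uri("http://&lt;server-host&gt;:5000").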

2. Experiments and Runs

MLflow organizes work into experiments (groups of related runs) and runs (individual executions). Create an experiment for each project or research question.

import mlflow

# Create or get an experiment
mlflow.set_experiment("llm-fine-tuning")

# Start a run
with mlflow.start_run(run_name="gpt2-lora-baseline") as run:
    # Log parameters (hyperparameters, configuration)
    mlflow.log_param("model", "gpt2")
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 16)
    mlflow.log_param("lora_rank", 8)
    mlflow.log_param("dataset", "alpaca-52k")

    # Log metrics (can be called multiple times for time series)
    val_losses = []
    for epoch in range(3):
        train_loss = train_one_epoch(model, dataloader)  # your training loop
        val_loss = evaluate(model, val_dataloader)       # your evaluation loop
        val_losses.append(val_loss)

        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    # Log the best validation loss across epochs
    mlflow.log_metric("best_val_loss", min(val_losses))

    print(f"Run ID: {run.info.run_id}")

The with block ensures the run is closed properly even if an exception occurs. You can also use mlflow.start_run() and mlflow.end_run() explicitly, but the context manager pattern is safer.

Key Insight

MLflow distinguishes between parameters (set once per run, immutable) and metrics (logged repeatedly with a step counter). Log hyperparameters as parameters and training/evaluation scores as metrics. This distinction enables proper filtering and comparison in the UI.
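Because parameters are flat key-value pairs, nested configuration dicts need flattening before they can be passed to mlflow.log_params. A minimal sketch of such a helper (the function name and dot separator are our own choices, not part of MLflow's API):

```python
def flatten_config(config: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten a nested dict into dotted keys suitable for mlflow.log_params."""
    items = {}
    for key, value in config.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested sections, prefixing with the parent key
            items.update(flatten_config(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

config = {"model": "gpt2", "lora": {"rank": 8, "alpha": 16}}
flat = flatten_config(config)
# flat == {"model": "gpt2", "lora.rank": 8, "lora.alpha": 16}
```

Passing the result to mlflow.log_params(flat) records each flattened key as a separate, filterable parameter.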

3. Logging Artifacts

Artifacts are files associated with a run: model checkpoints, datasets, plots, configuration files, or any other output you want to preserve.

with mlflow.start_run():
    # Log a single file
    mlflow.log_artifact("config.yaml")

    # Log all files in a directory
    mlflow.log_artifacts("./outputs/plots/", artifact_path="plots")

    # Log a model checkpoint with metadata
    mlflow.log_artifact(
        "checkpoints/best_model.pt",
        artifact_path="model",
    )

    # Log a text file with predictions
    with open("predictions.txt", "w") as f:
        for prompt, output in predictions:
            f.write(f"Prompt: {prompt}\nOutput: {output}\n\n")
    mlflow.log_artifact("predictions.txt")

Artifacts are stored in the configured artifact store (local filesystem, S3, Azure Blob, or GCS). Each run gets its own artifact directory, so there is no risk of overwriting artifacts from other runs.
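For structured outputs such as model predictions, JSON Lines is often easier to post-process later than free-form text. A sketch of a small writer (the helper name is ours; only log_artifact is MLflow's API):

```python
import json

def write_predictions_jsonl(predictions, path):
    """Write (prompt, output) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for prompt, output in predictions:
            f.write(json.dumps({"prompt": prompt, "output": output}) + "\n")

write_predictions_jsonl(
    [("What is MLflow?", "An open-source ML lifecycle platform.")],
    "predictions.jsonl",
)
# Then, inside a run: mlflow.log_artifact("predictions.jsonl")
```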

4. Autologging

MLflow provides autologging integrations for popular frameworks. Enable autologging and MLflow captures parameters, metrics, and models automatically without manual log_param calls.

import mlflow

# Enable autologging for every framework MLflow supports
mlflow.autolog()

# Or enable it for a single framework, e.g. scikit-learn
mlflow.sklearn.autolog()

# Hugging Face's Trainer also ships its own MLflow callback
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        report_to="mlflow",  # enable the built-in MLflow callback
    ),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Training arguments and metrics are logged automatically
trainer.train()

Tip

Use autologging for exploratory work and manual logging for production pipelines. Autologging captures everything, which is convenient but can produce noisy experiment records. For production, explicitly log only the parameters and metrics you need for decision-making.

5. MLflow Projects: Reproducible Packaging

An MLflow Project is a self-contained package that specifies dependencies, entry points, and parameters. It ensures that anyone can reproduce your experiment on any machine.

# MLproject file (YAML format)
# name: llm-fine-tuning
# conda_env: conda.yaml
#
# entry_points:
#   train:
#     parameters:
#       learning_rate: {type: float, default: 2e-5}
#       batch_size: {type: int, default: 16}
#       epochs: {type: int, default: 3}
#       model_name: {type: str, default: "gpt2"}
#     command: "python train.py --lr {learning_rate} --bs {batch_size}
#              --epochs {epochs} --model {model_name}"
#
#   evaluate:
#     parameters:
#       model_path: {type: str}
#     command: "python evaluate.py --model {model_path}"

# Run a project from a Git repo
mlflow.run(
    "https://github.com/myorg/llm-finetuning",
    entry_point="train",
    parameters={
        "learning_rate": 3e-5,
        "batch_size": 32,
    },
)

Projects can be run locally, on a remote cluster (via Kubernetes or Databricks), or from any Git repository. The dependency specification (conda environment or Docker image) ensures the execution environment is identical across runs.
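The same entry points can be invoked from the command line. The commands below assume the hypothetical MLproject file sketched above; paths and repository URL are placeholders:

```
# Run the "train" entry point of a local project directory
mlflow run . -e train -P learning_rate=3e-5 -P batch_size=32

# Or run directly from a Git repository
mlflow run https://github.com/myorg/llm-finetuning -e train -P learning_rate=3e-5
```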

6. The Model Registry

The model registry provides a centralized store for model versions, with lifecycle management (stage transitions, or aliases in newer versions) and version-level metadata such as descriptions and tags.

# Log a model to the registry
with mlflow.start_run():
    # Train your model...
    mlflow.log_metric("val_accuracy", 0.92)

    # Register the model
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        registered_model_name="gpt2-lora-qa",
    )

# Transition a model version to production
from mlflow import MlflowClient
client = MlflowClient()

client.transition_model_version_stage(
    name="gpt2-lora-qa",
    version=3,
    stage="Production",
)

# Load the production model
model_uri = "models:/gpt2-lora-qa/Production"
loaded_model = mlflow.transformers.load_model(model_uri)
Warning

The model registry's stage transitions (Staging, Production, Archived) are deprecated in favor of model aliases in newer MLflow versions. Aliases are more flexible: you can attach names such as "champion" and "challenger" to specific versions (via the client's set_registered_model_alias method) and load them with a models:/&lt;name&gt;@&lt;alias&gt; URI, rather than being limited to fixed stages. Check your MLflow version and use the recommended API.

7. Querying Runs Programmatically

The MLflow client API lets you search and compare runs programmatically, which is useful for automated model selection and reporting.

from mlflow import MlflowClient

client = MlflowClient()

# Search for runs matching criteria
runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.val_accuracy > 0.85 and params.model = 'gpt2'",
    order_by=["metrics.val_accuracy DESC"],
    max_results=10,
)

# Print results
for run in runs:
    print(
        f"Run {run.info.run_id[:8]}: "
        f"accuracy={run.data.metrics['val_accuracy']:.3f}, "
        f"lr={run.data.params['learning_rate']}"
    )

# Find the best run
best_run = runs[0]
print(f"Best run: {best_run.info.run_id}")

The filter string syntax supports comparisons on parameters, metrics, and tags. This programmatic access is essential for building automated ML pipelines where model selection is a code-driven decision, not a manual one.
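When the search criteria themselves come from code (for example, a scheduled model-selection job), it can help to assemble the filter string programmatically. A minimal sketch, assuming metric and parameter names like those above; the helper is our own, not part of MLflow:

```python
def build_filter(metrics=None, params=None):
    """Build an MLflow search filter string from threshold and equality dicts."""
    clauses = []
    # Metrics are compared numerically against a lower bound
    for name, threshold in (metrics or {}).items():
        clauses.append(f"metrics.{name} > {threshold}")
    # Parameters are stored as strings, so compare with quoted equality
    for name, value in (params or {}).items():
        clauses.append(f"params.{name} = '{value}'")
    return " and ".join(clauses)

filter_string = build_filter(metrics={"val_accuracy": 0.85},
                             params={"model": "gpt2"})
# filter_string == "metrics.val_accuracy > 0.85 and params.model = 'gpt2'"
```

The result can be passed directly as the filter_string argument to client.search_runs.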