Building Conversational AI with LLMs and Agents
Appendix R

Experiment Tracking: W&B and MLflow

Big Picture

Experiment tracking is the practice of systematically logging every training run's hyperparameters, metrics, artifacts, and environment details so that results are reproducible and comparable. Without it, fine-tuning LLMs becomes a guessing game: you lose track of which configuration produced your best checkpoint, which dataset version you trained on, and why a particular run diverged.

This appendix covers two industry-leading platforms. Weights & Biases (W&B) provides a hosted experiment dashboard with real-time metric visualization, hyperparameter sweeps, model versioning, and collaborative team features. MLflow offers an open-source, self-hosted alternative with experiment tracking, a model registry, and deployment integrations. Both platforms have first-class support for HuggingFace Trainer, PyTorch, and popular fine-tuning libraries.

This material is for anyone running training or fine-tuning jobs, whether full pretraining runs, LoRA adaptations, or DPO alignment. It is also increasingly relevant for LLM evaluation workflows where you want to track prompt variants, evaluation scores, and model comparisons over time.

Experiment tracking integrates tightly with the training workflows in Chapter 14 (Fine-Tuning), Chapter 15 (PEFT), and Chapter 17 (Alignment). For evaluation metrics and benchmarking methodology, see Chapter 29 (Evaluation). Observability in production, which extends tracking beyond training, is covered in Chapter 30 (Observability).

Prerequisites

You should have completed Chapter 14 (Fine-Tuning Fundamentals) so you understand training loops, loss curves, and hyperparameter choices. Familiarity with basic Python and command-line tools is assumed. If you are tracking LLM evaluation rather than training, read Chapter 29 alongside this appendix.

When to Use This Appendix

Reach for experiment tracking whenever you are running more than a handful of training or evaluation experiments. Use W&B when you want a polished hosted dashboard with team collaboration, hyperparameter sweeps, and artifact versioning out of the box. Use MLflow when you need an open-source, self-hosted solution, or when your organization already uses the Databricks ecosystem (MLflow is natively integrated). For production monitoring after deployment, see Chapter 30 rather than this appendix.

Sections