MLflow is an open-source platform for managing the complete ML lifecycle: experiment tracking, reproducible packaging, model versioning, and deployment. Unlike W&B (a hosted SaaS platform), MLflow can run entirely on your own infrastructure, making it popular in regulated industries and organizations with strict data governance requirements. This section covers MLflow's tracking API, project packaging, and the model registry that manages model versions from development through production.
1. Installation and Setup
MLflow installs as a Python package and includes a local tracking server with a web UI. For team collaboration, you can deploy a remote tracking server backed by a database and artifact store.
# Install MLflow
pip install mlflow

# Start the local tracking UI (serves on port 5000 by default)
mlflow ui --host 0.0.0.0 --port 5000

# In Python, point the client at the tracking server
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")

# Or use a remote server
mlflow.set_tracking_uri("https://mlflow.mycompany.com")
The tracking URI tells the MLflow client where to send data. For local development, the default file-based store (mlruns/ directory) works well. For team use, deploy a tracking server backed by PostgreSQL or MySQL for metadata and S3, Azure Blob, or GCS for artifacts.
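As a deployment sketch of that team setup (the hostname, credentials, and bucket name are placeholders, not a real configuration):

```shell
# Launch a tracking server with PostgreSQL for metadata
# and S3 for artifacts (placeholder connection string and bucket)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db.internal:5432/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000
```

Clients then use mlflow.set_tracking_uri("http://<server-host>:5000") and never touch the database or bucket directly.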
2. Experiments and Runs
MLflow organizes work into experiments (groups of related runs) and runs (individual executions). Create an experiment for each project or research question.
import mlflow

# Create or get an experiment
mlflow.set_experiment("llm-fine-tuning")

# Start a run
with mlflow.start_run(run_name="gpt2-lora-baseline") as run:
    # Log parameters (hyperparameters, configuration)
    mlflow.log_param("model", "gpt2")
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 16)
    mlflow.log_param("lora_rank", 8)
    mlflow.log_param("dataset", "alpaca-52k")

    # Log metrics (can be called multiple times for time series)
    val_losses = []
    for epoch in range(3):
        train_loss = train_one_epoch(model, dataloader)
        val_loss = evaluate(model, val_dataloader)
        val_losses.append(val_loss)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    # Log final metrics
    mlflow.log_metric("best_val_loss", min(val_losses))

    print(f"Run ID: {run.info.run_id}")
The with block ensures the run is closed properly even if an exception occurs. You can also use mlflow.start_run() and mlflow.end_run() explicitly, but the context manager pattern is safer.
MLflow distinguishes between parameters (set once per run, immutable) and metrics (logged repeatedly with a step counter). Log hyperparameters as parameters and training/evaluation scores as metrics. This distinction enables proper filtering and comparison in the UI.
3. Logging Artifacts
Artifacts are files associated with a run: model checkpoints, datasets, plots, configuration files, or any other output you want to preserve.
with mlflow.start_run():
    # Log a single file
    mlflow.log_artifact("config.yaml")

    # Log all files in a directory
    mlflow.log_artifacts("./outputs/plots/", artifact_path="plots")

    # Log a model checkpoint under its own artifact subdirectory
    mlflow.log_artifact(
        "checkpoints/best_model.pt",
        artifact_path="model",
    )

    # Log a text file with predictions
    with open("predictions.txt", "w") as f:
        for prompt, output in predictions:
            f.write(f"Prompt: {prompt}\nOutput: {output}\n\n")
    mlflow.log_artifact("predictions.txt")
Artifacts are stored in the configured artifact store (local filesystem, S3, Azure Blob, or GCS). Each run gets its own artifact directory, so there is no risk of overwriting artifacts from other runs.
4. Autologging
MLflow provides autologging integrations for popular frameworks. Enable autologging and MLflow captures parameters, metrics, and models automatically without manual log_param calls.
import mlflow

# Autolog for Hugging Face Transformers
mlflow.transformers.autolog(
    log_input_examples=True,
    log_model_signatures=True,
)

# Autolog for scikit-learn
mlflow.sklearn.autolog()

# Now your training code is instrumented automatically
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# This automatically logs the training args, metrics, and the model
trainer.train()
Use autologging for exploratory work and manual logging for production pipelines. Autologging captures everything, which is convenient but can produce noisy experiment records. For production, explicitly log only the parameters and metrics you need for decision-making.
5. MLflow Projects: Reproducible Packaging
An MLflow Project is a self-contained package that specifies dependencies, entry points, and parameters. It ensures that anyone can reproduce your experiment on any machine.
# MLproject file (YAML format):
#
# name: llm-fine-tuning
# conda_env: conda.yaml
#
# entry_points:
#   train:
#     parameters:
#       learning_rate: {type: float, default: 2e-5}
#       batch_size: {type: int, default: 16}
#       epochs: {type: int, default: 3}
#       model_name: {type: str, default: "gpt2"}
#     command: "python train.py --lr {learning_rate} --bs {batch_size}
#               --epochs {epochs} --model {model_name}"
#
#   evaluate:
#     parameters:
#       model_path: {type: str}
#     command: "python evaluate.py --model {model_path}"

import mlflow

# Run a project straight from a Git repo
mlflow.run(
    "https://github.com/myorg/llm-finetuning",
    entry_point="train",
    parameters={
        "learning_rate": 3e-5,
        "batch_size": 32,
    },
)
Projects can be run locally, on a remote cluster (via Kubernetes or Databricks), or from any Git repository. The dependency specification (conda environment or Docker image) ensures the execution environment is identical across runs.
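The conda.yaml referenced by the MLproject file might look like the following sketch (the package list and versions are illustrative, not prescriptive):

```yaml
# conda.yaml
name: llm-fine-tuning
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - mlflow
      - torch
      - transformers
```

Pinning exact versions here is what makes reruns reproducible; a loose spec like the one above trades reproducibility for convenience.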
6. The Model Registry
The model registry provides a centralized store for model versions, with stage transitions (Staging, Production, Archived) and access controls.
# Log a model to the registry
with mlflow.start_run():
    # Train your model...
    mlflow.log_metric("val_accuracy", 0.92)

    # Register the model
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        registered_model_name="gpt2-lora-qa",
    )

# Transition a model version to production
from mlflow import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="gpt2-lora-qa",
    version=3,
    stage="Production",
)

# Load the production model
model_uri = "models:/gpt2-lora-qa/Production"
loaded_model = mlflow.transformers.load_model(model_uri)
The model registry's stage transitions (Staging, Production, Archived) are being replaced by model aliases in newer MLflow versions. Aliases are more flexible: you can define custom tags like "champion" and "challenger" rather than being limited to fixed stages. Check your MLflow version and use the recommended API.
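With a registry-enabled tracking server, the alias workflow looks roughly like this sketch (the alias names and version numbers are illustrative, and the calls assume a registered model named gpt2-lora-qa already exists):

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Point the "champion" alias at version 3 of the registered model
client.set_registered_model_alias("gpt2-lora-qa", "champion", 3)

# Load whatever version the alias currently points to
model_uri = "models:/gpt2-lora-qa@champion"
loaded_model = mlflow.transformers.load_model(model_uri)

# Promote a new version by moving the alias; no stage transition needed
client.set_registered_model_alias("gpt2-lora-qa", "champion", 4)
```

Note the @alias URI syntax, as opposed to the /Stage syntax used with stage transitions.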
7. Querying Runs Programmatically
The MLflow client API lets you search and compare runs programmatically, which is useful for automated model selection and reporting.
from mlflow import MlflowClient

client = MlflowClient()

# Search for runs matching criteria
runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.val_accuracy > 0.85 and params.model = 'gpt2'",
    order_by=["metrics.val_accuracy DESC"],
    max_results=10,
)

# Print results
for run in runs:
    print(
        f"Run {run.info.run_id[:8]}: "
        f"accuracy={run.data.metrics['val_accuracy']:.3f}, "
        f"lr={run.data.params['learning_rate']}"
    )

# Find the best run (results are already sorted by accuracy)
best_run = runs[0]
print(f"Best run: {best_run.info.run_id}")
The filter string syntax supports comparisons on parameters, metrics, and tags. This programmatic access is essential for building automated ML pipelines where model selection is a code-driven decision, not a manual one.