Platforms

Section 19.1

Training is the chapter of the book where compute is the protagonist. The platforms in this section are not Python imports; they are GPU rentals, distributed-training fabrics, and experiment-tracking SaaS dashboards.

Compute provider tiers for LLM training in 2026
Figure 19.1.1: 2026 training-compute provider tiers. Tier 1 (bare-metal spot: Lambda, RunPod, Vast.ai, Hyperbolic, Salad, plus 2024-25 entrants SF Compute and Prime Intellect) hits $1.80 to $3 per H100-hour and serves fine-tuning and single-node training. Tier 2 (specialist clusters with NVLink and InfiniBand: CoreWeave, Crusoe, Nebius, NVIDIA DGX Cloud post-Lepton acquisition, plus Modal and Together AI serverless wrappers) handles multi-node fine-tunes and mid-scale pretraining; Mistral disclosed that Mixtral 8x22B (141B MoE) was trained on Crusoe. Tier 3 (hyperscalers and TPU: AWS Trainium2, Google Trillium TPU v6e, GCP A3 Ultra, Azure NDv5) at about $5 to $8 per H100-hour is the path for 100B-plus pretraining. Cross-cutting 2025 patterns include spot-friendly fsspec and s5cmd checkpointing, OpenDiLoCo for geo-distributed training, and W&B as the default tracker.

19.1.1 Compute providers for training

For Part IV's training-scale workloads, the same provider list from Section 10.6 applies, but the price-performance ratio matters more (and interconnect, multi-node, and persistent storage become first-class concerns). Lambda, RunPod, and Vast.ai dominate the bare-metal-GPU rental market. CoreWeave, Crusoe, and Nebius serve larger clusters with NVLink and InfiniBand interconnect. Modal and Together AI wrap GPU rental in serverless Python APIs that are friendlier for small teams.

Two 2024-25 entrants reshaped the multi-GPU spot market specifically for training. SF Compute is a multi-GPU spot marketplace that became the leading auction-priced option for training runs. Prime Intellect is a decentralized H100 marketplace plus an open-pretraining research org, relevant for academic teams who can tolerate non-collocated GPUs. Hyperbolic Labs and Salad Cloud cover the 4090-class tier as alternatives to Vast.ai.

For full pretraining, the hyperscalers (AWS, GCP, Azure) and TPU pods still win on raw scale. AWS Trainium2 (2024) and Google Trillium TPU (2024) are the two serious non-NVIDIA training accelerators in 2026. NVIDIA DGX Cloud (boosted by the 2025 Lepton AI acquisition) covers the enterprise tier. For fine-tuning, the specialists are almost always cheaper.
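
To make the price gap concrete, the following sketch turns the per-H100-hour ranges quoted in Figure 19.1.1 into a cost range for a job. The job size (8 GPUs for 24 hours) and the dictionary keys are illustrative assumptions, and Tier 2 is left out because the figure gives no price range for it.

```python
# Hourly H100 price ranges in USD, taken from Figure 19.1.1.
# The key names are hypothetical labels chosen for this sketch.
TIER_PRICE_PER_H100_HOUR = {
    "tier1_bare_metal_spot": (1.80, 3.00),
    "tier3_hyperscaler": (5.00, 8.00),
}


def job_cost_range(num_gpus: int, hours: float, tier: str) -> tuple[float, float]:
    """Return the (low, high) cost in USD of a job on the given tier."""
    low, high = TIER_PRICE_PER_H100_HOUR[tier]
    gpu_hours = num_gpus * hours
    return (round(gpu_hours * low, 2), round(gpu_hours * high, 2))


if __name__ == "__main__":
    # Assumed job: one 8-GPU node for 24 hours, i.e. 192 GPU-hours.
    for tier in TIER_PRICE_PER_H100_HOUR:
        low, high = job_cost_range(num_gpus=8, hours=24, tier=tier)
        print(f"{tier}: ${low:,.2f} to ${high:,.2f}")
```

Under these assumptions the same 192 GPU-hours cost roughly $346 to $576 on Tier 1 and $960 to $1,536 on Tier 3, before storage and egress charges.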

Two patterns dominate the 2025-26 distributed-training workflow regardless of provider. The first is spot-friendly checkpointing via fsspec + s5cmd (S3-compatible parallel upload). The second is distributed training over commodity Internet via OpenDiLoCo (2024), the open implementation of Google's DiLoCo, which lets academic teams pretrain across geographically separated nodes without InfiniBand.
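
The communication pattern behind DiLoCo can be illustrated with a toy, single-process sketch: each worker takes many local steps without talking to anyone, then all workers exchange one averaged parameter delta per round. This is not OpenDiLoCo's code. It assumes a scalar parameter, plain SGD as the inner optimizer (the DiLoCo paper uses AdamW), a Nesterov-momentum outer step, and made-up data shards and hyperparameters.

```python
import random


def batch_grad(theta: float, batch: list[float]) -> float:
    """Gradient of the mean of 0.5 * (theta - x)^2 over the batch."""
    return sum(theta - x for x in batch) / len(batch)


def diloco_round(theta, velocity, shards, rng,
                 inner_steps=20, inner_lr=0.1, outer_lr=0.7, momentum=0.9):
    """One outer round: local training on every worker, then one exchange."""
    deltas = []
    for shard in shards:                  # each shard lives on one worker
        local = theta                     # start from the shared parameters
        for _ in range(inner_steps):      # no communication in this loop
            batch = rng.sample(shard, k=4)
            local -= inner_lr * batch_grad(local, batch)
        deltas.append(theta - local)      # this worker's pseudo-gradient
    avg_delta = sum(deltas) / len(deltas)  # the only all-reduce per round
    # Outer update with Nesterov momentum, treating avg_delta as a gradient.
    velocity = momentum * velocity + avg_delta
    theta = theta - outer_lr * (avg_delta + momentum * velocity)
    return theta, velocity


def train(rounds: int = 30, seed: int = 0) -> float:
    rng = random.Random(seed)
    # Two workers with different data; the shard means are 1.0 and 3.0.
    shards = [[0.8, 0.9, 1.0, 1.1, 1.2], [2.8, 2.9, 3.0, 3.1, 3.2]]
    theta, velocity = 10.0, 0.0
    for _ in range(rounds):
        theta, velocity = diloco_round(theta, velocity, shards, rng)
    return theta


if __name__ == "__main__":
    print(f"theta after training: {train():.3f}")
```

With 20 inner steps per round, the workers communicate once for every 20 gradient steps, which is what makes slow links between sites tolerable. The shared parameter settles near 2.0, the average of the two shard means.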

Real-World Scenario: Mistral and Mixtral 8x22B on Crusoe

Mistral's 2024 disclosure that Mixtral 8x22B was trained on Crusoe's bare-metal H100 cluster grounds the bare-metal-vs-hyperscaler choice: a 141B-parameter MoE base model trained on a single specialist provider, with no Big Three cloud involvement. For most teams the lesson is that "needing AWS" is overrated; for production-scale training, the specialist providers are simultaneously cheaper and faster, provided your stack tolerates fewer managed services.

19.1.2 Distributed-training fabrics

19.1.3 Experiment tracking

Warning: Spot instances and checkpointing

Spot rentals are 50 to 70 percent cheaper but can vanish with two minutes' notice. Always checkpoint to S3 (or equivalent) every N steps. Recovering a 24-hour fine-tune from scratch is a worse outcome than paying double.
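
A minimal sketch of the checkpoint-and-resume loop follows. It uses only the standard library, and a local directory stands in for the S3 bucket; in a real job the finished file would be uploaded to object storage. The function names, the JSON state, and the toy "loss" update are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write atomically so a preemption never leaves a partial checkpoint."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)  # atomic rename on the same filesystem
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return (step, state) of the newest checkpoint, or (0, None) if none."""
    if not ckpt_dir.exists():
        return 0, None
    candidates = sorted(ckpt_dir.glob("step_*.json"))  # zero-padded, so sorted by step
    if not candidates:
        return 0, None
    with open(candidates[-1]) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, checkpoint_every: int, stop_at=None) -> int:
    """Toy loop: resume from the latest checkpoint and save every N steps."""
    step, state = load_latest_checkpoint(ckpt_dir)
    state = state or {"loss": 10.0}
    while step < total_steps:
        step += 1
        state["loss"] *= 0.99          # stand-in for one optimizer step
        if step % checkpoint_every == 0:
            save_checkpoint(ckpt_dir, step, state)
        if stop_at is not None and step == stop_at:
            return step                # simulate a spot preemption
    return step
```

If the instance is reclaimed at step 60 with `checkpoint_every=25`, calling `train` again resumes from step 50, so at most 24 steps of work are ever lost.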

Databricks Workspace and Unity Catalog

Big Picture

Databricks provides a unified analytics platform that combines a collaborative notebook environment, managed Spark clusters, and a governance layer called Unity Catalog. For LLM practitioners, Databricks serves as the backbone for large-scale data preparation, distributed fine-tuning, and production model management. This section walks through workspace setup, notebook workflows, and the Unity Catalog data governance model that underpins enterprise ML.

O.3.1 The Databricks Workspace

A Databricks workspace is a cloud-hosted environment (available on AWS, Azure, and GCP) that bundles compute, storage, notebooks, and job orchestration into a single interface. When you create a workspace, Databricks provisions a control plane that manages cluster lifecycle, user authentication, and access control. The data plane, where your Spark jobs and ML training actually execute, runs inside your own cloud account, keeping sensitive data within your security perimeter.

Workspaces are organized around three core abstractions: clusters (managed Spark runtimes with optional GPU nodes), notebooks (interactive documents that mix code, visualizations, and markdown), and jobs (scheduled or triggered execution of notebooks and scripts). Each workspace supports multiple users with role-based access control (RBAC), enabling teams to share notebooks, datasets, and trained models without duplicating infrastructure.

# Install the Databricks CLI
pip install databricks-cli

# Configure authentication with a personal access token
databricks configure --token
# Host: https://your-workspace.cloud.databricks.com
# Token: dapi1234567890abcdef

# List available clusters
databricks clusters list --output JSON

# Create a fixed-size GPU cluster for ML training
# (num_workers and autoscale are alternatives; specify only one)
databricks clusters create --json '{
  "cluster_name": "llm-training-cluster",
  "spark_version": "14.3.x-gpu-ml-scala2.12",
  "node_type_id": "Standard_NC24ads_A100_v4",
  "num_workers": 4,
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  }
}'
Code Fragment 19.1.1: Install and configure the Databricks CLI, list the available clusters, and create a GPU cluster for ML training.
Tip

Use autoscaling clusters for exploratory work and fixed-size clusters for production training jobs. Autoscaling adds latency when new nodes spin up, which can cause uneven gradient synchronization during distributed training. For fine-tuning LLMs, pin the cluster size to avoid mid-epoch disruptions.
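
The choice between the two sizing modes can be made explicit when the cluster definition is generated in code. The helper below is a hypothetical sketch, not part of any Databricks SDK. It reuses the field names from the cluster JSON above and assumes that a cluster definition should carry either a fixed `num_workers` or an `autoscale` range, never both.

```python
import json


def cluster_spec(name, spark_version, node_type_id, num_workers=None, autoscale=None):
    """Build a cluster definition with exactly one sizing mode.

    num_workers: fixed worker count (suggested above for training jobs).
    autoscale:   (min_workers, max_workers) tuple for exploratory work.
    """
    if (num_workers is None) == (autoscale is None):
        raise ValueError("Set exactly one of num_workers or autoscale")
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
    }
    if num_workers is not None:
        spec["num_workers"] = num_workers
    else:
        min_workers, max_workers = autoscale
        if not 0 <= min_workers <= max_workers:
            raise ValueError("Need 0 <= min_workers <= max_workers")
        spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    return spec


if __name__ == "__main__":
    training = cluster_spec(
        "llm-training-cluster",
        "14.3.x-gpu-ml-scala2.12",
        "Standard_NC24ads_A100_v4",
        num_workers=4,  # pinned size for a fine-tuning run
    )
    print(json.dumps(training, indent=2))
```

The resulting dictionary can be serialized and passed to whichever cluster-creation interface the team already uses.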

O.3.2 Databricks Notebooks for ML Development

Databricks notebooks support Python, Scala, SQL, and R within a single document. For LLM workflows, the typical pattern involves a Python notebook that loads data from Delta Lake (see Section 19.4 (Datasets & Benchmarks)), preprocesses it with Spark, and then trains or fine-tunes a model using PyTorch or Hugging Face Transformers. Notebooks can attach to any running cluster, and the %pip magic command installs additional Python packages directly into the cluster environment.

# Cell 1: Install dependencies
%pip install transformers datasets accelerate

# Cell 2: Load training data from Unity Catalog
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # returns the notebook's existing session
# Read a Delta table managed by Unity Catalog
training_df = spark.read.table("ml_catalog.llm_data.instruction_pairs")
print(f"Training samples: {training_df.count():,}")
training_df.show(5, truncate=80)

# Cell 3: Convert Spark DataFrame to HuggingFace Dataset
from datasets import Dataset
# toPandas() collects every selected row onto the driver node
pdf = training_df.select("instruction", "response").toPandas()
hf_dataset = Dataset.from_pandas(pdf)
print(hf_dataset)
Code Fragment 19.1.2: Install dependencies, read a Unity Catalog Delta table with Spark, and convert it to a Hugging Face Dataset.

Notebooks also support widgets for parameterized execution. You can define dropdown menus, text inputs, and numeric sliders that become parameters when the notebook runs as a scheduled job. This pattern is valuable for hyperparameter sweeps where the same notebook runs multiple times with different learning rates or batch sizes.

# Create parameterized widgets for training configuration
dbutils.widgets.dropdown("model_name", "meta-llama/Llama-3.1-8B",
    ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.3"])
dbutils.widgets.text("learning_rate", "2e-5")
dbutils.widgets.text("num_epochs", "3")

# Retrieve widget values
model_name = dbutils.widgets.get("model_name")
lr = float(dbutils.widgets.get("learning_rate"))
epochs = int(dbutils.widgets.get("num_epochs"))

print(f"Training {model_name} with lr={lr}, epochs={epochs}")
Code Fragment 19.1.3: Create parameterized widgets for training configuration
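
For the hyperparameter-sweep pattern described above, each job run needs one set of widget values. The sketch below expands a grid into one parameter dictionary per run, using the widget names from Code Fragment 19.1.3. The function name and the grid values are assumptions, and how the dictionaries are submitted as job parameters is left out.

```python
from itertools import product


def sweep_parameters(grid: dict[str, list]) -> list[dict[str, str]]:
    """Expand a grid into one {widget_name: value} dict per run.

    Values are converted to strings because widget values are read back
    as strings and parsed inside the notebook.
    """
    keys = list(grid)
    return [
        {key: str(value) for key, value in zip(keys, combo)}
        for combo in product(*(grid[key] for key in keys))
    ]


if __name__ == "__main__":
    grid = {
        "model_name": ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.3"],
        "learning_rate": [1e-5, 2e-5, 5e-5],
        "num_epochs": [3],
    }
    runs = sweep_parameters(grid)
    print(f"{len(runs)} runs")
    for params in runs:
        print(params)
```

The 2 x 3 x 1 grid yields six runs, each of which the notebook parses with the same `float(...)` and `int(...)` calls it applies to interactive widget input.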

O.3.3 Unity Catalog: Governed Data and Model Management

Unity Catalog is the centralized governance layer for all data assets in Databricks. It provides a three-level namespace: catalog (top-level container, often per environment or business unit), schema (logical grouping, similar to a database schema), and table/volume/model (the actual data or ML artifact). This hierarchy enables fine-grained access control. A data engineer can grant a team read access to specific schemas without exposing the entire catalog.
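
As a small illustration of the three-level namespace, the sketch below splits a fully qualified name into its catalog, schema, and object parts. The `UCName` class is hypothetical, and the parser is deliberately simple: it assumes plain identifiers and does not handle backtick-quoted names that contain dots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UCName:
    """A fully qualified name of the form catalog.schema.object."""
    catalog: str
    schema: str
    name: str

    @classmethod
    def parse(cls, full_name: str) -> "UCName":
        parts = full_name.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"Expected catalog.schema.object, got {full_name!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.name}"


if __name__ == "__main__":
    table = UCName.parse("ml_catalog.llm_data.instruction_pairs")
    print(table.catalog, table.schema, table.name)
    # A model registered in the same catalog but a different schema
    model = UCName(table.catalog, "llm_models", "llama_sft")
    print(model)
```

Keeping the three parts separate makes it easy to reason about permissions at the right level, for example by granting access on `table.schema` rather than on the whole catalog.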

For LLM workflows, Unity Catalog is particularly valuable for three reasons. First, it tracks data lineage, showing which tables were used to produce a training dataset. Second, it manages model versions through the integrated MLflow Model Registry, so you can trace exactly which data and code produced a deployed model. Third, it enforces row-level and column-level security, which is critical when training data contains PII or proprietary content.

-- Create a catalog and schema for LLM training assets
CREATE CATALOG IF NOT EXISTS ml_catalog;
USE CATALOG ml_catalog;

CREATE SCHEMA IF NOT EXISTS llm_data
  COMMENT 'Training data for LLM fine-tuning projects';

-- Create a managed Delta table for instruction-tuning pairs
CREATE TABLE IF NOT EXISTS llm_data.instruction_pairs (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  instruction STRING NOT NULL,
  response STRING NOT NULL,
  source STRING,
  quality_score FLOAT,
  created_at TIMESTAMP DEFAULT current_timestamp()
)
USING DELTA
COMMENT 'Curated instruction-response pairs for SFT training'
-- Column DEFAULT values require this Delta table feature
TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');

-- Grant read access to the ML engineering team
GRANT SELECT ON TABLE llm_data.instruction_pairs
  TO `ml-engineers@company.com`;
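Code Fragment 19.1.4: Create a catalog, a schema, and a managed Delta table for instruction-tuning data, then grant read access to the ML engineering team.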
Key Insight

Unity Catalog's lineage tracking is essential for LLM compliance. When a regulator asks "what data trained this model?", you can trace from the deployed model version back through the MLflow run to the exact Delta table version, including every transformation applied. This audit trail is impossible to reconstruct after the fact without a governance layer.

O.3.4 MLflow Integration on Databricks

Databricks provides a managed MLflow instance that is tightly integrated with Unity Catalog. Every notebook run can automatically log parameters, metrics, and artifacts to an MLflow experiment. For LLM fine-tuning, this means you can track learning rate schedules, evaluation loss curves, and model checkpoints without any additional infrastructure. The Model Registry in Unity Catalog then lets you promote a trained model by assigning aliases (for example, champion and challenger) to specific versions; the fixed stages (None, Staging, Production) belong to the legacy workspace registry and are not supported for Unity Catalog models.

import mlflow

# Register models in Unity Catalog rather than the legacy workspace registry
mlflow.set_registry_uri("databricks-uc")

# Set the MLflow experiment (auto-created if it does not exist)
mlflow.set_experiment("/Users/you@company.com/llm-fine-tuning")

with mlflow.start_run(run_name="llama-sft-v1") as run:
    # Log training hyperparameters
    mlflow.log_params({
        "model_name": "meta-llama/Llama-3.1-8B",
        "learning_rate": 2e-5,
        "batch_size": 8,
        "num_epochs": 3,
        "lora_rank": 16,
        "dataset_version": "v2.1",
    })

    # ... training loop here; it is assumed to define `model` and `tokenizer` ...

    # Log metrics at each epoch (placeholder values stand in for real losses)
    for epoch in range(3):
        mlflow.log_metrics({
            "train_loss": 1.2 - epoch * 0.3,
            "eval_loss": 1.1 - epoch * 0.25,
        }, step=epoch)

    # Register the model in Unity Catalog
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="llm",
        task="text-generation",  # lets MLflow build a pipeline from the components
        registered_model_name="ml_catalog.llm_models.llama_sft",
    )

    print(f"Run ID: {run.info.run_id}")
Code Fragment 19.1.5: Log hyperparameters and per-epoch metrics to an MLflow experiment, then register the fine-tuned model in Unity Catalog.

O.3.5 Databricks Model Serving

Once a model is registered in Unity Catalog, Databricks Model Serving can deploy it behind a REST endpoint with a single click or API call. The serving infrastructure supports both CPU and GPU endpoints, automatic scaling to zero, and A/B traffic routing between model versions. For LLM workloads, Databricks also offers provisioned throughput endpoints that guarantee a minimum tokens per second rate, which is critical for production chatbot applications (see Section 10.7 (vLLM Deep Dive) for alternative inference serving engines).

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Create a GPU-backed serving endpoint
w.serving_endpoints.create_and_wait(
    name="llama-sft-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="ml_catalog.llm_models.llama_sft",
                entity_version="1",
                workload_size="Small",      # Small / Medium / Large
                scale_to_zero_enabled=True,
                workload_type="GPU_MEDIUM",
            )
        ]
    ),
)

# Query the endpoint
import requests
response = requests.post(
    "https://your-workspace.cloud.databricks.com/serving-endpoints/llama-sft-endpoint/invocations",
    headers={"Authorization": "Bearer dapi..."},
    json={"inputs": {"instruction": "Explain gradient descent."}},
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
print(response.json())
Output: {"predictions": [{"text": "Gradient descent is an iterative optimization algorithm..."}]}
Code Fragment 19.1.6: Provision a GPU-backed Databricks Model Serving endpoint from a Unity Catalog model version and call it via REST. The scale_to_zero_enabled=True flag is what makes the endpoint cheap for staging; remove it for low-latency production.
Warning

Databricks Model Serving endpoints incur costs even when idle if scale-to-zero is disabled. For development and staging environments, always set scale_to_zero_enabled=True. For production endpoints with latency SLAs, keep instances warm, but set the minimum replica count no higher than the SLA requires so that idle spend stays under control.
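
When calling an endpoint from scripts, it helps to keep the access token out of source code. The sketch below builds the same invocation request with the standard library and reads the host and token from the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables. The helper name is hypothetical, the URL path and payload shape are taken from Code Fragment 19.1.6, and the request is only constructed, not sent.

```python
import json
import os
import urllib.request


def build_invocation_request(host: str, endpoint_name: str, token: str,
                             instruction: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"inputs": {"instruction": instruction}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Fall back to the placeholder host used in this section if unset.
    host = os.environ.get("DATABRICKS_HOST",
                          "https://your-workspace.cloud.databricks.com")
    token = os.environ.get("DATABRICKS_TOKEN", "")
    request = build_invocation_request(
        host, "llama-sft-endpoint", token, "Explain gradient descent."
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request, timeout=60)
```

Separating request construction from sending also makes the call easy to unit-test without a live endpoint.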

Summary

Databricks provides the full lifecycle for enterprise LLM development: interactive notebooks for experimentation, managed Spark clusters with GPU support for distributed training, Unity Catalog for data governance and model versioning, and Model Serving for production deployment. The tight integration between these components eliminates the glue code that typically connects separate tools for data engineering, training, and serving. In the next section, we examine Delta Lake and the Lakehouse architecture that powers the storage layer beneath these workflows.

What's Next?

In the next section, Section 19.11: Weights and Biases Deep Dive, we build on the material covered here.

Further Reading

Distributed Training Frameworks

Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). "DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters." KDD 2020. dl.acm.org/doi/10.1145/3394486.3406703. The DeepSpeed paper.
Shoeybi, M., Patwary, M., Puri, R., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053. The Megatron-LM paper.
Zhao, Y., Gu, A., Varma, R., et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." VLDB 2023. arXiv:2304.11277. The FSDP paper.

Experiment Tracking

Weights & Biases (2024). "W&B Documentation." docs.wandb.ai. Reference experiment-tracking platform.
MLflow (2024). "MLflow Documentation." mlflow.org/docs/latest. Reference open-source experiment tracker.