Databricks provides a unified analytics platform that combines a collaborative notebook environment, managed Spark clusters, and a governance layer called Unity Catalog. For LLM practitioners, Databricks serves as the backbone for large-scale data preparation, distributed fine-tuning, and production model management. This section walks through workspace setup, notebook workflows, and the Unity Catalog data governance model that underpins enterprise ML.
T.3.1 The Databricks Workspace
A Databricks workspace is a cloud-hosted environment (available on AWS, Azure, and GCP) that bundles compute, storage, notebooks, and job orchestration into a single interface. When you create a workspace, Databricks provisions a control plane that manages cluster lifecycle, user authentication, and access control. The data plane, where your Spark jobs and ML training actually execute, runs inside your own cloud account, keeping sensitive data within your security perimeter.
Workspaces are organized around three core abstractions: clusters (managed Spark runtimes with optional GPU nodes), notebooks (interactive documents that mix code, visualizations, and markdown), and jobs (scheduled or triggered execution of notebooks and scripts). Each workspace supports multiple users with role-based access control (RBAC), enabling teams to share notebooks, datasets, and trained models without duplicating infrastructure.
# Install the Databricks CLI (the legacy Python CLI shown here installs
# via pip; the newer Go-based CLI ships as a standalone binary instead)
pip install databricks-cli
# Configure authentication with a personal access token
databricks configure --token
# Host: https://your-workspace.cloud.databricks.com
# Token: dapi1234567890abcdef
# List available clusters
databricks clusters list --output JSON
# Create a GPU cluster for ML training. Note that node_type_id is
# cloud-specific (this example uses an Azure A100 type), and num_workers
# and autoscale are mutually exclusive -- this cluster autoscales.
databricks clusters create --json '{
  "cluster_name": "llm-training-cluster",
  "spark_version": "14.3.x-gpu-ml-scala2.12",
  "node_type_id": "Standard_NC24ads_A100_v4",
  "autoscale": {"min_workers": 2, "max_workers": 8},
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  }
}'
Use autoscaling clusters for exploratory work and fixed-size clusters for production training jobs. Autoscaling adds latency when new nodes spin up, which can cause uneven gradient synchronization during distributed training. For fine-tuning LLMs, pin the cluster size to avoid mid-epoch disruptions.
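When the same training notebook later runs as a scheduled job, the fixed-size guidance above can be baked into the job definition itself. A minimal sketch (the job name, notebook path, and node type are hypothetical) that builds a Jobs API 2.1 payload as a plain Python dict, so cluster sizing can be validated before submitting with the CLI or SDK:

```python
import json

def training_job_payload(notebook_path: str, num_workers: int) -> dict:
    """Build a Databricks Jobs API 2.1 payload that runs a notebook
    on a fixed-size (non-autoscaling) GPU job cluster."""
    return {
        "name": "llm-sft-nightly",
        "tasks": [
            {
                "task_key": "train",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": "14.3.x-gpu-ml-scala2.12",
                    "node_type_id": "Standard_NC24ads_A100_v4",
                    # Fixed size: avoids mid-epoch autoscaling disruptions
                    "num_workers": num_workers,
                },
            }
        ],
    }

payload = training_job_payload("/Repos/ml/llm/train_sft", 4)
print(json.dumps(payload, indent=2))
```

Keeping the payload in version control alongside the training code makes cluster-sizing changes reviewable like any other change.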
T.3.2 Databricks Notebooks for ML Development
Databricks notebooks support Python, Scala, SQL, and R within a single document. For LLM workflows,
the typical pattern involves a Python notebook that loads data from Delta Lake (see
Section T.2), preprocesses it with Spark, and then trains or fine-tunes
a model using PyTorch or HuggingFace Transformers. Notebooks can attach to any running cluster, and
the %pip magic command installs additional Python packages directly into the cluster
environment.
# Databricks notebook cell: install dependencies
%pip install transformers datasets accelerate
# Cell 2: Load training data from Unity Catalog
from pyspark.sql import SparkSession

# In a Databricks notebook, spark is predefined; getOrCreate() returns
# that same session and also works when running the code elsewhere
spark = SparkSession.builder.getOrCreate()
# Read a Delta table managed by Unity Catalog
training_df = spark.read.table("ml_catalog.llm_data.instruction_pairs")
print(f"Training samples: {training_df.count():,}")
training_df.show(5, truncate=80)
# Cell 3: Convert Spark DataFrame to HuggingFace Dataset
import pandas as pd
from datasets import Dataset

# toPandas() collects the entire dataset onto the driver node;
# make sure it fits in driver memory (sample or filter first if not)
pdf = training_df.select("instruction", "response").toPandas()
hf_dataset = Dataset.from_pandas(pdf)
print(hf_dataset)
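Before tokenization, each instruction-response pair is typically rendered into a single training string. A minimal sketch, assuming an Alpaca-style template (a common convention, not something Databricks prescribes); the resulting function can be applied to the dataset above with hf_dataset.map(format_example):

```python
def format_example(example: dict) -> dict:
    """Render an instruction-response pair into a single SFT training string."""
    prompt = (
        "### Instruction:\n"
        f"{example['instruction']}\n\n"
        "### Response:\n"
        f"{example['response']}"
    )
    return {"text": prompt}

sample = {"instruction": "Explain gradient descent.",
          "response": "Gradient descent iteratively updates parameters..."}
print(format_example(sample)["text"])
```

Whichever template you choose, use the same one at inference time; a train/serve template mismatch silently degrades model quality.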
Notebooks also support widgets for parameterized execution. You can define dropdown menus, text inputs, and numeric sliders that become parameters when the notebook runs as a scheduled job. This pattern is valuable for hyperparameter sweeps where the same notebook runs multiple times with different learning rates or batch sizes.
# Create parameterized widgets for training configuration
dbutils.widgets.dropdown("model_name", "meta-llama/Llama-3.1-8B",
                         ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.3"])
dbutils.widgets.text("learning_rate", "2e-5")
dbutils.widgets.text("num_epochs", "3")
# Retrieve widget values
model_name = dbutils.widgets.get("model_name")
lr = float(dbutils.widgets.get("learning_rate"))
epochs = int(dbutils.widgets.get("num_epochs"))
print(f"Training {model_name} with lr={lr}, epochs={epochs}")
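One caveat: dbutils.widgets.get always returns strings, so the float()/int() conversions above fail with an opaque error if a widget holds a typo. A small helper (hypothetical, not part of dbutils) centralizes the coercion with a clearer failure message; in a notebook you would pass it dbutils.widgets.get(name) rather than the simulated dict used here:

```python
def coerce_param(raw: str, kind: type, name: str):
    """Convert a string widget value to the expected type, with a clear error."""
    try:
        return kind(raw)
    except ValueError as exc:
        raise ValueError(
            f"Widget {name!r}: cannot parse {raw!r} as {kind.__name__}"
        ) from exc

# Simulated widget values (dbutils.widgets.get returns strings like these)
widgets = {"learning_rate": "2e-5", "num_epochs": "3"}
lr = coerce_param(widgets["learning_rate"], float, "learning_rate")
epochs = coerce_param(widgets["num_epochs"], int, "num_epochs")
print(lr, epochs)
```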
T.3.3 Unity Catalog: Governed Data and Model Management
Unity Catalog is the centralized governance layer for all data assets in Databricks. It provides a three-level namespace: catalog (top-level container, often per environment or business unit), schema (logical grouping, similar to a database schema), and table/volume/model (the actual data or ML artifact). This hierarchy enables fine-grained access control. A data engineer can grant a team read access to specific schemas without exposing the entire catalog.
For LLM workflows, Unity Catalog is particularly valuable for three reasons. First, it tracks data lineage, showing which tables were used to produce a training dataset. Second, it manages model versions through the integrated MLflow Model Registry, so you can trace exactly which data and code produced a deployed model. Third, it enforces row-level and column-level security, which is critical when training data contains PII or proprietary content.
-- Create a catalog and schema for LLM training assets
CREATE CATALOG IF NOT EXISTS ml_catalog;
USE CATALOG ml_catalog;
CREATE SCHEMA IF NOT EXISTS llm_data
COMMENT 'Training data for LLM fine-tuning projects';
-- Create a managed Delta table for instruction-tuning pairs
CREATE TABLE IF NOT EXISTS llm_data.instruction_pairs (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  instruction STRING NOT NULL,
  response STRING NOT NULL,
  source STRING,
  quality_score FLOAT,
  created_at TIMESTAMP DEFAULT current_timestamp()
)
USING DELTA
-- Column DEFAULT values require this Delta table feature to be enabled
TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported')
COMMENT 'Curated instruction-response pairs for SFT training';
-- Grant read access to the ML engineering team
GRANT SELECT ON TABLE llm_data.instruction_pairs
TO `ml-engineers@company.com`;
Unity Catalog's lineage tracking is essential for LLM compliance. When a regulator asks "what data trained this model?", you can trace from the deployed model version back through the MLflow run to the exact Delta table version, including every transformation applied. This audit trail is impossible to reconstruct after the fact without a governance layer.
T.3.4 MLflow Integration on Databricks
Databricks provides a managed MLflow instance that is tightly integrated with Unity Catalog. Every notebook run can automatically log parameters, metrics, and artifacts to an MLflow experiment. For LLM fine-tuning, this means you can track learning rate schedules, evaluation loss curves, and model checkpoints without any additional infrastructure. The Model Registry in Unity Catalog then lets you promote a trained model through version aliases such as @champion and @challenger; the fixed None/Staging/Production stages belong to the legacy workspace registry, while in Unity Catalog promotion is simply a matter of reassigning an alias.
import mlflow

# Register models in Unity Catalog rather than the legacy workspace registry
mlflow.set_registry_uri("databricks-uc")

# Set the MLflow experiment (auto-created if it does not exist)
mlflow.set_experiment("/Users/you@company.com/llm-fine-tuning")

with mlflow.start_run(run_name="llama-sft-v1") as run:
    # Log training hyperparameters
    mlflow.log_params({
        "model_name": "meta-llama/Llama-3.1-8B",
        "learning_rate": 2e-5,
        "batch_size": 8,
        "num_epochs": 3,
        "lora_rank": 16,
        "dataset_version": "v2.1",
    })

    # ... training loop here ...

    # Log metrics at each epoch (illustrative values)
    for epoch in range(3):
        mlflow.log_metrics({
            "train_loss": 1.2 - epoch * 0.3,
            "eval_loss": 1.1 - epoch * 0.25,
        }, step=epoch)

    # Register the fine-tuned model in Unity Catalog
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="llm",
        registered_model_name="ml_catalog.llm_models.llama_sft",
    )

print(f"Run ID: {run.info.run_id}")
T.3.5 Databricks Model Serving
Once a model is registered in Unity Catalog, Databricks Model Serving can deploy it behind a REST endpoint with a single click or API call. The serving infrastructure supports both CPU and GPU endpoints, automatic scaling to zero, and A/B traffic routing between model versions. For LLM workloads, Databricks also offers provisioned throughput endpoints that guarantee a minimum tokens-per-second throughput, which is critical for production chatbot applications (see Appendix S for alternative inference serving engines).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Create a GPU-backed serving endpoint
w.serving_endpoints.create_and_wait(
    name="llama-sft-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="ml_catalog.llm_models.llama_sft",
                entity_version="1",
                workload_size="Small",  # Small / Medium / Large
                scale_to_zero_enabled=True,
                workload_type="GPU_MEDIUM",
            )
        ]
    ),
)

# Query the endpoint
import requests

response = requests.post(
    "https://your-workspace.cloud.databricks.com/serving-endpoints/llama-sft-endpoint/invocations",
    headers={"Authorization": "Bearer dapi..."},
    json={"inputs": {"instruction": "Explain gradient descent."}},
)
print(response.json())
Databricks Model Serving endpoints incur costs even when idle if scale-to-zero is disabled. For development and staging environments, always enable scale_to_zero_enabled=True. For production endpoints with latency SLAs, keep instances warm but set appropriate minimum replica counts to control spend.
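This cost policy can be captured in code rather than remembered. A sketch (the helper and the size choices are assumptions, not Databricks defaults) that derives the served-entity settings from the target environment, mirroring the ServedEntityInput fields used above:

```python
def serving_config(env: str, entity_name: str, version: str) -> dict:
    """Return served-entity settings: dev/staging scale to zero to save cost,
    prod keeps warm capacity for latency SLAs."""
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"Unknown environment: {env!r}")
    return {
        "entity_name": entity_name,
        "entity_version": version,
        "workload_size": "Medium" if env == "prod" else "Small",
        "workload_type": "GPU_MEDIUM",
        # Only non-prod endpoints are allowed to scale to zero
        "scale_to_zero_enabled": env != "prod",
    }

print(serving_config("dev", "ml_catalog.llm_models.llama_sft", "1"))
```

Centralizing the policy this way makes it hard to accidentally ship a dev endpoint that bills around the clock.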
Summary
Databricks provides the full lifecycle for enterprise LLM development: interactive notebooks for experimentation, managed Spark clusters with GPU support for distributed training, Unity Catalog for data governance and model versioning, and Model Serving for production deployment. The tight integration between these components eliminates the glue code that typically connects separate tools for data engineering, training, and serving. Delta Lake and the Lakehouse architecture that power the storage layer beneath these workflows are covered in Section T.2.