Section 19.5: Models

The base models you fine-tune in Part IV are the open-weights frontier from Chapter 12, with one wrinkle: not every model in that list is a good fine-tuning target. Some checkpoints were released with permissive licenses; others have license riders that affect commercial use. This section names the practical defaults.

19.5.1 Base models for fine-tuning

Llama-3.3 70B (Meta, Dec 2024) and the Llama 4 family (Meta, 2025-Q2: Scout, Maverick, with Behemoth previewed) are the open-weights flagship line from Meta's GenAI team and remain the most-fine-tuned family by a wide margin. Their objective for fine-tuners is to provide a battle-tested architecture with extensive tooling support (axolotl recipes, vLLM kernels, GGUF quantizations) so you can focus on the data, which matters because most published recipes assume Llama as the base. The core concept is a dense decoder-only transformer (Llama-3.3) or MoE (Llama 4 Scout / Maverick) with grouped-query attention and a 128K+ context; pick Llama-3.3 8B-class for single-GPU work, 70B for 8x H100 fine-tunes, Llama 4 Scout for state-of-the-art open quality. Check the Llama Community License's 700M MAU clause and the 2025 acceptable-use updates; for almost every developer, it is functionally permissive, but enterprise legal teams want it documented.
Qwen3 (Alibaba, 2025; 0.5B-235B) is the Qwen team's third-generation open release covering a wider size range than Llama and shipping Apache 2.0 for most sizes. Its objective is to be the open default for multilingual workloads, with notably stronger Chinese, Japanese, Korean, and Arabic coverage than the Llama family, which matters when your application is not English-only. The core concept is a dense + MoE family (235B-A22B is the active MoE flagship) sharing the same tokenizer and template across sizes. Pick Qwen3 for multilingual use cases or when you want sizes Llama does not publish (1.5B, 4B, 14B, 32B fill important gaps).
Mistral / Mixtral (Mistral AI, 2023+) is the French open-weights lineage that introduced sliding-window attention (7B) and competitive mixture-of-experts (Mixtral 8x7B, 8x22B). Its objective is to ship Apache 2.0 weights for small-to-mid models with above-class performance, which matters when your downstream license demands maximum permissiveness. The core concept for fine-tuners is that Mistral 7B's tokenizer and template are well-supported in TRL and the model fits comfortably in 24 GB for QLoRA. Pick Mistral 7B as the single-GPU LoRA default; for production at scale, Mistral Large requires a separate commercial license.
Gemma 3 (Google DeepMind, 2025; 2B, 9B, 27B) is Google's open release built from Gemini's research codebase. Its objective is to offer edge-friendly model sizes (the 2B runs on a phone, the 9B fits in 24 GB easily) with strong out-of-the-box quality, which matters for on-device or constrained deployments. The core concept is a dense decoder with a tokenizer optimized for multilingual efficiency and a long 128K context on the 27B. The Gemma license adds use restrictions Apache 2.0 does not, so read it; for many enterprises, Llama-3 8B is a smoother license fit than Gemma 9B even if quality is similar.
DeepSeek-V3 / V3.1 and DeepSeek-R1 (DeepSeek, 2024-2025) is the cost-efficient MoE frontier from a Chinese hedge-fund-backed lab whose weights are MIT-licensed. Its objective is to train and ship frontier-class models at a fraction of the cost of US labs (V3 reported around $5.6M of training compute), which matters when you want to do research fine-tunes at the truly large scale. The core concept is a heavily-MoE architecture (256 experts in V3) with only ~37B parameters active per token, plus the GRPO algorithm that powered the R1 reasoning variant. Pick DeepSeek when you need MoE economics and have the multi-node infrastructure to fine-tune it; avoid on single-GPU setups where it does not fit.
Phi-4 (Microsoft, 2024-12; 14B) is a small dense model trained on heavily-curated synthetic data with strong reasoning per parameter. Good fine-tuning base when you want a dense 14B that fits on a single 24 GB GPU at LoRA and beats Llama 8B on many reasoning tasks.
Granite 3 (IBM, 2024) is an Apache 2.0 open family relevant to enterprise readers who need a fully-permissive base; sizes 2B, 8B; coverage for code, RAG, and tool use.
Distilled-reasoning bases: DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-Distill-Qwen-32B are now the preferred starting points for reasoning-task fine-tunes; they inherit R1's reasoning style and are far cheaper to continue-tune than retraining from scratch.
Task-specific Qwen variants: Qwen3-Coder and Qwen3-Math are the right bases when your downstream task is code or math specifically; they save you from re-running a domain SFT round on top of a general base.
Pythia (EleutherAI, 2023) is a research-purposes model suite that ships training checkpoints every 1000 steps from 14M to 12B parameters. Its objective is to enable mechanistic interpretability and training-dynamics studies, which matters when you need to study how a behavior emerged across training rather than just what the final model does. Pick Pythia exclusively for research; for production fine-tuning, the modern open frontier is far stronger.

19.5.2 Pre-tuned models as starting points

For most production fine-tunes, you start from an already-instruction-tuned checkpoint, not the base model. Specifically:

Llama-4 Instruct (Meta, 2025) is the official instruction-tuned variant that ships alongside each base release. Its objective is to give a strong chat-ready checkpoint without you having to redo SFT and DPO yourself, which matters when your fine-tune is a domain adaptation rather than a from-scratch instruction tune. Pick Instruct when adding a vertical (legal, medical, code review) on top of conversational capability; start from base when you want to enforce your own chat template or behavior.
Qwen3 Instruct (Alibaba, 2025) is the official instruction-tuned variant of Qwen3 across all sizes. Its objective parallels Llama Instruct, with the added benefit of stronger multilingual chat behavior out of the box. Pick when your fine-tune targets multilingual applications and you do not want to retrain the multilingual chat template yourself.
Tulu 3 (AllenAI, 2024) is AllenAI's fully-open SFT + DPO + RLVR recipe applied to Llama-3 bases, shipping the only state-of-the-art open instruction-tuned model with the complete training pipeline disclosed. Its objective is to provide a reproducible reference: every dataset, every hyperparameter, every eval is published, which matters when you need to defend the provenance of your fine-tuned model. Pick Tulu 3 when you want a known-quantity starting point whose lineage you can fully audit.
Zephyr (Hugging Face H4 team, 2023) is the canonical DPO-tuned Mistral 7B series, born as the alignment-handbook's reference recipe. Its objective is to demonstrate what a clean DPO pipeline on UltraFeedback can do on top of a small base, which matters because Zephyr is the worked example most tutorials cite. Pick Zephyr today as a teaching reference; for production, Tulu 3 8B is a stronger small-model alignment target.

19.5.3 Comparing the bases

Table 19.5.1: 21.4.1 Base models for Part IV fine-tuning.

Model	Sizes	License	Best for
Llama-4	8B-405B	Llama Community	General-purpose, most-supported
Qwen3	0.5B-235B	Apache 2.0 (most sizes)	Multilingual, MoE option
Mistral	7B / Mixtral / Large	Apache 2.0 (small)	Single-GPU LoRA
Gemma 3	2B-27B	Gemma license	Edge-friendly sizes
DeepSeek-V4	~671B MoE	MIT	Research-scale fine-tunes

19.5.4 Defaults by hardware

On a single 24 GB consumer GPU, Llama-4 8B or Qwen3 7B with QLoRA is the sweet spot. On a single 80 GB H100, you can full-fine-tune up to ~13B or QLoRA up to ~70B. For anything above 70B, multi-GPU FSDP/DeepSpeed is required, covered in Appendix L.

Key Insight: License-aware model selection

If you intend to sell access to the resulting model, the license matters more than the benchmark score. Apache 2.0 and MIT are the safest; Llama Community License is fine for almost everyone but check the MAU threshold; Gemma and Qwen-MoE have their own clauses worth reading. Appendix O (Docker and Containers) includes the consolidated license table.

Warning

Llama 4 added new acceptable-use restrictions in 2025

Beyond the 700M-MAU clause that the LCL has always had, Llama 4 (2025-Q2) added an updated use-case list that explicitly restricts certain critical-infrastructure and military applications, plus an agreement-gating step for download (you must accept the license on the Hugging Face Hub before AutoModelForCausalLM.from_pretrained works). Gemma similarly requires acceptance. Apache 2.0 and MIT downloads do not gate. If your CI pipeline pulls bases at build time, factor in the per-model gating step.

Databricks AI and Foundation Models

Big Picture

Databricks has evolved from a data engineering platform into a full-stack AI platform. With the Mosaic AI suite, Databricks provides managed infrastructure for training large models, hosted foundation model APIs, SQL-native LLM functions, MLflow-based lifecycle management, and a built-in vector search service. This section covers each layer of that stack, from fine-tuning custom models with Composer to building end-to-end RAG applications that keep all components inside the Databricks Lakehouse. Teams that already use Databricks for data engineering can reach production-grade AI applications without introducing separate infrastructure for training, serving, or retrieval.

O.4.1 Mosaic AI: Training LLMs on Databricks

Mosaic AI is the unified AI development layer inside Databricks, built on technology from MosaicML (acquired by Databricks in 2023). It bundles two open-source libraries: Composer (a PyTorch training library with built-in efficiency algorithms) and Streaming (a dataset format designed for efficient random-access reads from object storage during distributed training). Together they let teams train or fine-tune large models on Databricks clusters without managing Ray or DeepSpeed directly, though those backends can be used underneath. See Section O.3 for the underlying Ray integration and Hugging Face Transformers for Hugging Face model loading patterns that combine naturally with Composer.

The Mosaic AI managed fine-tuning endpoint requires no cluster configuration. You submit a training job via the Databricks UI or REST API, specifying a base model, a Delta table of instruction pairs, and hyperparameters. Databricks provisions the compute, runs the training loop, logs results to MLflow, and registers the resulting model in Unity Catalog.

import requests
import json

# Submit a managed fine-tuning job via the Databricks REST API
DATABRICKS_HOST = "https://your-workspace.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "train_data_path": "dbfs:/user/data/instruction_pairs_train",
    "eval_data_path": "dbfs:/user/data/instruction_pairs_eval",
    "training_duration": "3ep",           # 3 epochs
    "learning_rate": 5e-6,
    "register_to": "ml_catalog.llm_models.llama_support_ft",
    "data_prep_cluster_id": "0101-123456-abcdefgh",
}

# Production code should wrap this in try/except for network errors.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/fine-tuning/runs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
run_id = response.json()["run_id"]
print(f"Fine-tuning run started: {run_id}")

Code Fragment 19.5.1: Submit a managed fine-tuning job via the Databricks REST API

from composer import Trainer
from composer.models import HuggingFaceModel
from composer.optim import DecoupledAdamW
from streaming import StreamingDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)

# Wrap in Composer's HuggingFace adapter
composer_model = HuggingFaceModel(
    model=base_model,
    tokenizer=tokenizer,
    use_logits=True,
)

# Streaming dataset from DBFS (MDS format written by Spark)
train_dataset = StreamingDataset(
    remote="s3://my-bucket/training-data/mds/",
    local="/tmp/streaming_cache/",
    shuffle=True,
    batch_size=8,
)

trainer = Trainer(
    model=composer_model,
    train_dataloader=torch.utils.data.DataLoader(train_dataset, batch_size=8),
    max_duration="3ep",
    optimizers=DecoupledAdamW(composer_model.parameters(), lr=5e-6, weight_decay=0.0),
    device="gpu",
    precision="bf16_mixed",
    save_folder="/dbfs/user/checkpoints/llama_sft/",
    loggers=[],  # MLflow logger added via mlflow.start_run() context
)
trainer.fit()

Code Fragment 19.5.2: Working with composer, Trainer, HuggingFaceModel.

from openai import OpenAI

# The openai client works with Databricks Foundation Model APIs
# by overriding the base_url and using a Databricks PAT as the key
client = OpenAI(
    api_key="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    base_url="https://your-workspace.azuredatabricks.net/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-dbrx-instruct",   # or databricks-meta-llama-3-1-70b-instruct
    messages=[
        {"role": "system", "content": "You are a helpful data engineering assistant."},
        {"role": "user", "content": "Explain the difference between Delta Live Tables and standard Delta tables."},
    ],
    max_tokens=512,
    temperature=0.3,
)
print(response.choices[0].message.content)

Output: Delta Live Tables (DLT) is a declarative ETL framework built on top of Delta Lake. The difference is in management: with standard Delta tables, you write and orchestrate your own pipelines (Spark jobs, schedulers, retries, schema enforcement). With DLT, you declare the expected table dependencies and quality expectations, and Databricks manages the orchestration, data quality monitoring, and incremental updates for you.

Code Fragment 19.5.3: Submitting a managed fine-tuning job via the Databricks REST API. The run logs automatically to MLflow and registers the finished model in Unity Catalog.

For teams that need more control, Composer can be used directly on a Databricks cluster. The StreamingDataset class reads sharded MDS-format data from DBFS or cloud storage with shuffle buffers that keep GPU utilisation high even at slow network speeds.

-- Classify customer support tickets by category using ai_classify()
SELECT
    ticket_id,
    ticket_text,
    ai_classify(
        ticket_text,
        ARRAY('billing', 'technical', 'account_access', 'feature_request', 'other')
    ) AS category,
    ai_extract(
        ticket_text,
        ARRAY('product_name', 'error_code', 'urgency_level')
    ) AS extracted_fields,
    ai_summarize(ticket_text, 50) AS short_summary
FROM raw_support_tickets
WHERE created_date >= CURRENT_DATE - INTERVAL 1 DAY;

Code Fragment 19.5.4: Direct Composer training on a Databricks GPU cluster. StreamingDataset streams shards from object storage, avoiding the need to copy the full dataset to local disk before training begins.

Note

The MPT model family (MPT-7B, MPT-30B) was trained by MosaicML using Composer and Streaming. These models use ALiBi positional encodings and were designed for long-context use. They remain useful as base models for domain-specific fine-tuning, especially when a commercially licensable model is required. See Hugging Face Transformers for loading MPT models from the Hugging Face Hub.

O.4.2 Foundation Model APIs

Databricks Foundation Model APIs provide pay-per-token access to hosted open-weight models (DBRX, Llama-3, Mistral, Mixtral) without the overhead of managing GPU infrastructure. The API follows the OpenAI Chat Completions format, so the standard openai client library works by pointing it at your Databricks workspace URL. This makes it straightforward to swap between hosted Databricks models and external providers such as OpenAI or Anthropic during development. For background on the OpenAI-compatible API contract, see Chapter 11.

-- Use ai_query() for custom prompts inside a Delta Live Tables pipeline
CREATE OR REFRESH STREAMING TABLE product_reviews_enriched AS
SELECT
    review_id,
    product_id,
    review_text,
    rating,
    -- Call the Foundation Model API for sentiment and entity extraction
    ai_query(
        'databricks-meta-llama-3-1-70b-instruct',
        CONCAT(
            'Analyze this product review. Return a JSON object with keys: ',
            '"sentiment" (positive/negative/neutral), ',
            '"mentioned_features" (array of product features mentioned), ',
            '"improvement_suggestions" (array of suggestions, empty if none). ',
            'Review: ', review_text
        )
    ) AS llm_analysis,
    -- Also classify sentiment directly
    ai_classify(review_text, ARRAY('positive', 'negative', 'neutral')) AS sentiment_label
FROM STREAM(LIVE.raw_product_reviews);

Code Fragment 19.5.5: Accessing Databricks Foundation Model APIs using the OpenAI client library. The only changes from a standard OpenAI call are the base_url and the Databricks personal access token.

Provisioned throughput is an alternative pricing tier that reserves dedicated capacity measured in tokens per second. It is appropriate when latency must be predictable (no cold starts, no queuing behind other tenants) or when usage volume makes per-token pricing uneconomical. Provisioned throughput endpoints are created in the Databricks Serving UI and billed per token-per-second of reserved capacity per hour, regardless of actual usage.

Key Insight

For enterprise use cases, Databricks Foundation Model APIs have a governance advantage over external providers: all data stays inside your cloud tenant, audit logs are captured in Unity Catalog, and model access is governed by the same RBAC policies as your data. When comparing with OpenAI or Anthropic APIs, factor in data residency requirements, latency SLAs, and the operational cost of managing separate API keys and audit trails.

O.4.3 AI Functions: LLM Calls from SQL

Databricks AI Functions expose LLM capabilities as built-in SQL functions, allowing data engineers to invoke models directly inside SQL queries, Delta Live Tables pipelines, and dbt models. The core function is ai_query(), which calls any Foundation Model API endpoint. Wrapper functions like ai_classify(), ai_extract(), and ai_summarize() provide structured interfaces for common text analytics tasks. This integrates naturally with the Delta Lake architecture covered in Section 19.4 (Data Pipeline Tooling) and the PySpark pipelines from Section 19.4 (Data Pipeline Tooling).

import mlflow
import mlflow.openai
from mlflow.models import infer_signature
import pandas as pd

mlflow.set_experiment("/Users/you@company.com/llm-finetuning-v3")

with mlflow.start_run(run_name="llama-support-ft-eval"):
    # Log hyperparameters
    mlflow.log_params({
        "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "learning_rate": 5e-6,
        "training_epochs": 3,
        "train_samples": 12400,
    })

    # Log prompt template used for fine-tuning as an artifact
    mlflow.log_text(
        "You are a helpful support agent. Answer the customer's question concisely.\n\nQuestion: {question}\n\nAnswer:",
        "prompt_template.txt",
    )

    # Evaluate the fine-tuned model with mlflow.evaluate()
    eval_data = pd.DataFrame({
        "inputs": ["How do I reset my password?", "Where is my order?"],
        "ground_truth": ["Go to Settings > Security > Reset Password.", "Check your order status at orders.example.com."],
    })

    results = mlflow.evaluate(
        model="endpoints:/ml_catalog.llm_models.llama_support_ft/1",
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.faithfulness(),
            mlflow.metrics.answer_relevance(),
            mlflow.metrics.toxicity(),
        ],
    )

    # Log aggregate metrics from the evaluation
    mlflow.log_metrics({
        "faithfulness_mean": results.metrics["faithfulness/v1/mean"],
        "relevance_mean": results.metrics["answer_relevance/v1/mean"],
        "toxicity_ratio": results.metrics["toxicity/v1/ratio"],
    })

    print(results.tables["eval_results_table"])

Code Fragment 19.5.6: Using ai_classify(), ai_extract(), and ai_summarize() to enrich raw support tickets in a single SQL query. Results can be written directly to a Delta table.

ai_query() provides lower-level access, letting you pass a custom prompt and parse the response. It accepts a model endpoint name and a prompt string, and returns the model's text response as a SQL string.

from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "ml_catalog.llm_models.llama_support_ft"

# Transition version 3 to Champion (replaces the previous champion)
client.set_registered_model_alias(
    name=model_name,
    alias="Champion",
    version="3",
)

# Create or update a Model Serving endpoint to use the new champion
import requests, json

DATABRICKS_HOST = "https://your-workspace.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

endpoint_config = {
    "name": "support-llm-endpoint",
    "config": {
        "served_models": [{
            "name": "llama-support-ft-champion",
            "model_name": model_name,
            "model_version": "3",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }]
    }
}

requests.put(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/support-llm-endpoint/config",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=endpoint_config["config"],
)

Code Fragment 19.5.7: Enriching a streaming Delta Live Tables pipeline with AI functions. The LLM call runs once per row as the stream processes; use provisioned throughput endpoints to control latency and cost at high row volumes.

Warning

AI functions add per-token cost to every SQL query or pipeline run that invokes them. At large scales (millions of rows), costs can exceed the compute cost of the query itself. Use column filters and partition pruning to limit rows processed, cache results in a Delta table rather than re-running on every query, and consider batching with the Python API instead of per-row SQL calls for bulk enrichment jobs.

O.4.4 MLflow for LLM Lifecycle Management

MLflow 2.x introduced first-class support for LLM workflows: logging prompts and completions as structured artifacts, evaluating outputs with LLM-specific metrics, and packaging custom LLM chains as pyfunc models for deployment. On Databricks, MLflow is pre-installed and integrated with Unity Catalog, so experiments, runs, and registered models all participate in the same access control and lineage tracking as your data assets. For the broader context of experiment tracking, see Weights & Biases; for LLM evaluation methodology, see Chapter 42.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Create a vector search endpoint (compute that serves queries)
vsc.create_endpoint(
    name="llm-rag-endpoint",
    endpoint_type="STANDARD",
)

# Create a Delta Sync index: Databricks computes embeddings automatically
# using the specified embedding model endpoint, and re-syncs when the
# source Delta table changes
index = vsc.create_delta_sync_index(
    endpoint_name="llm-rag-endpoint",
    source_table_name="ml_catalog.rag_data.product_docs",
    index_name="ml_catalog.rag_data.product_docs_index",
    pipeline_type="TRIGGERED",           # or CONTINUOUS for near-real-time sync
    primary_key="doc_id",
    embedding_source_column="content",   # column containing text to embed
    embedding_model_endpoint_name="databricks-bge-large-en",
)

Code Fragment 19.5.8: Using mlflow.evaluate() to assess a fine-tuned model on faithfulness, relevance, and toxicity. Results are logged to the MLflow run for comparison across model versions.

After evaluation, models are promoted through the Unity Catalog Model Registry stages (None, Staging, Champion) via the MLflow client or Databricks UI. Deployment to a Databricks Model Serving endpoint can then be triggered programmatically, completing the CI/CD loop described in Section O.5.

# Query the vector search index at inference time
results = index.similarity_search(
    query_text="How do I configure SSO with Azure AD?",
    columns=["doc_id", "title", "content", "last_updated"],
    num_results=5,
    filters={"doc_type": "integration_guide"},  # optional metadata filter
)

for hit in results["result"]["data_array"]:
    doc_id, title, content, last_updated, score = hit
    print(f"[{score:.3f}] {title} ({doc_id})")

Output: [0.847] Configuring SSO with Azure AD: SAML setup (DOC-1083) [0.821] Identity Provider Integration Guide (DOC-0742) [0.794] Azure AD Group Sync for SCIM Provisioning (DOC-1124) [0.768] Troubleshooting SSO Login Errors (DOC-0918) [0.741] Single Sign-On Overview and Architecture (DOC-0501)

Code Fragment 19.5.9: Promoting a model version to the Champion alias in Unity Catalog and updating a Model Serving endpoint to serve it. The alias decouples the endpoint configuration from the specific version number.

O.4.5 Databricks Vector Search

Databricks Vector Search is a managed vector index that lives alongside Delta tables in the Lakehouse. Unlike standalone vector databases, it synchronises automatically with its source Delta table: when rows are added, updated, or deleted, the index updates accordingly without requiring a separate ingestion pipeline. Embeddings can be computed by a Foundation Model API embedding endpoint and stored in the index, or you can supply pre-computed embedding columns. This end-to-end integration reduces the operational surface for RAG applications significantly compared to maintaining a separate service such as Pinecone or Weaviate alongside Databricks. For comparison with standalone vector databases, see Chapter 32.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, monotonically_increasing_id, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# UDF: split document text into overlapping chunks
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into chunks of ~chunk_size characters with overlap."""
    if not text:
        return []
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"chunk_text": text[start:end], "chunk_index": len(chunks)})
        if end == len(text):
            break
        start += chunk_size - overlap
    return chunks

chunk_schema = ArrayType(StructType([
    StructField("chunk_text", StringType()),
    StructField("chunk_index", IntegerType()),
]))
chunk_udf = udf(chunk_text, chunk_schema)

# Explode raw documents into chunk rows
raw_docs = spark.table("ml_catalog.rag_data.raw_documents")

chunked = (
    raw_docs
    .withColumn("chunks", chunk_udf(col("content")))
    .select("doc_id", "title", "doc_type", "last_updated", explode("chunks").alias("chunk"))
    .select(
        "doc_id", "title", "doc_type", "last_updated",
        col("chunk.chunk_text").alias("content"),
        col("chunk.chunk_index").alias("chunk_index"),
    )
    .withColumn("chunk_id", monotonically_increasing_id().cast("string"))
)

chunked.write.format("delta").mode("append").saveAsTable("ml_catalog.rag_data.product_docs")

Code Fragment 19.5.10: PySpark ingestion pipeline that chunks raw documents using an overlapping window UDF and writes them to the Delta table backing the Vector Search index. When this job runs, the index re-syncs automatically.

import mlflow
import mlflow.pyfunc
from databricks.vector_search.client import VectorSearchClient
from openai import OpenAI

class DatabricksRAGChain(mlflow.pyfunc.PythonModel):
    """MLflow pyfunc wrapper for a Databricks-native RAG chain."""

    def load_context(self, context):
        self.vsc = VectorSearchClient()
        self.index = self.vsc.get_index(
            endpoint_name="llm-rag-endpoint",
            index_name="ml_catalog.rag_data.product_docs_index",
        )
        self.llm = OpenAI(
            api_key=context.model_config["databricks_token"],
            base_url=context.model_config["databricks_host"] + "/serving-endpoints",
        )
        self.model_name = context.model_config.get(
            "generation_model", "databricks-dbrx-instruct"
        )

    def predict(self, context, model_input):
        questions = model_input["question"].tolist()
        answers = []

        for question in questions:
            # Retrieve top-5 relevant chunks
            results = self.index.similarity_search(
                query_text=question,
                columns=["title", "content"],
                num_results=5,
            )
            context_text = "\n\n".join(
                f"[{row[0]}]\n{row[1]}"
                for row in results["result"]["data_array"]
            )

            # Generate answer with retrieved context
            response = self.llm.chat.completions.create(
                model=self.model_name,
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "Answer the question using only the provided context. "
                            "If the context does not contain the answer, say so."
                        ),
                    },
                    {
                        "role": "user",
                        "content": f"Context:\n{context_text}\n\nQuestion: {question}",
                    },
                ],
                max_tokens=512,
                temperature=0.1,
            )
            answers.append(response.choices[0].message.content)

        import pandas as pd
        return pd.DataFrame({"answer": answers})

# Log the chain to MLflow and register it
with mlflow.start_run(run_name="rag-chain-v1"):
    model_config = {
        "databricks_host": "https://your-workspace.azuredatabricks.net",
        "databricks_token": "{{secrets/rag-scope/databricks-token}}",
        "generation_model": "databricks-dbrx-instruct",
    }
    mlflow.pyfunc.log_model(
        artifact_path="rag_chain",
        python_model=DatabricksRAGChain(),
        model_config=model_config,
        registered_model_name="ml_catalog.rag_apps.product_docs_rag",
    )

Code Fragment 19.5.11: A complete RAG chain packaged as an MLflow pyfunc model. The predict method queries the Vector Search index with metadata filters, assembles context, and calls the generation LLM. Once logged and registered, the chain deploys to a Databricks Model Serving endpoint.

Real-World Scenario

Auto-Syncing a Product-Docs Vector Index

A team maintaining a product documentation portal can point a Delta Sync index at the product_docs Delta table written by their documentation pipeline. When writers publish new articles, the Spark job appends rows to the Delta table, and the index syncs automatically within minutes (TRIGGERED mode) or seconds (CONTINUOUS mode). No separate vector database ETL job is needed, and the index is always consistent with the source of truth.

O.4.6 Building RAG Applications on Databricks

The Databricks Lakehouse provides every component needed for a production RAG system: Delta Lake for document storage, PySpark for chunking and preprocessing (see Section 19.4 (Data Pipeline Tooling)), Foundation Model APIs for embedding and generation, Vector Search for retrieval, and MLflow for chain evaluation and deployment. This section assembles those components into a complete RAG pipeline, from raw document ingestion through to a deployed Model Serving endpoint. The RAG architecture itself is covered in depth in Chapter 32.

**Figure 19.5.1**: End-to-end RAG architecture on Databricks. All components except the user query live inside the Lakehouse, sharing Unity Catalog governance and MLflow lineage.

The ingestion pipeline (Code Fragment L.4.8 above) chunks documents with PySpark, writes chunks to a Delta table, and lets Vector Search handle embedding and indexing automatically.

The RAG chain itself (Code Fragment L.4.9 above) is a Python function wrapped as a pyfunc MLflow model. This makes it deployable to any MLflow-compatible endpoint, including Databricks Model Serving.

Key Insight

Wrapping the RAG chain as an MLflow pyfunc model provides more than just deployment convenience. It enables mlflow.evaluate() to run the full chain end-to-end against a labeled evaluation set, measuring retrieval quality (context recall, context precision) and generation quality (faithfulness, answer relevance) in one call. Evaluation with structured metrics is covered in Chapter 42.

Summary

Databricks provides a vertically integrated path from raw data to deployed AI application. Mosaic AI handles distributed training and managed fine-tuning, with MLflow capturing every experiment for reproducibility (see Weights & Biases). Foundation Model APIs give pay-per-token access to hosted open-weight models using the familiar OpenAI interface, with provisioned throughput available for latency-sensitive workloads. AI Functions bring LLM calls into SQL, making it practical to enrich data at ingestion time inside Delta Live Tables pipelines (see Section 19.4 (Data Pipeline Tooling)). MLflow Evaluate provides LLM-specific metrics for comparing model versions before promotion through the Unity Catalog Model Registry. Vector Search adds a managed retrieval layer that auto-syncs with source Delta tables, removing the need for a separate vector database in most RAG architectures. Finally, all these components compose naturally into end-to-end RAG applications that are deployed, versioned, and evaluated entirely within the Lakehouse, as detailed in Chapter 32 and the evaluation framework of Chapter 42.

What's Next?

In the next section, Section 19.6: External Reading & Communities, we build on the material covered here.

Further Reading

Base Models for Fine-Tuning

Touvron, H., et al. (2023). "Llama 2." arXiv:2307.09288. Reference open-weight base model.

Jiang, A. Q., et al. (2024). "Mixtral of Experts." arXiv:2401.04088. Reference open-weight MoE.

Qwen Team (2024). "Qwen3 Technical Report." arXiv:2412.15115. Reference for the Qwen3 model family.