Training and serving large language models often exceed the capacity of a single machine. Distributed ML encompasses the techniques, frameworks, and infrastructure needed to split computation across multiple GPUs and nodes. This includes data parallelism (replicating the model, splitting the data), model parallelism (splitting the model across devices), pipeline parallelism (splitting by layer groups), and hybrid strategies that combine all three.
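The core idea behind data parallelism can be sketched in pure Python: each replica holds a full copy of the model, computes a gradient on its own shard of the batch, and an all-reduce averages the gradients so every replica applies the identical update. This is a toy illustration only; the model (y = w·x), `grad_mse`, and `all_reduce_mean` are hypothetical stand-ins for a real framework's backward pass and NCCL all-reduce.

```python
def grad_mse(w, shard):
    """Mean-squared-error gradient dL/dw for y_hat = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for an NCCL all-reduce: average gradients across replicas."""
    return sum(grads) / len(grads)

# Global batch, split evenly across two replicas (equal shard sizes make
# the averaged gradient identical to the single-device gradient).
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]

w = 0.5
local_grads = [grad_mse(w, s) for s in shards]  # computed in parallel
g = all_reduce_mean(local_grads)                # synchronization step
w_new = w - 0.1 * g                             # same update on every replica

# Sanity check: matches the gradient a single machine would compute.
single_device = grad_mse(0.5, batch)
assert abs(g - single_device) < 1e-12
```

Model and pipeline parallelism differ in what gets sharded (parameters and layers rather than data), but the same pattern of local compute plus collective communication applies.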
This appendix is organized around three pillars. The first is PySpark for large-scale text preprocessing, deduplication, and embedding generation. The second is Databricks as a unified lakehouse platform: workspace setup, Delta Lake storage, Unity Catalog governance, and the Mosaic AI suite for foundation model training, serving, and vector search. The third is Ray, a complementary framework-agnostic compute layer with Ray Train, Ray Serve, and Ray Data. Cross-cutting topics (feature stores, production pipelines, and observability) tie the platforms together into end-to-end ML workflows.
This appendix is for ML engineers and infrastructure teams working at scales that require multi-GPU or multi-node training, and for organizations that need managed platforms for end-to-end ML workflows, including data lakes, feature stores, and model registries.
The pretraining-at-scale concepts that motivate distributed training are covered in Chapter 6 (Pretraining and Scaling Laws). Inference optimization techniques that complement distributed serving are in Chapter 9 (Inference Optimization). For GPU hardware fundamentals, memory hierarchies, and compute budgeting, see Appendix G (Hardware and Compute).
Read Chapter 6 (Pretraining and Scaling Laws) to understand why large models require distributed training and how scaling laws govern compute allocation. Appendix G (Hardware and Compute) explains GPU memory, interconnects (NVLink, InfiniBand), and cost estimation, all of which inform parallelism strategy choices. Familiarity with PyTorch training loops and basic cluster/cloud concepts (SSH, job schedulers) is assumed.
Consult this appendix when your model or dataset no longer fits on a single GPU, when training time on one machine is prohibitive, or when your organization needs a managed platform for the full ML lifecycle. Choose Databricks when you need a unified data + ML platform with governance (Unity Catalog), collaborative notebooks, and managed Spark clusters. Choose Ray when you want framework-agnostic distributed compute that works across training, serving, and data processing. For lower-level parallelism within PyTorch, use DeepSpeed or FSDP directly. If you only need to serve (not train) models at scale, start with Appendix S (Inference Serving) instead.
Sections
- T.1 PySpark for LLM Data Pipelines
- T.2 Delta Lake and Lakehouse Architecture
- T.3 Databricks: Workspace, Notebooks, and Unity Catalog
- T.4 Databricks AI and Foundation Models
- T.5 Ray Train, Ray Serve, and Ray Data
- T.6 Feature Stores: Feast, Tecton, and Databricks Feature Engineering
- T.7 Production Data Pipelines and Model Serving at Scale