Chapter 23: 3D Generation and Neural Scenes

Chapter opener illustration: 3D Generation and Neural Scenes.

"Three dimensions is two dimensions plus a lot of opinions about light."
Pixel, Splat-Curious AI Agent

Looking Back

Chapter 22 mapped the 2D vision world. This chapter goes one dimension up: NeRFs, Gaussian splatting, neural scene representations, and the LLM-driven prompting that lets you generate, edit, or relight a 3D scene from text.

Big Picture

This chapter covers 3D Gaussian Splatting, NeRF, Stable Zero123, Trellis, 4D splats, and scene relighting.

Chapter Overview

3D generation moved from research demos into shipping products during 2024 and 2025. This chapter teaches the canonical primitives: 3D Gaussian Splatting fundamentals (math, training, COLMAP preprocessing), 4D dynamic-splat extensions, image-to-3D via multi-view diffusion (Zero123 and successors), direct 3D diffusion (Trellis, GaussianAnything, latent NeRF), and scene relighting plus 3D editing (IC-Light, NeRF-Editing, language-grounded manipulation).

3D Gaussian Splatting reshuffled the neural-scene stack in 18 months. This chapter gives a production-oriented picture as of 2026: which model to reach for, which workflows actually ship, and where the limits of reproducibility currently lie.

Learning Objectives

Explain the math of 3D Gaussian Splatting and the COLMAP preprocessing pipeline.
Extend 3DGS to dynamic scenes using 4DGS, Deformable 3DGS, or Spacetime Gaussians.
Apply multi-view diffusion (Zero123, MVDream) to image-to-3D reconstruction.
Compare native 3D-diffusion approaches (Trellis, GaussianAnything) with multi-view diffusion pipelines.
Design scene relighting and language-grounded 3D editing workflows.

Prerequisites

Vision-language models from Chapter 22
Basic 3D-graphics literacy (meshes, lighting, cameras) helps but is not strictly required
Comfort with the modern multimodal API surface

Sections

What's Next?

Next: Chapter 24: Vision-Language-Action Models. Generating a 3D scene is one thing; acting in one is another. Chapter 24 covers the VLA frontier: RT-2, OpenVLA, pi-0, the action-tokenization trick that lets a transformer output robot trajectories, and cross-embodiment transfer (one model controlling many bodies). This is where multimodality stops being a perception problem and becomes a control problem.