
"Three dimensions is two dimensions plus a lot of opinions about light."
Pixel, Splat-Curious AI Agent
Chapter 22 mapped the 2D vision world. This chapter goes one dimension up: NeRFs, Gaussian splatting, neural scene representations, and the LLM-driven prompting that lets you generate, edit, or relight a 3D scene from text.
3D Gaussian Splatting, NeRF, Stable Zero123, Trellis, 4D splats, and scene relighting.
Chapter Overview
3D generation moved from research demos into shipping products during 2024 and 2025. This chapter teaches the canonical primitives: 3D Gaussian Splatting fundamentals (math, training, COLMAP preprocessing), 4D dynamic-splat extensions, image-to-3D via multi-view diffusion (Zero123 and successors), direct 3D diffusion (Trellis, GaussianAnything, latent NeRF), and scene relighting plus 3D editing (IC-Light, NeRF-Editing, language-grounded manipulation).
3D Gaussian Splatting reshuffled the neural-scene stack in 18 months. This chapter gives the production-oriented picture as of 2026: which model to reach for, which workflows actually ship, and where the limits of reproducibility sit today.
Learning Objectives
- Explain the math of 3D Gaussian Splatting and the COLMAP preprocessing pipeline.
- Extend 3DGS to dynamic scenes using 4DGS, Deformable 3DGS, or Spacetime Gaussians.
- Apply multi-view diffusion (Zero123, MVDream) to image-to-3D reconstruction.
- Compare native 3D-diffusion approaches (Trellis, GaussianAnything) with multi-view diffusion pipelines.
- Design scene relighting and language-grounded 3D editing workflows.
Prerequisites
- Vision-language models from Chapter 22
- Basic 3D-graphics literacy (meshes, lighting, cameras) helps but is not strictly required
- Comfort with the modern multimodal API surface
Sections
- 23.1 3D Gaussian Splatting Fundamentals (Entry): Mathematical foundations, training, COLMAP preprocessing, and dynamic splat extensions.
- 23.2 4D & Dynamic Splats (Intermediate): Temporal extensions of 3D Gaussian Splatting: 4DGS, Dynamic 3D Gaussians, Deformable 3DGS, Spacetime Gaussians.
- 23.3 Image-to-3D: Stable Zero123 & Multi-View Diffusion (Intermediate): Zero123, Stable Zero123, MVDream, and the multi-view diffusion paradigm for single-image 3D reconstruction.
- 23.4 Direct 3D Diffusion: Trellis & Structured Latents (Advanced): Trellis, GaussianAnything, Latent NeRF, and the wave of native 3D generative models.
- 23.5 Scene Relighting & 3D Editing (Advanced): Inverse rendering, IC-Light, NeRF-Editing, and language-grounded 3D scene manipulation.
What's Next?
Next up is Chapter 24, Vision-Language-Action Models. Generating a 3D scene is one thing; acting in one is another. Chapter 24 covers the VLA frontier: RT-2, OpenVLA, pi-0, the action-tokenization trick that lets a transformer output robot trajectories, and cross-embodiment transfer (one model controlling many bodies). This is where multimodality stops being a perception problem and becomes a control problem.