Section 23.1: 3D Gaussian Splatting Fundamentals

"NeRF gave us photoreal 3D at the cost of forty minutes per frame. Gaussians gave it back at 120 fps."
Pixel, Splat-Curious AI Agent

Big Picture

3D Gaussian Splatting (3DGS) replaced Neural Radiance Fields (NeRFs) as the default representation for photoreal scene reconstruction in late 2023, and by 2026 it powers nearly every shipping product in volumetric capture, AR scene scanning, and real-time 3D synthesis. A scene becomes a cloud of millions of anisotropic Gaussians, each with position, covariance, opacity, and view-dependent color expressed as spherical harmonics. Rendering is a sort-and-blend operation on the GPU rasterizer rather than expensive ray marching, which lets the same scene render at over 100 fps where a comparable-quality NeRF needed seconds per frame.

Why this lives in an LLM and agents book. Modern multimodal LLMs and embodied agents increasingly need to see, navigate, and reason about 3D space. Vision-language-action models (Chapter 24) operate on splat-derived scene maps; spatial-reasoning agents query 3DGS scenes the way text agents query a vector store; world-model agents (Section 40.4) and robotics policies (Chapter 24) use NeRF or 3DGS as their differentiable simulator. The optimization itself shares its bones with the gradient descent of Chapter 1, but the parameters are 3D Gaussians instead of weights of an MLP. This section walks through the math, the training loop, COLMAP preprocessing, and the first extensions toward dynamic and language-conditioned splats.

Prerequisites

This section assumes comfort with camera matrices and basic linear algebra (covariance, eigendecomposition). Familiarity with PyTorch gradient flow from Section 0.3 helps when reading the training loop.

**Figure 23.1.1**: Side-by-side comparison of a NeRF MLP querying a continuous volumetric function and a 3D Gaussian Splatting scene rasterized as an explicit cloud of ellipsoids. NeRF queries an implicit MLP for every point along every ray (millions of forward passes per image). 3D Gaussian Splatting represents the scene as an explicit, sortable cloud of anisotropic Gaussians that the GPU rasterizer can paint to the framebuffer in a single pass. The shift from an implicit to an explicit representation is what unlocks real-time rendering.

23.1.1 From NeRF to Splats: The Representation Shift

Fun Fact

3D Gaussian Splatting won the SIGGRAPH 2023 best-paper award and within six months had spawned more arXiv follow-ups than NeRF managed in two years. The reason was simple: NeRF rendering took seconds per frame, while splats rendered at 100+ FPS on a laptop GPU. The graphics community had spent decades arguing that machine learning was too slow for real-time rendering, and then a paper using a 200-year-old idea (Gaussian blobs, attributed to Carl Friedrich Gauss himself) quietly walked in and ended the argument.

A Neural Radiance Field, introduced by Mildenhall et al. (2020), trains an MLP to map a 5D coordinate (3D position plus 2D viewing direction) to a 4D output (RGB color plus volumetric density). Rendering a single pixel requires sampling dozens to hundreds of points along the camera ray, evaluating the MLP at each, and integrating the radiance with quadrature. Quality is excellent but everything is slow: an unaccelerated NeRF takes many seconds per 800x800 frame and hours to days to train.

3D Gaussian Splatting, introduced by Kerbl et al. (2023) at SIGGRAPH 2023, throws away the MLP. The scene is represented by a set of explicit 3D Gaussians, each parameterized by:

A 3D position $\mu \in \mathbb{R}^3$ (the mean).
A 3x3 covariance matrix $\Sigma$, factored as $\Sigma = R S S^\top R^\top$ where $R$ is a rotation (stored as a quaternion) and $S$ is a diagonal scale.
An opacity scalar $\alpha \in [0, 1]$.
A view-dependent color, typically encoded as the coefficients of low-order spherical harmonics (SH) of degree 0 to 3.

The unnormalized density of one Gaussian at point $x$ is the familiar form $G(x) = \exp\!\big({-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}\big)$; the normalization constant is omitted, and the opacity $\alpha$ sets the peak contribution. To render, each 3D Gaussian is projected to a 2D image-space Gaussian using a local affine approximation of the perspective projection, splatted onto the framebuffer, and alpha-composited front-to-back after sorting by depth.
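The factored covariance and the Gaussian falloff can be checked numerically. The following is a minimal sketch in plain Python, not the reference CUDA code; the function names (`quat_to_rotmat`, `covariance`, `gaussian_value`) and the `(w, x, y, z)` quaternion ordering are assumptions made for illustration.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, scale):
    """Sigma = R S S^T R^T with S = diag(scale)."""
    R = quat_to_rotmat(q)
    # M = R S: column k of R is multiplied by scale[k].
    M = [[R[i][k] * scale[k] for k in range(3)] for i in range(3)]
    # Sigma = M M^T is symmetric positive semi-definite by construction.
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def gaussian_value(x, mu, q, scale):
    """Unnormalized G(x) = exp(-0.5 d^T Sigma^-1 d), using Sigma^-1 = R S^-2 R^T."""
    R = quat_to_rotmat(q)
    d = [x[i] - mu[i] for i in range(3)]
    # Rotate the offset into the Gaussian's local frame (R^T d), then divide by scale.
    local = [sum(R[i][k] * d[i] for i in range(3)) / scale[k] for k in range(3)]
    return math.exp(-0.5 * sum(v * v for v in local))

if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    print(covariance(identity, (1.0, 2.0, 3.0)))
    print(gaussian_value((0.0, 2.0, 0.0), (0.0, 0.0, 0.0), identity, (1.0, 2.0, 3.0)))
```

Storing a quaternion and a scale vector instead of the raw matrix means any values the optimizer produces still yield a valid (symmetric, positive semi-definite) covariance.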

Key Insight: Why sort-and-blend beats ray marching

A NeRF integrates radiance per ray. A 3DGS rasterizer integrates radiance per pixel by accumulating contributions from sorted Gaussians until opacity saturates. Both implement the same volume rendering equation, but 3DGS pays per Gaussian (a few million), while NeRF pays per sample (hundreds of millions per frame). When the geometry is sparse, which is true of most real scenes, the explicit representation wins by two orders of magnitude.

23.1.2 The Differentiable Rasterizer

For a pixel at image coordinate $u$, the rendered color is an alpha-composite over the depth-sorted Gaussians that overlap the pixel:

$$ C(u) = \sum_{i \in N} c_i\, \alpha_i'(u) \prod_{j=1}^{i-1} \big(1 - \alpha_j'(u)\big) $$

where $c_i$ is the SH-evaluated color of Gaussian $i$ in the viewing direction, and $\alpha_i'(u) = \alpha_i \cdot \exp\!\big({-\frac{1}{2}(u-\mu_i')^\top \Sigma_i'^{-1}(u-\mu_i')}\big)$ is the 2D Gaussian footprint scaled by the learned opacity. The projection $\Sigma_i'$ is computed by the Jacobian of the perspective transform; the original Kerbl et al. paper derives this in closed form and the CUDA kernel evaluates it in a few tens of FLOPs per pixel-Gaussian pair.
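The compositing sum can be read as a short loop over depth-sorted splats. The sketch below is a plain-Python illustration for a single pixel, assuming each splat is a dictionary with hypothetical keys `mean2d`, `inv_cov2d`, `opacity`, and `color`; the alpha clamp and the early-exit thresholds are illustrative choices in the spirit of tile-based rasterizers, not values taken from this section.

```python
import math

def gaussian_alpha(u, mean2d, inv_cov2d, opacity):
    """alpha'_i(u) = opacity * exp(-0.5 d^T Sigma'^-1 d) for a 2D footprint."""
    dx, dy = u[0] - mean2d[0], u[1] - mean2d[1]
    (a, b), (_, c) = inv_cov2d  # symmetric 2x2 inverse covariance
    power = -0.5 * (a * dx * dx + 2 * b * dx * dy + c * dy * dy)
    return opacity * math.exp(power)

def composite_pixel(u, splats, min_transmittance=1e-4):
    """Front-to-back alpha compositing over splats already sorted near-to-far."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha'_j) over closer splats
    for s in splats:
        alpha = min(0.99, gaussian_alpha(u, s["mean2d"], s["inv_cov2d"], s["opacity"]))
        if alpha < 1.0 / 255.0:
            continue  # contribution too small to matter
        for ch in range(3):
            color[ch] += s["color"][ch] * alpha * transmittance
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # pixel is effectively opaque; farther splats are hidden
    return color, transmittance

if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    splats = [
        {"mean2d": (5.0, 5.0), "inv_cov2d": eye, "opacity": 0.5, "color": (1.0, 0.0, 0.0)},
        {"mean2d": (5.0, 5.0), "inv_cov2d": eye, "opacity": 0.5, "color": (0.0, 0.0, 1.0)},
    ]
    print(composite_pixel((5.0, 5.0), splats))
```

The early exit is what "accumulating until opacity saturates" means in practice: once transmittance is near zero, no farther Gaussian can change the pixel.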

Because every operation (projection, exponential, alpha compositing) is differentiable, gradients flow from the rendered pixel back to every Gaussian parameter. This is the same trick that powers neural rendering in general; the novelty is that the primitive is an explicit Gaussian instead of a queried MLP.
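The projection step is small enough to write out. In the EWA-splatting formulation that 3DGS adopts, the image-space covariance is $\Sigma' = J W \Sigma W^\top J^\top$, where $W$ is the rotation part of the world-to-camera transform and $J$ is the Jacobian of the perspective projection evaluated at the Gaussian's camera-space mean. The sketch below assumes a pinhole camera with focal lengths `fx` and `fy`; the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def project_covariance(cov3d, mean_cam, W, fx, fy):
    """Return the 2x2 image-space covariance J W Sigma W^T J^T.

    cov3d:    3x3 world-space covariance
    mean_cam: Gaussian mean in camera coordinates (tx, ty, tz), with tz > 0
    W:        3x3 rotation part of the world-to-camera transform
    """
    tx, ty, tz = mean_cam
    # Jacobian of (fx * x / z, fy * y / z) with respect to (x, y, z).
    J = [
        [fx / tz, 0.0, -fx * tx / (tz * tz)],
        [0.0, fy / tz, -fy * ty / (tz * tz)],
    ]
    T = matmul(J, W)  # 2x3
    return matmul(matmul(T, cov3d), transpose(T))

if __name__ == "__main__":
    identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    isotropic = [[0.01 if i == j else 0.0 for j in range(3)] for i in range(3)]
    print(project_covariance(isotropic, (0.0, 0.0, 2.0), identity3, 100.0, 100.0))
```

An isotropic Gaussian centered on the optical axis projects to a circular footprint whose variance scales with $(f / t_z)^2$, which matches the intuition that distant splats cover fewer pixels.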

Parameter	Shape per Gaussian	Total for 1M Gaussians	Notes
Position $\mu$	3 floats	12 MB	FP32 by default
Rotation (quaternion)	4 floats	16 MB	Normalized after each step
Scale (log)	3 floats	12 MB	Exponentiated to enforce positivity
Opacity (logit)	1 float	4 MB	Sigmoid to $[0, 1]$
SH degree 3	48 floats (3 channels x 16 coeffs)	192 MB	Largest term; SH degree 0 is just RGB

Table 23.1.2: Parameter budget for a 1-million-Gaussian scene, 236 MB in total at SH degree 3. The SH coefficients dominate memory; many production pipelines (e.g., LumaAI) ship SH degree 1 or 2, which cuts parameter memory by roughly 60% or 35% respectively, with marginal quality loss.
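The table's arithmetic generalizes to any Gaussian count and SH degree. A minimal sketch, assuming FP32 storage and counting only the parameters themselves (optimizer state such as Adam moments would add to this during training); the function names are illustrative.

```python
def floats_per_gaussian(sh_degree):
    """Position (3) + quaternion (4) + scale (3) + opacity (1) + SH color."""
    sh_coeffs_per_channel = (sh_degree + 1) ** 2
    return 3 + 4 + 3 + 1 + 3 * sh_coeffs_per_channel

def scene_megabytes(num_gaussians, sh_degree, bytes_per_float=4):
    """Parameter memory in MB (1 MB = 1e6 bytes), excluding optimizer state."""
    return num_gaussians * floats_per_gaussian(sh_degree) * bytes_per_float / 1e6

if __name__ == "__main__":
    for degree in range(4):
        print(degree, floats_per_gaussian(degree), scene_megabytes(1_000_000, degree))
```

For one million Gaussians this gives 56 MB at SH degree 0, 92 MB at degree 1, 152 MB at degree 2, and 236 MB at degree 3.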

23.1.3 The Training Loop: Initialization, Loss, Densification

Training a 3DGS scene starts from a sparse point cloud, evolves it through gradient descent, and densifies or prunes Gaussians along the way. The pseudocode below captures the essential loop from the reference implementation.

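# 3DGS training loop, simplified. The reference implementation lives at
# https://github.com/graphdeco-inria/gaussian-splatting
# GaussianModel, dataset, colmap_points, l1, and ssim are placeholders for
# your own model container, data loader, SfM points, and loss functions.
import torch
from gsplat import rasterization

lam = 0.2  # weight of the SSIM term in the photometric loss

gaussians = GaussianModel.from_point_cloud(colmap_points)  # ~100k init
# The reference code uses per-parameter learning rates; one rate keeps this short.
optimizer = torch.optim.Adam(gaussians.parameters(), lr=1.6e-4)

for step in range(30_000):
    cam, gt_image = dataset.sample_view()          # random training view
    # gsplat takes batched cameras and activated values: scales after exp,
    # opacities after sigmoid, colors as SH coefficients of shape (N, 16, 3).
    render_colors, render_alphas, meta = rasterization(
        means=gaussians.means, quats=gaussians.quats, scales=gaussians.scales,
        opacities=gaussians.opacities, colors=gaussians.colors,
        viewmats=cam.world_to_cam[None], Ks=cam.intrinsics[None],
        width=cam.W, height=cam.H, sh_degree=3,
    )
    render = render_colors[0]                      # (H, W, 3) for the single camera
    # L1 + SSIM is the standard 3DGS photometric loss
    loss = (1.0 - lam) * l1(render, gt_image) + lam * (1.0 - ssim(render, gt_image))
    loss.backward()
    optimizer.step()

    # Adaptive density control: split big Gaussians with high grad,
    # clone small ones, prune low-opacity. It starts after a 500-step warm-up
    # and must also resize the optimizer state for added or removed Gaussians.
    if 500 < step < 15_000 and step % 100 == 0:
        gaussians.densify_and_prune(
            grad_threshold=0.0002,
            min_opacity=0.005,
            max_screen_size=20,
        )

    # Clear gradients only after density control has read them.
    optimizer.zero_grad(set_to_none=True)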

Code Fragment 23.1.1a: A simplified 3DGS training loop. Three things matter: the L1 + SSIM photometric loss (with $\lambda = 0.2$), the differentiable rasterization step from gsplat (the Apache-2.0-licensed library from the Nerfstudio project), and the adaptive density control that splits, clones, or prunes Gaussians based on their accumulated screen-space gradient.
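To make the loss weighting concrete, the sketch below evaluates $(1-\lambda)\,L_1 + \lambda\,(1 - \text{SSIM})$ on flat lists of pixel intensities in $[0, 1]$. It is deliberately simplified: SSIM is computed once from global image statistics rather than over the sliding Gaussian windows that practical implementations use, so treat `global_ssim` as an illustrative stand-in and not the trainer's actual metric.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized lists of intensities."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def global_ssim(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics (assumption: a single global window)."""
    n = len(pred)
    mu_p, mu_t = sum(pred) / n, sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator

def photometric_loss(pred, target, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM), the weighting used in this section."""
    return (1.0 - lam) * l1_loss(pred, target) + lam * (1.0 - global_ssim(pred, target))

if __name__ == "__main__":
    target = [0.1, 0.5, 0.9, 0.3]
    blurry = [0.4, 0.45, 0.5, 0.45]  # similar mean, most of the contrast lost
    print(photometric_loss(target, target))
    print(photometric_loss(blurry, target))
```

The L1 term penalizes per-pixel color error, while the SSIM term penalizes lost contrast and structure, which is why a blurry render that gets the average color right is still penalized.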

Note: Why densification, not just optimization

If you simply ran Adam on the initial point cloud you would get a blurry mess: there are not enough Gaussians to represent fine geometry. The adaptive density control monitors the average view-space positional gradient of every Gaussian. A high gradient indicates the Gaussian is straddling a feature it cannot represent: if it is small, the algorithm clones it (a same-size copy is nudged along the positional gradient); if it is large, the algorithm splits it (it is replaced by two smaller Gaussians whose positions are sampled from the original). Opacity below a threshold triggers pruning. The result is a roughly self-organizing point density that matches scene complexity.
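The decision rule itself fits in a few lines. This is a simplified per-Gaussian sketch, assuming the accumulated gradient and the largest scale are already available as plain floats; `density_action` and its `percent_dense` size cutoff (a fraction of the scene extent) are illustrative names and defaults, and a real trainer applies the same logic in batch over tensors.

```python
def density_action(avg_viewspace_grad, max_scale, opacity, scene_extent,
                   grad_threshold=0.0002, percent_dense=0.01, min_opacity=0.005):
    """Decide what adaptive density control does with one Gaussian.

    Returns one of "prune", "keep", "split", "clone".
    """
    if opacity < min_opacity:
        return "prune"   # nearly transparent: contributes nothing
    if avg_viewspace_grad < grad_threshold:
        return "keep"    # already well fit; leave it to the optimizer
    if max_scale > percent_dense * scene_extent:
        return "split"   # large and struggling: over-reconstruction
    return "clone"       # small and struggling: under-reconstruction

if __name__ == "__main__":
    print(density_action(0.001, 0.5, 0.8, scene_extent=10.0))
    print(density_action(0.001, 0.01, 0.8, scene_extent=10.0))
    print(density_action(0.00001, 0.5, 0.8, scene_extent=10.0))
    print(density_action(0.001, 0.5, 0.001, scene_extent=10.0))
```

Tying the size cutoff to the scene extent keeps the same thresholds usable for both a tabletop object and a street-scale capture.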

23.1.4 COLMAP and the Camera-Pose Bootstrap

Every 3DGS pipeline starts with a structure-from-motion (SfM) pass that recovers camera poses and a sparse point cloud from input images. The de facto tool is COLMAP by Schoenberger and Frahm. The pipeline is:

Feature detection with SIFT on each image.
Pairwise matching (exhaustive for small captures, vocab-tree for larger ones).
Incremental SfM that triangulates a sparse 3D point cloud while jointly optimizing camera poses through bundle adjustment.
Optional dense MVS, though 3DGS does not require dense depth.

# Minimal COLMAP CLI invocation for a 3DGS capture.
# Assumes ~100 photos in ./input/ shot from varied angles.
colmap feature_extractor \
    --database_path scene.db \
    --image_path ./input/ \
    --ImageReader.single_camera 1 \
    --ImageReader.camera_model OPENCV

colmap exhaustive_matcher --database_path scene.db

mkdir -p sparse
colmap mapper \
    --database_path scene.db \
    --image_path ./input/ \
    --output_path ./sparse

# The mapper already writes cameras.bin, images.bin, and points3D.bin to
# ./sparse/0. The reference 3DGS loader additionally expects undistorted
# PINHOLE cameras, so undistort the images and the model before training.
colmap image_undistorter \
    --image_path ./input/ \
    --input_path ./sparse/0 \
    --output_path ./undistorted \
    --output_type COLMAP

Code Fragment 23.1.2a: COLMAP CLI pipeline that produces the camera intrinsics, extrinsics, and seed point cloud expected by every reference 3DGS trainer. On a 100-image capture this runs in 5 to 30 minutes on CPU; GPU SIFT and vocab-tree matching cut that to a couple of minutes.

Warning: COLMAP failure modes

The single biggest source of bad 3DGS reconstructions is bad camera poses. COLMAP can silently lose track of a chunk of your input if you have low texture (white walls), motion blur, rolling-shutter artifacts, or images shot from too similar a viewpoint. Always inspect sparse/0 before training: if half your images are missing, your splats will collapse or float. Wrappers such as Nerfstudio's ns-process-data can swap in hloc-based learned feature matching, which is often more robust than SIFT on hard captures.
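One way to automate that inspection is to compare the registered images against the input folder. The sketch below assumes the sparse model has been exported to COLMAP's text format (for example with `colmap model_converter --output_type TXT`), where `images.txt` stores two lines per image: a pose line ending in the file name, then a line of 2D points that may be empty. The helper names are hypothetical.

```python
def registered_image_names(images_txt):
    """Return the image names found in the contents of a COLMAP images.txt."""
    # Keep blank lines: an image with no 2D points still occupies two lines.
    lines = [ln for ln in images_txt.splitlines() if not ln.startswith("#")]
    names = []
    for pose_line in lines[0::2]:
        # IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME
        fields = pose_line.split()
        if len(fields) >= 10:
            names.append(" ".join(fields[9:]))
    return names

def missing_images(images_txt, input_names):
    """List input images that COLMAP failed to register, sorted by name."""
    registered = set(registered_image_names(images_txt))
    return sorted(set(input_names) - registered)

if __name__ == "__main__":
    sample = "\n".join([
        "# Image list with two lines of data per image:",
        "#   IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME",
        "#   POINTS2D[] as (X, Y, POINT3D_ID)",
        "1 1 0 0 0 0 0 0 1 img_001.jpg",
        "10.0 20.0 5 30.0 40.0 -1",
        "2 1 0 0 0 0.1 0 0 1 img_003.jpg",
        "",
    ])
    print(missing_images(sample, ["img_001.jpg", "img_002.jpg", "img_003.jpg"]))
```

A long list of missing images is a signal to recapture or rematch before spending GPU time on training.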

23.1.5 Rendering, Viewers, and Web Export

A trained 3DGS scene is a single PLY file with one row per Gaussian. The community has converged on the .splat / .ksplat formats for compressed web delivery. Real-time viewers run in WebGL: see antimatter15/splat for a minimal viewer, PlayCanvas Supersplat for an editor, and Nerfstudio's gsplat-based viewer for a production-quality experience. Browser rendering of a 1M-Gaussian scene at 1080p typically clears 60 fps on a 2022 MacBook.

For production deployment, the LumaAI Luma Web Library wraps splat playback in a React component; Polycam, KIRI Engine, and Postshot offer mobile-to-cloud capture-train-export pipelines.

23.1.6 Dynamic Splats: A Preview of Section 23.2

Static 3DGS captures a frozen scene. By 2024 the field had moved to dynamic splats where each Gaussian is also a function of time. The dominant approaches are:

4DGS (Wu et al., 2024) keeps a canonical set of Gaussians and queries a shared deformation field (a HexPlane-style spatio-temporal feature grid decoded by a small MLP) for each Gaussian's per-timestep position, rotation, and scale deltas. Per-Gaussian memory grows linearly with Gaussian count but is independent of clip length.
Dynamic 3D Gaussians (Luiten et al., 2024) explicitly tracks each Gaussian's position and rotation over time and adds local-rigidity, local-rotation-similarity, and long-term isometry regularizers that prevent it from drifting through scene structure.
Deformable 3DGS (Yang et al., 2024) uses a shared deformation field MLP conditioned on canonical position and time, similar to D-NeRF.

Section 23.2 picks this up in detail; here it suffices to note that the static 3DGS infrastructure (rasterizer, optimizer, densifier) transfers cleanly to the dynamic case with relatively small changes to the per-Gaussian state.
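The "small changes to the per-Gaussian state" can be pictured as a canonical Gaussian plus a time-conditioned offset. The sketch below is a toy illustration only: `toy_deformation` is a hand-written stand-in for whatever learned deformation field a dynamic method uses, and the dictionary layout is an assumption.

```python
import math

def toy_deformation(position, t):
    """Hypothetical stand-in for a learned deformation field.

    Returns a position delta for a Gaussian at `position` and time t in [0, 1].
    A real method would query a network or feature grid here.
    """
    return (0.1 * math.sin(2 * math.pi * t), 0.0, 0.0)

def gaussian_at_time(canonical, t, deform=toy_deformation):
    """Return a copy of the canonical Gaussian moved to its position at time t."""
    dx, dy, dz = deform(canonical["mean"], t)
    moved = dict(canonical)  # scale, rotation, opacity, and color stay fixed here
    mx, my, mz = canonical["mean"]
    moved["mean"] = (mx + dx, my + dy, mz + dz)
    return moved

if __name__ == "__main__":
    g = {"mean": (1.0, 2.0, 3.0), "scale": (0.1, 0.1, 0.1), "opacity": 0.9}
    for t in (0.0, 0.25, 0.5):
        print(t, gaussian_at_time(g, t)["mean"])
```

Because the output at each timestep is an ordinary static Gaussian, the same rasterizer and loss can be reused unchanged, with gradients also flowing into the deformation parameters.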

**Figure 23.1.2b**: The dynamic-splat ecosystem in 18 months, shown as a timeline: 3DGS (2023), 4DGS (2024), Dynamic 3D Gaussians (2024), 4D-Rotor and Spacetime Gaussians (2024-2025). Each variant trades off memory, per-frame quality, and supported motion magnitude. Section 23.2 compares them quantitatively.

Key Insight

3D Gaussian Splatting is the 2024-2026 default for photoreal 3D reconstruction. An explicit cloud of millions of anisotropic Gaussians with view-dependent SH color is differentiably rasterized by a CUDA kernel that runs at 100+ fps. Training is end-to-end gradient descent on an L1 + SSIM photometric loss with adaptive density control. The only non-trivial preprocessing is COLMAP-derived camera poses. Everything in the rest of this chapter (dynamic splats, image-to-3D, scene relighting) builds on this primitive.

Self-Check

Q1: Why is the alpha-compositing equation in 3DGS mathematically equivalent to NeRF volume rendering, yet 100x faster in practice?

Show Answer

Both implement the same volume-rendering integral $C = \sum_i T_i \alpha_i c_i$ with $T_i = \prod_{j<i} (1 - \alpha_j)$. The difference is what you pay per ray: NeRF samples 64-256 query points along each ray and runs an MLP at every one (hundreds of millions of MLP forward passes per frame), while 3DGS rasterizes a sparse set of explicit primitives, sorts by depth once per tile, and accumulates contributions on the GPU. Sparsity of real scenes (most rays hit empty space) is what makes the explicit form win by two orders of magnitude.

Q2: What goes wrong if you initialize 3DGS from a uniform random point cloud instead of COLMAP's sparse SfM points?

Show Answer

Quality drops and convergence slows, and hard scenes can fail outright. COLMAP points lie on real scene surfaces, so the Gaussians start near the right positions and the gradient signal only has to refine them. Uniformly random Gaussians sit mostly in empty space; the gradient pulls them toward surfaces but the path is long and unstable, and the adaptive-density controller will spend most of its budget pruning the bad initializations rather than building structure. Practical workarounds (random + heavy regularization, or learned monocular-depth initializers) exist but typically lag a proper SfM cloud in PSNR.

Q3: The SH degree-3 coefficients dominate memory. When can you safely drop to SH degree 0 (constant RGB per Gaussian)?

Show Answer

When the scene has little view-dependent appearance: matte indoor scenes under diffuse lighting, foreground objects with no specular highlights, or any setting where you tolerate slightly muted reflections in exchange for a 16x reduction in per-Gaussian color memory. SH-3 (16 coefficients per channel) is what lets glossy surfaces and metallics shimmer; SH-0 (3 floats per Gaussian) is enough for matte. Many shipping AR pipelines start at SH-0 and only upgrade to SH-2 or SH-3 if the qualitative review demands it.

Q4: Why does the adaptive density controller need both a "clone" and a "split" operation? What does each fix that the other does not?

Show Answer

Clone duplicates an existing Gaussian in place to fix under-reconstruction: large gradient magnitude on a small Gaussian means the surface is finer-grained than the primitive can represent, so we add a second Gaussian and let optimization tease them apart. Split replaces one large Gaussian with two smaller children (scaled down) to fix over-reconstruction: large gradient magnitude on a large Gaussian means the primitive is straddling two surface features, so we cut it in half. Clone adds capacity where detail is missing; split refines where geometry is conflated. You need both because the two failure modes have opposite causes.

What Comes Next

Section 23.2: 4D and Dynamic Splats extends the static formulation to time. We will look at how 4DGS, Dynamic 3D Gaussians, and Spacetime Gaussians handle moving cameras, articulated humans, and the long-tail problems of motion priors and temporal regularization.

Further Reading

Foundational

Kerbl, B., Kopanas, G., Leimkuhler, T., & Drettakis, G. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH). arXiv:2308.04079

Mildenhall, B., Srinivasan, P. P., Tancik, M., et al. (2020). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV. arXiv:2003.08934

Implementations (2024 to 2026)

Ye, V., Li, R., Kerr, J., et al. (2024). "gsplat: An Open-Source Library for Gaussian Splatting." Nerfstudio. github.com/nerfstudio-project/gsplat

Tancik, M., Weber, E., Ng, E., et al. (2023). "Nerfstudio: A Modular Framework for Neural Radiance Field Development." ACM SIGGRAPH. arXiv:2302.04264

Camera-Pose Preprocessing

Schoenberger, J. L., & Frahm, J. M. (2016). "Structure-from-Motion Revisited." CVPR. (COLMAP system paper.) demuc.de/papers/schoenberger2016sfm.pdf

Surveys (2024)

Chen, G., & Wang, W. (2024). "A Survey on 3D Gaussian Splatting." arXiv. arXiv:2401.03890

Fei, B., Xu, J., Zhang, R., et al. (2024). "3D Gaussian Splatting as New Era: A Survey." arXiv. arXiv:2402.07181