Section 23.4: Direct 3D Diffusion , Trellis & Structured Latents

"Stop rendering 2D shadows of 3D things. Diffuse the 3D thing."
Pixel, Trellis-Trained AI Agent

Big Picture

Section 23.3 covered the multi-view-then-lift pipelines. This section covers their replacement: native 3D latent diffusion. Instead of asking a 2D diffusion model to imagine novel views and then lifting those views into 3D, you train a diffusion model that operates directly in a 3D latent space. Microsoft Research's Trellis (Xiang et al., 2024) is the canonical example: it diffuses a sparse, structured 3D latent that can be decoded into a Gaussian splat, a radiance field, or a mesh. GaussianAnything, LN3Diff, and 3DTopia complete the family. The payoff is faster sampling (no per-view denoising), better multi-view consistency (consistency is built into the representation), and a unified output that downstream graphics tools can ingest. For an LLM and agent stack, native-3D diffusion matters because multimodal LLMs increasingly call 3D generation models as tools: a single agent turn might ask GPT-4 to describe a scene, then call Trellis to produce the geometry, then feed the result into a vision-language model that verifies the spatial layout. The faster and more consistent 3D generation becomes, the more directly an LLM-driven agent can compose 3D content within a single multi-step plan.

Prerequisites

This section assumes the diffusion-model fundamentals from Section 19.7, the latent-space autoencoder intuition from Section 19.6, and the 3DGS representations from Section 23.1.

Diagram contrasting multi-view-then-lift pipelines with direct 3D latent diffusion: the latter samples a 3D latent volume and decodes to Gaussians, NeRF, or mesh — **Figure 23.4.1**: Direct 3D diffusion samples in a 3D latent space and decodes once. Multi-view-then-lift pipelines sample $N$ views and optimize 3D parameters to match, which is both more expensive and more failure-prone.

23.4.1 The Case for Native 3D Latents

Fun Fact

Microsoft's Trellis (2024) was released alongside a public demo where users could text-prompt a 3D object and get a downloadable Gaussian splat in roughly 10 seconds. The demo crashed within hours of launch because the queue depth blew past every capacity assumption; the model itself was solid, but the inference fleet was sized for a 1980s academic demo and met a 2024 viral-launch reality.

The deep reason direct 3D diffusion works is the same as the deep reason latent diffusion (Section 31.1) works for images: most of the bits in a raw 3D representation are redundant. A 3D Gaussian splat with one million Gaussians takes ~250 MB raw, but a learned encoder can compress most objects to a few tens of thousands of floats with no perceptual loss. Once you have such a 3D autoencoder, you can train a latent diffusion model that operates on the compressed representation. Inference is a single (latent) diffusion run plus a decoder pass, no multi-view rendering required.

The challenges are 3D-specific:

Choice of representation: occupancy grid, triplane, sparse voxel, or point cloud. Each has different memory and inductive bias.
Variable-shape inputs: meshes have different vertex counts, point clouds have different point counts. The latent has to be regular.
Decoder versatility: production users want both Gaussian splats (for AR) and meshes (for downstream DCC tools like Blender or Unreal). The decoder must support both.

23.4.2 Trellis: Structured 3D Latents (Microsoft Research, 2024)

Trellis tackles all three challenges with a structured latent design. Each 3D object is encoded as a sparse voxel grid of feature vectors at multi-scale resolution. Empty voxels are dropped (sparsity); non-empty voxels carry a high-dimensional feature. A transformer-based decoder maps these features to whatever output representation you need.

The training stack has three components:

3D VAE: encodes an object as a sparse latent voxel grid (typically 64³ sparse cells with ~1k non-empty cells per object). The encoder takes posed multi-view renders as input; the decoder produces either Gaussians, radiance fields, or signed distance fields for mesh extraction.
Sparse 3D diffusion transformer (DiT-3D): a diffusion model with sparse 3D attention that operates on the structured voxel latents. Trained on the full Objaverse-XL + curated Bemis collection (~500k objects).
Multi-decoder heads: independent decoders for Gaussian splat, radiance field, and SDF outputs. All share the same latent backbone.

# Trellis inference via the released pipeline (Microsoft/TRELLIS).
# A single image yields three different 3D representations.
from trellis.pipelines import TrellisImageTo3DPipeline
from PIL import Image

pipe = TrellisImageTo3DPipeline.from_pretrained(
    "JeffreyXiang/TRELLIS-image-large"
)
pipe.cuda()

image = Image.open("sneaker.png").convert("RGBA")
outputs = pipe.run(
    image,
    seed=42,
    formats=["gaussian", "radiance_field", "mesh"],
    sparse_structure_sampler_params={
        "steps": 12,
        "cfg_strength": 7.5,
    },
    slat_sampler_params={
        "steps": 12,
        "cfg_strength": 3.0,
    },
)

# Three representations from one inference run
outputs["gaussian"][0].save_ply("sneaker.ply")
outputs["mesh"][0].export("sneaker.glb")
# Radiance field can be rendered with the bundled viewer

Code Fragment 23.4.1a: Trellis runs in two diffusion stages: first a coarse sparse-structure sampler that decides which voxels are occupied, then a structured-latent (SLAT) sampler that fills those voxels with features. Total runtime on an A100 is ~10 seconds for all three output formats.

Key Insight: Two-stage diffusion mirrors a cascaded VAE

The two-stage Trellis sampler (occupancy then features) is closely related to Section 31.1's discussion of cascaded diffusion. The first stage decides the coarse shape (where the object exists), the second fills in the details. This factorization gives Trellis its speed advantage: occupancy is a cheap binary problem at 32³ resolution; features only need to be computed where occupancy is non-zero.

23.4.3 GaussianAnything: Latent Gaussian Diffusion (2024)

GaussianAnything (Lan et al., 2024) takes a different cut. Instead of voxel features, the latent is a fixed-size set of Gaussian primitives: 4096 Gaussians whose parameters are themselves diffused. The encoder takes multi-view renders and predicts a canonical 4096-Gaussian latent; the diffusion model learns the distribution over these latents.

The advantage is that the decoder is trivial (just unpack the Gaussians). The disadvantage is that 4096 Gaussians is too few for fine details; GaussianAnything ships an optional upsampling stage that produces ~50k Gaussians from the latent set.

23.4.4 Latent NeRF and the Triplane Line

The other major line is triplane-latent diffusion. EG3D (Chan et al., 2022) introduced the triplane representation: three orthogonal 2D feature planes (XY, XZ, YZ) whose values can be looked up at any 3D point via trilinear interpolation. Diffusing in triplane space gives you a 2D-friendly latent (three 256x256x32 grids) that decodes to a 3D NeRF.

SSDNeRF (Chen et al., 2024) trains a single-stage encoder-decoder that diffuses triplanes.
LN3Diff (Lan et al., 2024) adds a hierarchical structure to triplane diffusion, enabling text-to-3D and image-to-3D from the same backbone.
3DTopia (Hong et al., 2024) cascades a triplane diffusion stage with an SDS refinement stage, balancing speed and quality.

Model	Latent Type	Outputs	Inference Time (A100)	Strengths
Trellis	Sparse structured voxel	Gaussian, RF, mesh	~10 s	Output flexibility, high quality
GaussianAnything	4096 canonical Gaussians	Gaussian splat	~3 s	Fastest, native to 3DGS
LN3Diff	Hierarchical triplane	NeRF, mesh	~6 s	Text-to-3D from same backbone
3DTopia	Triplane + SDS refine	NeRF, mesh	~60 s	Highest geometric fidelity
Direct3D (2024)	Latent volume	SDF, mesh	~8 s	Clean mesh topology

Figure 23.4.2: Comparison of late-2024 to 2025 direct 3D diffusion models. The output column matters most for practitioners: pick based on whether your downstream tool wants Gaussians (Trellis, GaussianAnything) or meshes (Direct3D, 3DTopia).

23.4.5 Text Conditioning and Style Control

Trellis and LN3Diff both support text conditioning through cross-attention to a frozen text encoder (T5-XXL in Trellis, CLIP in LN3Diff). The pattern is identical to text-to-image diffusion: text tokens become keys and values, the diffusion latent becomes queries. Text-to-3D quality has lagged image-to-3D throughout 2024 to 2025; the open problem is that 3D training corpora are perhaps 1000x smaller than text-image pairs and far less diverse, so the model has trouble with compositional prompts.

Style and material control remain underdeveloped. The current pattern is to generate geometry with Trellis or similar, then re-texture using a 2D diffusion model (e.g., Paint3D, TextureDreamer) that paints a UV-unwrapped mesh. Section 23.5 covers the related problem of scene relighting.

Real-World Scenario: Game Asset Generation Pipeline

A 2025 indie studio used Trellis as the core of a procedural prop pipeline. Designers wrote text prompts ("rusted iron lantern with cracked glass") and Trellis produced a Gaussian splat + a textured glTF mesh in ~10 seconds. The glTF was imported into Unreal Engine 5 with auto-generated LODs; the Gaussian splat was used for marketing renders. Average artist time per asset dropped from 4 hours to 8 minutes, with the artist's role shifting to prompt engineering and quality review.

23.4.6 Where the Field Is Heading (2025 to 2026)

Three trends define the next 18 months:

Scaling laws for 3D: early evidence (Trellis vs. its predecessors) suggests power-law improvements with both data and parameters, mirroring Chapter 6's text scaling laws. Expect 10B-parameter 3D diffusion models by 2026.
Native rectified flow: like SD3 and Flux, the 3D diffusion world is moving from DDPM training to rectified flow / flow matching for sharper outputs at fewer sampling steps.
Scene-scale generation: Trellis-style models are object-scale. Scene-scale equivalents (BlockFusion, SceneDreamer-2) are early but improving rapidly; expect a unified scene-and-object model within the year.

Note: Pick your representation, then pick your model

The single most important decision when choosing a 3D generative model is what your downstream consumer wants. A web AR app wants a Gaussian splat (GaussianAnything, Trellis-gaussian). A game engine wants a glTF mesh with UV unwrap and PBR materials (Trellis-mesh, SF3D, Direct3D). A research project wants a NeRF (LN3Diff, SSDNeRF). The same input image can produce all of these; pick the model whose decoder matches your output target.

Key Insight

Direct 3D diffusion replaces multi-view-then-lift with single-pass 3D latent sampling. Trellis defines the state of the art with structured sparse voxel latents and a multi-decoder head that yields Gaussians, radiance fields, or meshes from one model. GaussianAnything is the fastest native-Gaussian option; triplane-based variants (LN3Diff, 3DTopia) dominate the NeRF and mesh outputs. The field is following the same scaling and rectified-flow trends as 2D image generation, with object-scale models far ahead of scene-scale.

Self-Check

Q1: Trellis splits inference into a sparse-structure stage and a structured-latent stage. Why is this faster than diffusing the full feature volume in one shot?

Show Answer

The first stage is a small binary diffusion at coarse resolution (typically 32 cubed) that decides which voxels are occupied; this is a cheap problem with sparse, low-dimensional outputs. The second stage only fills features into the small subset of voxels marked occupied (often around 1k cells out of 64 cubed). Dense feature diffusion at the full 64 cubed grid would touch over 260,000 cells at every step, whereas the two-stage cascade evaluates the expensive feature transformer on only the populated cells. The factorization mirrors cascaded 2D diffusion: coarse layout first, then details.

Q2: GaussianAnything uses 4096 Gaussian primitives as its latent. Why does it then need an upsampling stage, and what does that imply about the limits of the latent diffusion approach?

Show Answer

4096 Gaussians is enough to capture coarse shape but too few for fine details like fabric weave, small text, or thin protrusions; the model would either smear those out or sacrifice global coverage. The upsampling stage produces around 50k Gaussians from the latent set, recovering fidelity. The broader implication is that fixed-size latents trade off capacity against compute; if a class of objects needs richer detail, you need either a larger latent (slower diffusion) or a learned upsampler (the GaussianAnything path), and the upsampler must be trained jointly so it does not introduce its own artifacts.

Q3: If you need a glTF mesh for a game engine pipeline, which of the models in Figure 23.4.2a fit, and how would you choose between them?

Show Answer

Trellis (mesh decoder), 3DTopia, and Direct3D all produce mesh outputs. Choose Trellis when you also want a Gaussian splat or radiance field from the same latent, since it shares the backbone across all three. Choose Direct3D when you need clean mesh topology with predictable triangle counts and watertight surfaces for collision and physics. Choose 3DTopia when geometric fidelity is paramount and you can afford the around 60-second SDS refinement stage. For real-time iteration cycles where artists prompt and review quickly, Trellis at around 10 seconds is the practical default.

Q4: What is the analog of latent-image diffusion's "image VAE" in the 3D world, and what makes the 3D version harder to train?

Show Answer

The analog is a 3D VAE that encodes a 3D object (mesh, point cloud, or multi-view renders) into a compact regular latent and decodes back to a 3D representation. The 3D version is harder for three reasons: input formats are non-uniform (variable vertex counts, point counts, view counts), there is no single canonical 3D ground truth (meshes, SDFs, and splats are mutually convertible only with loss), and 3D training corpora are roughly 1000x smaller and less diverse than image corpora. As a result, 3D VAEs reconstruct less faithfully than image VAEs and the latent diffusion stage inherits some of that ceiling.

What Comes Next

Section 23.5: Scene Relighting and 3D Editing closes the chapter with the post-generation problem: once you have a 3D scene, how do you change its lighting, edit its geometry, or composite it into a new environment?

Further Reading

Trellis and Direct 3D Diffusion

Xiang, J., Lv, Z., Xu, S., et al. (2024). "Structured 3D Latents for Scalable and Versatile 3D Generation" (Trellis). Microsoft Research. arXiv:2412.01506

Wu, S., Lin, Y., Zhang, F., et al. (2024). "Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer." NeurIPS. arXiv:2405.14832

Gaussian and Triplane Latents

Lan, Y., Hong, F., Yang, S., et al. (2024). "GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation." arXiv. arXiv:2411.08033

Lan, Y., Tan, F., Qiu, D., et al. (2024). "LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation." ECCV. arXiv:2403.12019

Hong, F., Tang, J., Cao, Z., et al. (2024). "3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors." arXiv. arXiv:2403.02234

Foundational

Chan, E. R., Lin, C. Z., Chan, M. A., et al. (2022). "Efficient Geometry-aware 3D Generative Adversarial Networks" (EG3D triplane). CVPR. arXiv:2112.07945

Texture and Style

Cao, T., Kreis, K., Fidler, S., et al. (2023). "TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models." ICCV. arXiv:2310.13772