Section 24.9: VoxPoser: Language as Spatial Cost Field

"The map is the cost field, and the cost field is the language. Move accordingly."
Compass, Voxel-Grid-Native AI Agent

Big Picture

VoxPoser (Huang et al., 2023, arXiv:2307.05973) took a different path from SayCan and Code-as-Policies. Instead of producing a sequence of skill calls, an LLM generates Python code that fills a 3D voxel grid with attractive and repulsive cost values, and then a classical trajectory optimizer plans a path through the cost field. The result is a planner that does not need a skill library at all: the LLM controls the cost field, the optimizer controls the motion, and any task expressible as "go to high-reward voxels while avoiding high-cost ones" can be planned without per-skill data.

Prerequisites

This section assumes the LLM-grounded planning pattern from Section 24.7 and Section 24.8, plus the basic 3D scene-representation vocabulary from Section 23.1.

24.9.1 The VoxPoser Idea in One Paragraph

Fun Fact

VoxPoser (Huang et al., 2023) was among the first published works to use an LLM as a cost-field designer rather than a skill picker. The original demo had GPT-4 generating Python that called perception helpers (written in this section as get_voxel_mass("the apple")), and the motion planner behind it was a simple classical optimizer rather than a learned policy. The headline was "language as cost field"; the engineering surprise was that classical motion planners suddenly looked smart when the LLM gave them the right objective.

Discretize the workspace into a 3D voxel grid, say 100 by 100 by 100 voxels of 1 cm each. For a given instruction, an LLM writes Python code that populates two grids: a target-affordance grid A(x, y, z) with positive values at voxels the end-effector should reach, and a cost grid C(x, y, z) with positive values at voxels it should avoid. A motion planner then finds the trajectory that maximizes integrated affordance minus integrated cost. The trick is that "high-reward voxels" and "high-cost voxels" can be language-grounded: the LLM can call object-detection primitives like get_voxel_mass("the apple") to populate the grids based on what is in the scene.
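The objective "integrated affordance minus integrated cost" can be made concrete with a small scoring function. This is a minimal sketch, not the paper's implementation: the function name trajectory_score, the metres-from-grid-origin convention, and the toy grids are all assumptions made for illustration.

```python
import numpy as np

def trajectory_score(waypoints_m, affordance, cost, voxel_size_m=0.01):
    """Sum affordance minus cost over the voxels a trajectory visits.

    waypoints_m: (T, 3) end-effector positions in metres, measured from the
    grid origin (an assumed convention, not taken from the paper).
    """
    idx = np.floor(np.asarray(waypoints_m) / voxel_size_m).astype(int)
    # Clamp so that out-of-range waypoints read the nearest boundary voxel.
    idx = np.clip(idx, 0, np.array(affordance.shape) - 1)
    x, y, z = idx[:, 0], idx[:, 1], idx[:, 2]
    return float(np.sum(affordance[x, y, z] - cost[x, y, z]))

# Toy grids: one attractive voxel and one repulsive voxel directly below it.
A = np.zeros((100, 100, 100), dtype=np.float32)
C = np.zeros((100, 100, 100), dtype=np.float32)
A[50, 50, 50] = 1.0
C[50, 50, 40] = 5.0

# Two three-waypoint candidates that end at the attractive voxel.
direct = np.array([[0.505, 0.505, 0.305], [0.505, 0.505, 0.405], [0.505, 0.505, 0.505]])
detour = np.array([[0.505, 0.505, 0.305], [0.605, 0.505, 0.405], [0.505, 0.505, 0.505]])

score_direct = trajectory_score(direct, A, C)   # passes through the repulsive voxel
score_detour = trajectory_score(detour, A, C)   # steps around it
```

The detour scores higher because it collects the same affordance without paying the cost, which is exactly the preference the planner is asked to optimize.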

Key Insight: Cost fields are skill-free

VoxPoser's contribution is the realization that you can specify a manipulation task entirely by a cost field, without naming any "pick" or "place" skill. "Put the apple in the bowl" becomes "attractive voxels at the apple, then attractive voxels at the bowl, repulsive voxels at obstacles", and a generic trajectory optimizer handles the motion. The LLM is no longer the planner; it is the cost-field designer. This is a meaningfully different decomposition from SayCan (LLM ranks skills) or Code-as-Policies (LLM calls skill APIs); the skills themselves have been collapsed into the optimization objective.

24.9.2 The Reference Pipeline

VoxPoser's pipeline has five stages. Only the second stage (cost-field generation) is novel; the other four are standard robotics components. The total round-trip latency in Huang et al.'s 2023 reference implementation was about 4 seconds for the LLM call plus 1.5 seconds for trajectory optimization on a single A100, roughly 5.5 seconds per full plan, which sets the practical floor for "interactive" use.

1. Perception: detect objects in the scene and lift them into 3D positions using an off-the-shelf VLM (CLIP, OWL-ViT) plus depth from a stereo camera or a learned monocular depth estimator. The original paper used OWL-ViT-Base (Minderer et al., 2022) with a RealSense D435 depth camera; segmentation latency is roughly 200 ms per frame and dominates the perception budget.
2. LLM cost-field generation: prompt an LLM with the instruction, the scene's object list, and a Python API for manipulating voxel grids. The LLM emits Python code that, when executed, returns the affordance and cost grids. Huang et al. used GPT-4 (March 2023 version) as the cost-field-generating model; in 2026, Claude 3.5 Sonnet and GPT-4o have replaced it in most reference reimplementations, primarily for the lower per-call latency.
3. Optimization: a classical trajectory optimizer (typically a sampling-based planner like RRT or a gradient-based optimizer over splines) finds the trajectory that maximizes integrated affordance minus integrated cost. VoxPoser ships with a greedy cross-entropy method (CEM) optimizer over a 32-waypoint spline; 200 CEM iterations converge in about 1.5 seconds on a 100x100x100 voxel grid.
4. Execution: a low-level controller tracks the planned trajectory. The standard implementation uses operational-space impedance control at 1 kHz, which lets the arm comply with small perception errors without re-invoking the planner.
5. Replanning: after each motion segment, the scene is re-perceived, the LLM re-emits cost fields if necessary, and the optimizer replans. In practice, the LLM is only re-invoked when an object is added to or removed from the scene; pure-motion replanning reuses the previous cost field and runs at 5-10 Hz.
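To give the optimization stage some shape, here is a toy cross-entropy method loop over a handful of free waypoints. It is a sketch under loud assumptions: the grid is 20x20x20 rather than 100x100x100, waypoints are independent points rather than a spline, and every name (make_affordance, score, cem_plan) and hyperparameter is hypothetical rather than taken from the VoxPoser code.

```python
import numpy as np

def make_affordance(shape, target, sigma):
    """Soft attractive field: a Gaussian bump centred on the target voxel."""
    coords = np.indices(shape).transpose(1, 2, 3, 0)   # (X, Y, Z, 3) voxel coordinates
    d2 = np.sum((coords - np.asarray(target)) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)).astype(np.float32)

def score(paths, affordance, cost):
    """Integrated affordance minus integrated cost for paths of shape (N, T, 3),
    in voxel units, with a small path-length penalty as a tie-breaker."""
    hi = np.array(affordance.shape) - 1
    idx = np.clip(np.rint(paths), 0, hi).astype(int)
    x, y, z = idx[..., 0], idx[..., 1], idx[..., 2]
    length = np.linalg.norm(np.diff(paths, axis=1), axis=2).sum(axis=1)
    return (affordance[x, y, z] - cost[x, y, z]).sum(axis=1) - 0.001 * length

def cem_plan(start, affordance, cost, n_free=5, iters=40, pop=256, n_elite=32, seed=0):
    """Toy CEM: sample waypoint sets, keep the best, refit a Gaussian, repeat."""
    rng = np.random.default_rng(seed)
    start = np.asarray(start, dtype=float)
    centre = (np.array(affordance.shape) - 1) / 2.0
    mean = np.tile(centre, (n_free, 1))          # initial guess: the grid centre
    std = np.full((n_free, 3), 5.0)
    for _ in range(iters):
        free = rng.normal(mean, std, size=(pop, n_free, 3))
        paths = np.concatenate([np.broadcast_to(start, (pop, 1, 3)), free], axis=1)
        elite = free[np.argsort(score(paths, affordance, cost))[-n_elite:]]
        mean = elite.mean(axis=0)
        std = np.maximum(elite.std(axis=0), 0.5)  # floor keeps some exploration alive
    return np.concatenate([start[None, :], mean], axis=0)

shape = (20, 20, 20)
target = (15, 15, 10)
A = make_affordance(shape, target, sigma=4.0)
C = np.zeros(shape, dtype=np.float32)
C[8:11, 8:11, :] = 5.0                            # a repulsive column in mid-workspace
start = (2, 2, 10)
path = cem_plan(start, A, C)                      # (6, 3): start plus five free waypoints
```

Because the objective integrates affordance over every waypoint, the free waypoints all drift toward the target rather than spacing themselves along a route. A practical planner would add smoothness terms and check the segments between waypoints for collisions.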

import numpy as np

# The skill-free API the LLM uses to build cost fields.
def get_empty_voxel_grid() -> np.ndarray:
    """Return a 100x100x100 float32 grid initialized to zero, covering 1 m^3 in front of the robot."""

def get_voxel_mass(description: str) -> np.ndarray:
    """Return a boolean voxel grid marking voxels occupied by the named object.

    Backed by a CLIP-based open-vocabulary detector plus depth estimation.
    """

def gaussian_blur(grid: np.ndarray, sigma: float) -> np.ndarray:
    """Smooth a voxel grid with a 3D Gaussian (sigma in voxels). Useful for converting hard masks into soft fields."""

def distance_field(grid: np.ndarray) -> np.ndarray:
    """Return the per-voxel Euclidean distance, in voxels, from the nearest True voxel."""

Code Fragment 24.9.1: The four primitives VoxPoser exposes to the LLM. The API is small on purpose: any cost field expressible as combinations of object masks, smoothing, and distance transforms can be built. Adding more primitives gives the LLM more expressive power but also more rope to hang itself with.
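The stubs above only declare an interface. One plausible way to back three of them is with NumPy and scipy.ndimage, sketched below. This assumes SciPy is installed, and fake_voxel_mass is a hypothetical stand-in that draws a box where the real get_voxel_mass would call a detector; none of this is the paper's code.

```python
import numpy as np
from scipy import ndimage

GRID_SHAPE = (100, 100, 100)   # 1 cm voxels over a 1 m cube, as in the text

def get_empty_voxel_grid() -> np.ndarray:
    return np.zeros(GRID_SHAPE, dtype=np.float32)

def fake_voxel_mass(center, half_extent) -> np.ndarray:
    """Hypothetical stand-in for get_voxel_mass: a boolean box, no perception."""
    mask = np.zeros(GRID_SHAPE, dtype=bool)
    lo = np.maximum(np.array(center) - half_extent, 0)
    hi = np.array(center) + half_extent + 1
    mask[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = True
    return mask

def gaussian_blur(grid: np.ndarray, sigma: float) -> np.ndarray:
    # sigma is in voxels; gaussian_filter applies it along all three axes.
    return ndimage.gaussian_filter(grid.astype(np.float32), sigma=sigma)

def distance_field(grid: np.ndarray) -> np.ndarray:
    # distance_transform_edt measures distance to the nearest zero element,
    # so the mask is inverted to get distance to the nearest True voxel.
    return ndimage.distance_transform_edt(~grid.astype(bool)).astype(np.float32)

apple = fake_voxel_mass(center=(50, 50, 50), half_extent=2)   # 5x5x5 voxel box
soft = gaussian_blur(apple, sigma=2.0)      # hard mask turned into a soft field
dist = distance_field(apple)                # 0 inside the box, growing outward
```

The inversion in distance_field is the detail most easily missed: passing the mask directly would measure distance to empty space instead of distance to the object.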

24.9.3 A Worked Example: "Pour Water Into the Cup"

The LLM, given the instruction "pour water from the bottle into the cup" and a scene with one bottle and one cup, emits the following cost-field program:

# LLM-generated cost field for "pour water from the bottle into the cup"
import numpy as np

def build_cost_fields():
    affordance = get_empty_voxel_grid()
    cost = get_empty_voxel_grid()

    # Stage 1: end-effector should approach the bottle handle
    bottle = get_voxel_mass("the bottle handle")
    affordance += gaussian_blur(bottle.astype(np.float32), sigma=2.0)

    # Stage 2: once gripping, end-effector position should be above the cup with the bottle tilted
    cup = get_voxel_mass("the cup opening")
    # Shift 10 voxels (10 cm) toward +z, assuming axis 2 is z and indices increase upward.
    # np.roll wraps around, so this is only safe when the cup sits more than 10 cm below the grid top.
    above_cup = np.roll(cup, shift=10, axis=2)
    affordance += gaussian_blur(above_cup.astype(np.float32), sigma=2.0)

    # Repulsive: avoid the table edge and any obstacles
    obstacles = get_voxel_mass("any obstacle")
    cost += 5 * obstacles.astype(np.float32)

    # Near-hard constraint: heavily penalize a 20-voxel (20 cm) zone around the user
    user = get_voxel_mass("the human")
    user_dist = distance_field(user)   # distance in voxels
    cost += 100 * (user_dist < 20).astype(np.float32)

    return affordance, cost

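Code Fragment 24.9.2: The LLM-emitted cost field for a pouring task. The two-stage affordance (bottle then cup) gives the optimizer a multi-waypoint objective; the user-distance cost is a large penalty that acts as a near-hard safety constraint rather than a true hard constraint. Because both stages are summed into one grid, the ordering (bottle first, then cup) has to come from how the planner sequences the stages. A trajectory optimizer running on these two grids finds a path that reaches the bottle, then arcs over to the cup, while staying away from the user.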
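One detail of the "above the cup" shift deserves care: np.roll wraps around, so a mask pushed past the edge of the grid reappears on the opposite side. A non-wrapping shift can be sketched as follows. The helper name shift_no_wrap is hypothetical, and the example assumes axis 2 is z with indices increasing upward.

```python
import numpy as np

def shift_no_wrap(mask: np.ndarray, shift: int, axis: int) -> np.ndarray:
    """Shift a voxel mask by `shift` cells along `axis`, dropping anything
    pushed past the grid edge instead of wrapping it to the other side."""
    out = np.zeros_like(mask)
    n = mask.shape[axis]
    if abs(shift) >= n:
        return out
    src = [slice(None)] * mask.ndim
    dst = [slice(None)] * mask.ndim
    if shift >= 0:
        src[axis] = slice(0, n - shift)
        dst[axis] = slice(shift, n)
    else:
        src[axis] = slice(-shift, n)
        dst[axis] = slice(0, n + shift)
    out[tuple(dst)] = mask[tuple(src)]
    return out

cup = np.zeros((100, 100, 100), dtype=bool)
cup[50, 50, 95] = True                            # a cup rim near the top of the grid

wrapped = np.roll(cup, shift=10, axis=2)          # reappears at z = 5, near the floor
clipped = shift_no_wrap(cup, shift=10, axis=2)    # falls off the top, so nothing is set
```

With np.roll the affordance would silently move to the bottom of the workspace. The clipped version instead yields an empty mask, which a pre-motion check on the affordance grid can detect.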

Key Insight: The LLM does language; the optimizer does motion

VoxPoser separates concerns more cleanly than SayCan or Code-as-Policies. The LLM understands "near the bottle handle", "above the cup", "avoid the user", and translates these into voxel grids. The optimizer understands collision-free trajectories, smoothness, and dynamics constraints. Neither component needs to know the other's specialty; the cost-grid interface is the bridge. This is the same separation-of-concerns pattern that makes function calling work: the LLM picks the function name and arguments, the runtime executes them. VoxPoser is "function calling, but the function is a numerical optimizer".

24.9.4 Where VoxPoser Shines

Three task categories play to VoxPoser's strengths. First, tasks with novel objects: because the LLM uses get_voxel_mass("the apple") rather than picking from a fixed skill list, novel objects are handled the moment the perception model recognizes them. Second, tasks with implicit spatial relationships: "place the cup just to the left of the plate" maps naturally to voxel-position arithmetic, where SayCan would need a left-of-plate skill or Code-as-Policies would need a careful place_at_relative_position API. Third, tasks with overlapping objectives: the optimizer happily integrates over both reaching and avoiding, so a single planning step can satisfy multiple constraints simultaneously.

Task type	SayCan	Code-as-Policies	VoxPoser
Pick known object	Excellent (has skill)	Excellent	Good
Pick novel object	Poor (no skill)	Good (needs perception)	Excellent
Spatial constraint ("3 cm left of X")	Poor	OK (with API)	Excellent
Multi-objective ("reach A, avoid B")	OK (two steps)	Good	Excellent
Long-horizon (10+ steps)	Good (designed for this)	Good	Poor (re-plans each step)
High-DOF dexterity	Poor	Poor	Poor

Figure 24.9.1a: VoxPoser versus SayCan and Code-as-Policies across task categories. VoxPoser wins on spatial constraints and novel objects; it loses on long-horizon planning, where re-ranking discrete steps beats re-optimizing continuous fields.
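The "left of the plate" case from this subsection comes down to a few lines of voxel arithmetic. The sketch below assumes that left is the negative x direction, that the offset is measured from the plate's left edge, and that the plate mask is a hand-drawn box rather than a perception result. The helper names are hypothetical.

```python
import numpy as np

def gaussian_affordance(shape, center_voxel, sigma_voxels):
    """Affordance grid with a Gaussian bump centred on `center_voxel`."""
    coords = np.indices(shape).transpose(1, 2, 3, 0)    # (X, Y, Z, 3)
    d2 = np.sum((coords - np.asarray(center_voxel, dtype=float)) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma_voxels ** 2)).astype(np.float32)

def left_of_target(mask, offset_voxels):
    """Voxel `offset_voxels` beyond the object's left edge (assumed to be -x),
    at the object's centroid in y and z. Expects a non-empty mask."""
    occupied = np.argwhere(mask)
    centroid = occupied.mean(axis=0)
    x_left_edge = occupied[:, 0].min()
    return np.array([x_left_edge - offset_voxels, centroid[1], centroid[2]])

plate = np.zeros((100, 100, 100), dtype=bool)
plate[40:61, 40:61, 20] = True                    # hypothetical 21 cm square plate

target = left_of_target(plate, offset_voxels=3)   # 3 cm at 1 cm per voxel
affordance = gaussian_affordance(plate.shape, target, sigma_voxels=2.0)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(affordance), affordance.shape))
```

Measuring from the edge rather than the centre matters for large objects: 3 cm from the centre of a 21 cm plate would still be on the plate.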

24.9.5 Where VoxPoser Struggles

VoxPoser inherits the limitations of classical motion planning. It assumes a static or slowly changing scene; replanning after the scene changes is expensive because the voxel grids must be rebuilt. It assumes a narrow action representation (end-effector position over time), so discrete actions such as "open the gripper" do not fit cleanly and require auxiliary signals or a skill-level overlay. It is bottlenecked by the perception model: if get_voxel_mass("the small red screw") returns an empty mask, the planner has no fallback. And the LLM's cost-field programs occasionally produce empty or degenerate fields when the instruction is ambiguous, leading to optimizer failures that are harder to diagnose than the discrete failures of SayCan or Code-as-Policies.

Warning: Voxel-grid resolution is a hidden knob

VoxPoser at 1 cm grid spacing gives smooth trajectories but needs 1 million voxels per grid over a 1 m cube, which is slow to populate, blur, and optimize over. At 2 cm spacing the grid shrinks to 125,000 voxels and the optimizer is roughly 8x faster, but it cannot represent objects smaller than ~3 cm. At 0.5 cm spacing the grid grows to 8 million voxels and the optimizer hits memory limits on a 24 GB GPU. The default 1 cm is a compromise that is ideal for neither precision nor throughput. Production VoxPoser-style systems often use adaptive resolution (fine grids near objects of interest, coarse grids elsewhere), which adds engineering complexity but eases the trade-off.
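The scaling behind this warning is simple cubic arithmetic, worked through below for a 1 m cube stored as float32. The helper grid_stats is hypothetical, and the sketch counts only the raw grid, not any optimizer working memory.

```python
def grid_stats(spacing_cm, workspace_cm=100.0, bytes_per_voxel=4):
    """Cells per side, total voxels, and megabytes for one dense grid."""
    n = round(workspace_cm / spacing_cm)
    voxels = n ** 3
    return n, voxels, voxels * bytes_per_voxel / 1e6

for spacing in (2.0, 1.0, 0.5):
    n, voxels, mb = grid_stats(spacing)
    print(f"{spacing} cm: {n}^3 = {voxels:,} voxels, {mb:.1f} MB per float32 grid")
```

Halving the spacing multiplies the voxel count by 8, which matches the 8x speed difference quoted above. The raw grids themselves stay small (tens of megabytes even at 0.5 cm), so any memory pressure at fine resolution presumably comes from the optimizer's batched intermediate buffers, which this sketch does not model.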

24.9.6 Modern Extensions: VoxPoser in 2026

The 2026 successors to VoxPoser embed the same idea in a richer scene representation. Three notable threads:

Gaussian-splat scenes (covered in Chapter 40). Replace the voxel grid with a 3D Gaussian Splatting representation of the scene. The LLM emits a program that selects which Gaussians are attractive or repulsive. Inference is faster because the splat representation is sparser than a dense voxel grid; the precision is higher because splats can be sub-voxel-sized.

Continuous neural cost fields. Replace the voxel grid with a coordinate-input MLP that outputs (affordance, cost) at any 3D point. The LLM emits a small program that builds the MLP by composing pretrained sub-networks. This is more memory-efficient but harder to debug.

VLM-direct cost-field generation. Skip the Python-emit step entirely: a VLM takes the instruction and the scene image, and produces the affordance and cost fields directly via a learned regression head. This is the cleanest formulation but requires substantial training data, which is hard to obtain at scale.
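For the Gaussian-splat thread, the appeal to a gradient-based optimizer is that a cost built from a weighted sum of Gaussians has a closed-form gradient. The sketch below assumes the cost takes that form, with per-splat weights standing in for whatever the LLM program marks as repulsive. The function name and the example splats are hypothetical, not drawn from any published system.

```python
import numpy as np

def splat_cost_and_grad(x, means, inv_covs, weights):
    """Cost and analytic gradient at point x for a weighted sum of 3D Gaussians.

    means: (K, 3); inv_covs: (K, 3, 3) symmetric inverse covariances; weights: (K,).
    cost(x) = sum_k w_k * exp(-0.5 * (x - mu_k)^T P_k (x - mu_k))
    """
    d = x - means                                        # (K, 3)
    m = np.einsum("kij,kj->ki", inv_covs, d)             # P_k (x - mu_k), shape (K, 3)
    e = weights * np.exp(-0.5 * np.sum(d * m, axis=1))   # per-splat contribution
    cost = float(e.sum())
    grad = -(e[:, None] * m).sum(axis=0)                 # valid for symmetric P_k
    return cost, grad

# Two anisotropic repulsive splats, positions in metres.
means = np.array([[0.30, 0.50, 0.20], [0.60, 0.40, 0.30]])
covs = np.array([np.diag([0.01, 0.02, 0.005]), np.diag([0.02, 0.01, 0.01])])
inv_covs = np.linalg.inv(covs)
weights = np.array([5.0, 3.0])

x = np.array([0.40, 0.45, 0.25])
cost, grad = splat_cost_and_grad(x, means, inv_covs, weights)
```

Evaluation touches K splats rather than every voxel, and the gradient comes from the same intermediate terms as the cost, so no finite differencing is needed.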

Research Frontier: What VoxPoser still does that VLAs do not

A VLA (Chapter 24) is an end-to-end neural policy: image plus instruction goes in, motor commands come out. It does not produce an inspectable plan. VoxPoser produces an explicit cost field that a human can visualize and verify. For safety-critical robotics (medical, manufacturing, transportation) the inspectable-plan property is a hard requirement, which is why VoxPoser-style planners survive in 2026 production stacks despite the rise of VLAs. The two are complementary: use the VLA for the low-level reaching motion, use VoxPoser to provide the high-level spatial intent the VLA executes against.

Key Takeaway

VoxPoser has an LLM emit Python that fills a 3D affordance grid and a 3D cost grid; a classical trajectory optimizer then finds the path that maximizes affordance minus cost. The decomposition (LLM does language-to-spatial-intent, optimizer does spatial-intent-to-motion) gives a planner that handles novel objects and spatial constraints elegantly but struggles with long-horizon tasks. Production stacks in 2026 use it as the spatial-intent layer on top of VLA executors.

Self-Check

Q1: Why does VoxPoser handle "place the cup 3 cm to the left of the plate" more easily than SayCan or Code-as-Policies? Tie your answer to the voxel-grid representation.

Answer

SayCan and Code-as-Policies operate over a discrete skill vocabulary (pick, place, move-to-pose); a spatial offset like "3 cm left of the plate" must be smuggled in through an argument to place(target), and the planner has no native way to reason about that offset. VoxPoser's voxel grid is a metric scaffold: at 1 cm spacing, 3 cm is simply 3 voxels, so the LLM emits Python that writes a positive affordance Gaussian centered on (plate_x minus 3 cm, plate_y, plate_z) directly into the grid, taking plate_x as the plate's left edge and left as the negative x direction. The classical trajectory optimizer then finds a path to that region, no skill primitive required. The advantage compounds for spatial relations like "between," "above," or "halfway from A to B," all of which are easy to express as voxel-grid edits but cumbersome as skill arguments.

Q2: You are replacing the voxel grid with a Gaussian Splat representation. What two properties of the splat representation make the optimizer faster? What property makes it harder to debug?

Answer

Speed property one: splats are sparse. You store only a few thousand anisotropic Gaussians instead of a dense grid of millions of voxels, so cost evaluation is O(splats) rather than O(voxels) and memory bandwidth can drop by an order of magnitude. Speed property two: splat cost is differentiable in closed form, so gradient-based trajectory optimizers (CHOMP, TrajOpt) get analytic gradients without finite differencing. The debugging downside is the loss of a grid index: with a voxel grid you can dump a 3D slice as an image and see exactly which cells the LLM lit up. With splats you must rasterize back to a grid for visualization, and overlapping Gaussians can produce non-obvious cost surfaces where the high-cost region is not at any single splat center but in the region where several splats overlap.

Q3: Sketch the failure mode where the LLM emits a cost field with zero affordance mass (the optimizer has no target). What runtime check would catch this before motion begins?

Answer

The LLM may misinterpret the instruction (it grounds "place the cup on the plate" against the wrong object, or both objects are out of view) and emit only cost-region voxels with no positive affordance. The optimizer then finds the lowest-cost reachable voxel, which can be anywhere in the workspace, producing a confidently-wrong motion. The runtime check is a sentinel before optimization: assert that the affordance grid has at least one voxel above a small positive threshold (e.g., affordance.max() > tau) and that the argmax lies inside the reachable workspace. If either fails, the planner refuses to execute and reissues the LLM call with the failure as feedback, which is the same exception-feedback pattern from Section 24.8.
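The sentinel described in this answer fits in a few lines. This is a minimal sketch: the exception class, the function name check_affordance, the reachable mask, and the default threshold are all hypothetical choices, not part of any published VoxPoser API.

```python
import numpy as np

class DegenerateCostField(ValueError):
    """Raised when the affordance grid gives the optimizer no usable target."""

def check_affordance(affordance, reachable, tau=1e-3):
    """Return the peak voxel index, or raise if there is no valid target.

    affordance: float grid; reachable: boolean grid of the same shape marking
    voxels the arm can reach (assumed to be precomputed elsewhere).
    """
    if affordance.shape != reachable.shape:
        raise ValueError("affordance and reachable grids must share a shape")
    peak = float(affordance.max())
    if not peak > tau:                       # written this way so NaN also fails
        raise DegenerateCostField(f"no affordance above tau={tau}; max is {peak}")
    idx = np.unravel_index(np.argmax(affordance), affordance.shape)
    if not reachable[idx]:
        where = tuple(int(i) for i in idx)
        raise DegenerateCostField(f"affordance peak at voxel {where} is unreachable")
    return idx

reachable = np.zeros((10, 10, 10), dtype=bool)
reachable[2:8, 2:8, 2:8] = True

good = np.zeros((10, 10, 10), dtype=np.float32)
good[5, 5, 5] = 1.0
peak_idx = check_affordance(good, reachable)     # passes: peak is inside the workspace
```

The exception message is the natural payload to send back to the LLM as feedback when the planner refuses to execute.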

What's Next

Continue to Section 24.10: Multi-Robot Dispatch via Shared LLM.

Section 24.10 turns to a different scaling axis: not one robot with one planner, but many robots sharing a single LLM dispatcher. The Multi-Robot Dispatch pattern uses an LLM to coordinate a fleet through natural-language task assignment.

Further Reading

Huang, W., et al. (2023). VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. "CoRL 2023, arXiv:2307.05973".

Minderer, M., et al. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT). "ECCV 2022, arXiv:2205.06230".

Radford, A., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision (CLIP). "ICML 2021, arXiv:2103.00020".

Liang, J., et al. (2023). Code as Policies: Language Model Programs for Embodied Control. "ICRA 2023, arXiv:2209.07753".

Kerbl, B., et al. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. "SIGGRAPH 2023, arXiv:2308.04079".

Yu, W., et al. (2023). Language to Rewards for Robotic Skill Synthesis. "CoRL 2023, arXiv:2306.08647".