"The map is the cost field, and the cost field is the language. Move accordingly."
Compass, Voxel-Grid-Native AI Agent
VoxPoser (Huang et al., 2023, arXiv:2307.05973) took a different path from SayCan and Code-as-Policies. Instead of producing a sequence of skill calls, an LLM generates Python code that fills a 3D voxel grid with attractive and repulsive cost values, and then a classical trajectory optimizer plans a path through the cost field. The result is a planner that does not need a skill library at all: the LLM controls the cost field, the optimizer controls the motion, and any task expressible as "go to high-reward voxels while avoiding high-cost ones" can be planned without per-skill data.
Prerequisites
This section assumes the LLM-grounded planning pattern from Section 24.7 and Section 24.8, plus the basic 3D scene-representation vocabulary from Section 23.1.
24.9.1 The VoxPoser Idea in One Paragraph
VoxPoser (Huang et al., 2023) was the first published work to use an LLM as a cost-field designer rather than a skill picker. The original demo had GPT-4 generating Python that called functions like get_voxel_mass("the apple"), and the trajectory optimizer was a stock CHOMP-style planner from a 2009 paper. The headline was "language as cost field"; the engineering surprise was that classical motion planners suddenly looked smart when the LLM gave them the right objective.
Discretize the workspace into a 3D voxel grid, say 100 by 100 by 100 voxels of 1 cm each. For a given instruction, an LLM writes Python code that populates two grids: a target-affordance grid A(x, y, z) with positive values at voxels the end-effector should reach, and a cost grid C(x, y, z) with positive values at voxels it should avoid. A motion planner then finds the trajectory that maximizes integrated affordance minus integrated cost. The trick is that "high-reward voxels" and "high-cost voxels" can be language-grounded: the LLM can call object-detection primitives like get_voxel_mass("the apple") to populate the grids based on what is in the scene.
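The objective in the paragraph above can be made concrete with a small sketch. It approximates "integrated affordance minus integrated cost" by a discrete sum over the voxels a trajectory visits. The function name `trajectory_score`, the toy 10x10x10 grid, and the clipping of out-of-grid waypoints are illustrative assumptions, not part of VoxPoser's published code.

```python
import numpy as np

def trajectory_score(affordance, cost, waypoints):
    """Sum affordance minus cost over the voxels a trajectory visits.

    waypoints: (N, 3) sequence of integer voxel indices (x, y, z).
    """
    idx = np.asarray(waypoints, dtype=int)
    # Assumption: waypoints outside the grid read the nearest boundary voxel.
    idx = np.clip(idx, 0, np.array(affordance.shape) - 1)
    x, y, z = idx[:, 0], idx[:, 1], idx[:, 2]
    return float(np.sum(affordance[x, y, z] - cost[x, y, z]))

# Toy workspace: one attractive voxel and one repulsive voxel.
A = np.zeros((10, 10, 10), dtype=np.float32)
C = np.zeros((10, 10, 10), dtype=np.float32)
A[8, 5, 5] = 1.0   # target
C[4, 5, 5] = 5.0   # obstacle on the straight-line path

# Straight path along x passes through the obstacle voxel.
straight = [(x, 5, 5) for x in range(0, 9)]
# Detour steps sideways in y around the obstacle, then rejoins the line.
detour = (
    [(x, 5, 5) for x in range(0, 4)]
    + [(3, 6, 5), (4, 6, 5), (5, 6, 5), (5, 5, 5)]
    + [(x, 5, 5) for x in range(6, 9)]
)

score_straight = trajectory_score(A, C, straight)
score_detour = trajectory_score(A, C, detour)
```

The detour scores higher because it collects the same affordance at the target while skipping the obstacle penalty, which is the behavior the optimizer is asked to find.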
VoxPoser's contribution is the realization that you can specify a manipulation task entirely by a cost field, without naming any "pick" or "place" skill. "Put the apple in the bowl" becomes "attractive voxels at the apple, then attractive voxels at the bowl, repulsive voxels at obstacles", and a generic trajectory optimizer handles the motion. The LLM is no longer the planner; it is the cost-field designer. This is a meaningfully different decomposition from SayCan (LLM ranks skills) or Code-as-Policies (LLM calls skill APIs); the skills themselves have been collapsed into the optimization objective.
24.9.2 The Reference Pipeline
VoxPoser's pipeline has five stages. Only the second stage (cost-field generation) is novel; the rest draws on the classical robotics literature. The total round-trip latency in Huang et al.'s 2023 reference implementation was about 4 seconds for the LLM call plus 1.5 seconds for trajectory optimization on a single A100, which sets the practical floor for "interactive" use.
- Perception: detect objects in the scene and lift them into 3D positions using an off-the-shelf VLM (CLIP, OWL-ViT) plus depth from a stereo camera or learned monocular depth estimator. The original paper used OWL-ViT-Base (Minderer et al., 2022) with a RealSense D435 depth camera; segmentation latency is roughly 200 ms per frame and dominates the perception budget.
- LLM cost-field generation: prompt an LLM with the instruction, the scene's object list, and a Python API for manipulating voxel grids. The LLM emits Python code that, when executed, returns the affordance and cost grids. Huang et al. used GPT-4 (March 2023 version) as the cost-field-generating model; in 2026, Claude 3.5 Sonnet and GPT-4o have replaced it in most reference reimplementations, primarily for the lower per-call latency.
- Optimization: a classical trajectory optimizer (typically a sampling-based planner like RRT or a gradient-based optimizer over splines) finds the trajectory that maximizes integrated affordance minus integrated cost. VoxPoser ships with a greedy CEM (cross-entropy method) optimizer over a 32-waypoint spline; 200 CEM iterations converge in about 1.5 seconds on a 100x100x100 voxel grid.
- Execution: a low-level controller tracks the planned trajectory. The standard implementation uses operational-space impedance control at 1 kHz, which lets the arm comply with small perception errors without re-invoking the planner.
- Replanning: after each motion segment, the scene is re-perceived, the LLM re-emits cost fields if necessary, and the optimizer replans. In practice, the LLM is only re-invoked when an object is added or removed from the scene; pure-motion replanning reuses the previous cost field and runs at 5-10 Hz.
import numpy as np

# The skill-free API the LLM uses to build cost fields.
# Interface stubs: signatures and docstrings only.
def get_empty_voxel_grid() -> np.ndarray:
    """Return a 100x100x100 float32 grid initialized to zero, covering 1 m^3 in front of the robot."""

def get_voxel_mass(description: str) -> np.ndarray:
    """Return a boolean voxel grid marking voxels occupied by the named object.

    Backed by a CLIP-based open-vocabulary detector plus depth estimation.
    """

def gaussian_blur(grid: np.ndarray, sigma: float) -> np.ndarray:
    """Smooth a voxel grid with a 3D Gaussian. Useful for converting hard masks into soft fields."""

def distance_field(grid: np.ndarray) -> np.ndarray:
    """Return the per-voxel Euclidean distance from the nearest True voxel."""
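The optimization stage of the pipeline can be illustrated with a stripped-down cross-entropy method loop. This is a sketch under strong simplifying assumptions: it searches for a single 3D target point rather than a 32-waypoint spline, uses a toy 20x20x20 field, and all names (`cem_maximize`, `point_score`) and hyperparameters are hypothetical rather than taken from the VoxPoser code.

```python
import numpy as np

def cem_maximize(score_fn, mean, std, n_samples=64, n_elite=8, n_iters=30, seed=0):
    """Cross-entropy method: sample, keep the elites, refit the Gaussian, repeat."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, mean.size))
        scores = np.array([score_fn(s) for s in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]  # highest-scoring samples
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-3                  # small floor avoids zero spread
    return mean

# Toy field: a Gaussian affordance bump at voxel (15, 15, 15), no cost.
grid = np.indices((20, 20, 20)).astype(float)
peak = np.array([15.0, 15.0, 15.0])
sq_dist = sum((grid[i] - peak[i]) ** 2 for i in range(3))
affordance = np.exp(-sq_dist / (2 * 4.0 ** 2))
cost = np.zeros_like(affordance)

def point_score(p):
    """Affordance minus cost at the voxel nearest to the continuous point p."""
    i, j, k = np.clip(np.rint(p).astype(int), 0, 19)
    return affordance[i, j, k] - cost[i, j, k]

best = cem_maximize(point_score, mean=[12.0, 12.0, 12.0], std=[5.0, 5.0, 5.0])
```

A full trajectory optimizer would apply the same sample-score-refit loop to the spline's waypoint parameters and add smoothness or dynamics terms to the score.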
24.9.3 A Worked Example: "Pour Water Into the Cup"
The LLM, given the instruction "pour water from the bottle into the cup" and a scene with one bottle and one cup, emits the following cost-field program:
# LLM-generated cost field for "pour water from the bottle into the cup"
import numpy as np

def build_cost_fields():
    affordance = get_empty_voxel_grid()
    cost = get_empty_voxel_grid()
    # Stage 1: end-effector should approach the bottle handle
    bottle = get_voxel_mass("the bottle handle")
    affordance += gaussian_blur(bottle.astype(np.float32), sigma=2.0)
    # Stage 2: once gripping, end-effector position should be above the cup with the bottle tilted
    cup = get_voxel_mass("the cup opening")
    # 10 voxels = 10 cm above the cup, assuming the z index (axis 2) increases upward.
    # Note that np.roll wraps voxels around at the grid boundary.
    above_cup = np.roll(cup, shift=10, axis=2)
    affordance += gaussian_blur(above_cup.astype(np.float32), sigma=2.0)
    # Repulsive: avoid any obstacles
    obstacles = get_voxel_mass("any obstacle")
    cost += 5 * obstacles.astype(np.float32)
    # Near-hard constraint: a large penalty inside the no-fly zone around the user
    user = get_voxel_mass("the human")
    user_dist = distance_field(user)  # distance in voxels (1 voxel = 1 cm)
    cost += 100 * (user_dist < 20).astype(np.float32)
    return affordance, cost
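One detail of the program above deserves care: `np.roll` wraps voxels that leave the grid around to the opposite face, so an "above the cup" offset near the top of the workspace can reappear near the floor. The helper below is a hypothetical alternative (`shift_no_wrap` is not part of the API listed earlier) that zero-fills instead of wrapping. It assumes the z index increases upward.

```python
import numpy as np

def shift_no_wrap(mask, offset, axis):
    """Shift a voxel mask by `offset` voxels along `axis`, zero-filling instead of wrapping."""
    shifted = np.zeros_like(mask)
    src = [slice(None)] * mask.ndim
    dst = [slice(None)] * mask.ndim
    if offset > 0:
        src[axis] = slice(None, -offset)
        dst[axis] = slice(offset, None)
    elif offset < 0:
        src[axis] = slice(-offset, None)
        dst[axis] = slice(None, offset)
    shifted[tuple(dst)] = mask[tuple(src)]
    return shifted

# A cup opening near the top of a 10-voxel-tall toy grid.
cup = np.zeros((4, 4, 10), dtype=bool)
cup[2, 2, 8] = True

wrapped = np.roll(cup, shift=3, axis=2)         # voxel reappears at z = 1
clipped = shift_no_wrap(cup, offset=3, axis=2)  # voxel leaves the grid and is dropped
```

Dropping the voxel produces an empty target mask, which is easier to detect and report than a target that has silently moved to the wrong side of the workspace.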
VoxPoser separates concerns cleanly. The LLM understands "near the bottle handle", "above the cup", "avoid the user", and translates these into voxel grids. The optimizer understands collision-free trajectories, smoothness, and dynamics constraints. Neither component needs to know the other's specialty; the cost-grid interface is the bridge. This is the same separation-of-concerns pattern that makes function calling work: the LLM picks the function name and arguments, the runtime executes them. VoxPoser is "function calling, but the function is a numerical optimizer".
24.9.4 Where VoxPoser Shines
Three task categories play to VoxPoser's strengths. First, tasks with novel objects: because the LLM uses get_voxel_mass("the apple") rather than picking from a fixed skill list, novel objects are handled the moment the perception model recognizes them. Second, tasks with implicit spatial relationships: "place the cup just to the left of the plate" maps naturally to voxel-position arithmetic, where SayCan would need a left-of-plate skill or Code-as-Policies would need a careful place_at_relative_position API. Third, tasks with overlapping objectives: the optimizer happily integrates over both reaching and avoiding, so a single planning step can satisfy multiple constraints simultaneously.
| Task type | SayCan | Code-as-Policies | VoxPoser |
|---|---|---|---|
| Pick known object | Excellent (has skill) | Excellent | Good |
| Pick novel object | Poor (no skill) | Good (needs perception) | Excellent |
| Spatial constraint ("3 cm left of X") | Poor | OK (with API) | Excellent |
| Multi-objective ("reach A, avoid B") | OK (two steps) | Good | Excellent |
| Long-horizon (10+ steps) | Good (designed for this) | Good | Poor (re-plans each step) |
| High-DOF dexterity | Poor | Poor | Poor |
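The spatial-constraint row can be illustrated with a short sketch of the voxel arithmetic behind "3 cm left of the plate". It assumes that "left" means the negative x direction, that one voxel is 1 cm, and that the plate mask would come from `get_voxel_mass("the plate")`. The mask here is hand-built and the helper `gaussian_affordance_at` is hypothetical.

```python
import numpy as np

def gaussian_affordance_at(shape, center, sigma):
    """Return a float32 grid holding a Gaussian bump (peak 1.0) centered on `center`, in voxels."""
    coords = np.indices(shape).astype(np.float32)
    sq_dist = sum((coords[i] - center[i]) ** 2 for i in range(3))
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).astype(np.float32)

# Hypothetical plate mask standing in for get_voxel_mass("the plate").
shape = (40, 40, 40)
plate = np.zeros(shape, dtype=bool)
plate[20:27, 18:25, 10] = True

# Assumption: "left" is the -x direction and one voxel is 1 cm.
plate_center = np.argwhere(plate).mean(axis=0)      # centroid in voxel coordinates
target = plate_center + np.array([-3.0, 0.0, 0.0])  # 3 cm to the left of the centroid
affordance = gaussian_affordance_at(shape, target, sigma=2.0)
peak_voxel = np.unravel_index(np.argmax(affordance), shape)
```

Relations such as "above" or "halfway from A to B" follow the same pattern: compute a target point from object centroids, then write a soft bump around it.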
24.9.5 Where VoxPoser Struggles
VoxPoser inherits the limitations of classical motion planning. It assumes a static or slowly changing scene; rapid replanning is expensive because the voxel grid must be rebuilt. Its action representation is narrow (end-effector position over time), so discrete actions such as "open the gripper" are not handled cleanly and require auxiliary signals or a skill-level overlay. It is bottlenecked by the perception model: if get_voxel_mass("the small red screw") returns an empty grid, the planner has no fallback. And LLM-generated cost-field programs occasionally produce empty or degenerate fields when the instruction is ambiguous, leading to optimizer failures that are harder to diagnose than the discrete failures of SayCan or Code-as-Policies.
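One way to catch empty or degenerate fields before they reach the optimizer is a cheap sanity check on the affordance grid. The sketch below is an assumption-laden illustration: the threshold `tau`, the boolean `reachable` mask, and the function name `validate_affordance` are all hypothetical, and the finite-value check is an extra guard added here.

```python
import numpy as np

def validate_affordance(affordance, reachable, tau=0.1):
    """Return (ok, reason). `reachable` is a boolean grid of voxels the arm can reach."""
    if not np.isfinite(affordance).all():
        return False, "affordance contains NaN or inf"
    if affordance.max() <= tau:
        return False, "affordance is empty or below threshold"
    peak = np.unravel_index(np.argmax(affordance), affordance.shape)
    if not reachable[peak]:
        return False, "affordance peak is outside the reachable workspace"
    return True, "ok"

# Hypothetical workspace: the arm reaches only voxels with x index below 15.
reachable = np.zeros((20, 20, 20), dtype=bool)
reachable[:15] = True

empty = np.zeros((20, 20, 20), dtype=np.float32)  # perception returned nothing
good = empty.copy()
good[5, 5, 5] = 1.0                               # target inside the workspace
far = empty.copy()
far[18, 5, 5] = 1.0                               # target outside the workspace

results = [validate_affordance(g, reachable) for g in (empty, good, far)]
```

The returned reason string can be fed back to the LLM as failure feedback when the cost-field program is regenerated.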
At 1 cm grid spacing, VoxPoser gives smooth trajectories but needs 1 million voxels per grid, which is slow to populate, blur, and optimize over. At 2 cm spacing the grid has 8x fewer voxels and the optimizer is 8x faster, but it cannot represent objects smaller than ~3 cm. At 0.5 cm spacing the optimizer hits memory limits on a 24 GB GPU. The default 1 cm is a compromise: too coarse for small parts, too heavy for fast replanning. Production VoxPoser-style systems often use adaptive resolution (fine grids near objects of interest, coarse grids elsewhere), which adds engineering complexity but eases the trade-off.
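The voxel counts behind this trade-off follow from simple arithmetic, sketched below for the 1 m^3 workspace used throughout this section. The calculation covers raw float32 storage for a single dense grid only; optimizer working memory (for example, batches of sampled trajectories) is not modeled. The function name `grid_stats` is illustrative.

```python
WORKSPACE_M = 1.0     # edge length of the cubic workspace, in meters
BYTES_PER_VOXEL = 4   # float32

def grid_stats(spacing_cm):
    """Return (voxels per axis, total voxels, megabytes for one float32 grid)."""
    per_axis = round(WORKSPACE_M * 100 / spacing_cm)
    total = per_axis ** 3
    return per_axis, total, total * BYTES_PER_VOXEL / 1e6

for spacing in (2.0, 1.0, 0.5):
    per_axis, total, mb = grid_stats(spacing)
    print(f"{spacing} cm: {per_axis}^3 = {total:,} voxels, {mb:.1f} MB per grid")
```

Halving the spacing multiplies the voxel count by 8, which is why each step in resolution has such a large effect on populate, blur, and optimize times.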
24.9.6 Modern Extensions: VoxPoser in 2026
The 2026 successors to VoxPoser embed the same idea in a richer scene representation. Three notable threads:
Gaussian-splat scenes (covered in Chapter 40). Replace the voxel grid with a 3D Gaussian Splatting representation of the scene. The LLM emits a program that selects which Gaussians are attractive or repulsive. Inference is faster because the splat representation is sparser than a dense voxel grid; the precision is higher because splats can be sub-voxel-sized.
Continuous neural cost fields. Replace the voxel grid with a coordinate-input MLP that outputs (affordance, cost) at any 3D point. The LLM emits a small program that builds the MLP by composing pretrained sub-networks. This is more memory-efficient but harder to debug.
VLM-direct cost-field generation. Skip the Python-emit step entirely: a VLM takes the instruction and the scene image, and produces the affordance and cost fields directly via a learned regression head. This is the cleanest formulation but requires substantial training data, which is hard to obtain at scale.
A VLA (Chapter 24) is an end-to-end neural policy: image plus instruction goes in, motor commands come out. It does not produce an inspectable plan. VoxPoser produces an explicit cost field that a human can visualize and verify. For safety-critical robotics (medical, manufacturing, transportation) the inspectable-plan property is a hard requirement, which is why VoxPoser-style planners survive in 2026 production stacks despite the rise of VLAs. The two are complementary: use the VLA for the low-level reaching motion, use VoxPoser to provide the high-level spatial intent the VLA executes against.
Key Takeaway
VoxPoser has an LLM emit Python that fills a 3D affordance grid and a 3D cost grid; a classical trajectory optimizer then finds the path that maximizes affordance minus cost. The decomposition (LLM does language-to-spatial-intent, optimizer does spatial-intent-to-motion) gives a planner that handles novel objects and spatial constraints elegantly but struggles with long-horizon tasks. Production stacks in 2026 use it as the spatial-intent layer on top of VLA executors.
place(target), and the planner has no native way to reason about that offset. VoxPoser's voxel grid is a continuous metric scaffold: the LLM emits Python that writes a positive affordance Gaussian centered on (plate_x minus 3 cm, plate_y, plate_z) directly into the grid. The classical trajectory optimizer then finds a path to that voxel automatically, no skill primitive required. The advantage compounds for spatial relations like "between," "above," or "halfway from A to B," all of which are easy to express as voxel-grid edits but cumbersome as skill arguments.
affordance.max() > tau) and that the argmax lies inside the reachable workspace. If either fails, the planner refuses to execute and reissues the LLM call with the failure as feedback, which is the same exception-feedback pattern from Section 24.8.
Section 24.10 turns to a different scaling axis: not one robot with one planner, but many robots sharing a single LLM dispatcher. The Multi-Robot Dispatch pattern uses an LLM to coordinate a fleet through natural-language task assignment.