"Four planners, four trade-offs, one decision rubric. Pick the planner that matches your robot, not the paper."
Compass, Trade-Off-Matrix-Native AI Agent
Sections 24.7 through 24.11 covered four planning paradigms: SayCan's skill-ranking product, Code-as-Policies' executable-program approach, VoxPoser's spatial-cost-field optimization, and the RT-X-style end-to-end VLA-as-planner. This section puts them side by side. Each has a deployment regime where it dominates, and getting the choice wrong is expensive in both engineering time and failure rate. The comparison below is the working consensus from 2025-2026 deployment retrospectives.
Prerequisites
This section assumes familiarity with the planner architectures from Section 24.7 through Section 24.11. The comparison-matrix methodology is covered in detail later in the book.
24.12.1 The Four Paradigms at a Glance
SayCan (Ahn et al., 2022) was among the first papers to formalize the "LLM planner plus low-level policy" decomposition; its demonstration had Google's PaLM model planning kitchen tasks for an Everyday Robots mobile manipulator. The original system relied on a library of several hundred language-conditioned skills (the paper reports 551), each backed by a learned policy; the team estimates the skill collection cost more engineering time than the LLM planning research itself, a ratio that has stayed true for SayCan-style systems ever since.
The simplest way to see the structural difference is to look at what each paradigm produces as its output.
| Paradigm | LLM emits | Robot consumes | Requires skill library? |
|---|---|---|---|
| SayCan | Likelihood scores over discrete skill names | Argmax of LLM score × affordance value; runs that skill's trained policy | Yes (large, per-skill imitation learning) |
| Code-as-Policies | Python program (or structured JSON) | Python interpreter executing skill API calls | Yes (smaller, exposed as Python functions) |
| VoxPoser | Python that fills affordance + cost grids | Classical trajectory optimizer over grids | No (skills replaced by optimizer) |
| End-to-end VLA | Motor tokens directly | Action detokenizer + low-level controller | No (single network does everything) |
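The SayCan row compresses a two-factor score: the LLM rates how useful each skill is for the instruction, an affordance estimate rates how likely the skill is to succeed from the current state, and the robot runs the skill with the highest product. The sketch below illustrates only that selection step. The function name, skill names, and all numbers are assumptions made for illustration, not values from the SayCan paper.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest combined score.

    llm_scores:  how useful the LLM rates each skill for the instruction.
    affordances: how likely each skill is to succeed from the current state.
    Both dictionaries hold illustrative placeholder values.
    """
    combined = {
        name: llm_scores[name] * affordances.get(name, 0.0)
        for name in llm_scores
    }
    return max(combined, key=combined.get)


# Hypothetical scores for the instruction "bring me a sponge".
llm_scores = {"pick up sponge": 0.60, "go to counter": 0.30, "pick up apple": 0.10}
# The robot is far from the counter, so picking is not yet feasible.
affordances = {"pick up sponge": 0.10, "go to counter": 0.90, "pick up apple": 0.10}

chosen = select_skill(llm_scores, affordances)
print(chosen)  # go to counter
```

The LLM alone would pick "pick up sponge"; the affordance term is what pushes the navigation skill to the front, which is the grounding behavior the product is meant to provide.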
24.12.2 The Trade-Off Matrix
Each paradigm trades off across the dimensions that matter for real deployments: long-horizon reasoning, novel-object handling, spatial precision, safety auditability, engineering cost (both time to first demo and production cost), cross-task generalization, and real-time control. Below is the working evaluation.
| Dimension | SayCan | Code-as-Policies | VoxPoser | End-to-end VLA |
|---|---|---|---|---|
| Long-horizon planning (10+ steps) | Excellent | Excellent | Poor | Limited (context) |
| Novel objects (zero-shot) | Poor | Good | Excellent | Good |
| Spatial precision (sub-cm) | Poor | OK (with skills) | Excellent | OK |
| Safety auditability | Excellent | Excellent | Good | Poor |
| Time to first demo | Weeks (skill collection) | Days (API design) | Days | Hours (off-the-shelf) |
| Production engineering cost | High (per-skill data) | Medium (API + sandbox) | Medium (perception + opt.) | Low (one model) |
| Cross-task generalization | Poor (closed skill set) | Good | Excellent | Excellent (if trained on it) |
| Real-time control (above 10 Hz) | n/a (it is the planner) | n/a | n/a | Excellent (with optimization) |
Production stacks in 2026 combine paradigms. The dominant pattern is "Code-as-Policies at the top, VLA at the bottom": an LLM emits a Python plan whose primitives are themselves VLA sub-instructions, and the VLA carries each sub-instruction to motor commands. SayCan and VoxPoser slot in as specialized layers within this hybrid: SayCan for safety-critical reranking, VoxPoser for spatially precise placement tasks. The pure-paradigm question "is your robot SayCan or Code-as-Policies?" is a false dichotomy that academic papers created and production teams rarely face.
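A minimal sketch of the "Code-as-Policies at the top, VLA at the bottom" pattern follows. It shows the kind of program an LLM planner might emit: ordinary Python control flow, with each primitive being a natural-language sub-instruction handed to a VLA. The names `run_plan`, `StepResult`, and `fake_vla` are hypothetical, and `fake_vla` is a stand-in for a real policy so the example runs without hardware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    instruction: str
    success: bool


def run_plan(items: List[str],
             vla_execute: Callable[[str], bool],
             max_retries: int = 2) -> List[StepResult]:
    """Plain Python control flow on top, one VLA sub-instruction per call below."""
    log: List[StepResult] = []
    for item in items:
        instruction = f"pick up the {item} and place it in the tote"
        success = False
        for _ in range(1 + max_retries):
            success = vla_execute(instruction)
            if success:
                break
        log.append(StepResult(instruction, success))
        if not success:
            break  # stop and surface the failure to the caller for replanning
    return log


def fake_vla(instruction: str) -> bool:
    """Stand-in for a real VLA policy; fails on anything mentioning 'glass'."""
    return "glass" not in instruction


results = run_plan(["sponge", "glass", "apple"], fake_vla)
for r in results:
    print(r.success, r.instruction)
```

The loop, the retry budget, and the early exit all live in readable Python, while the contact-rich part of each step is delegated to the learned policy. The returned log is also the natural place to attach the task IDs and progress records that a surrounding system may need.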
24.12.3 The Decision Tree
The practical question is not "what is the best planner?" but "which planner do I start with for my specific deployment?" The decision tree below is the working heuristic from 2025-2026 deployment retrospectives.
```python
def choose_planner(project):
    """Working decision tree from 2025-2026 deployment retrospectives."""
    if project.tasks_are_safety_critical:
        # Medical, surgical, industrial automation
        return "SayCan-style discrete skill ranking with formal verification"
    if project.novel_objects_per_week > 10:
        # Home robot, lab assistant, open environment
        if project.precision_required_mm < 10:
            return "VoxPoser for spatial precision over open vocabulary"
        else:
            return "Code-as-Policies with VLA skill primitives"
    if project.task_horizon_steps > 20:
        # Multi-step household, warehouse picking, long workflows
        return "Code-as-Policies (control flow handles long horizons)"
    if project.single_task_family and project.has_demo_data:
        # Specialized industrial cell, fixed task
        return "End-to-end VLA finetuned on your task family"
    # The default: a hybrid stack that combines the paradigms
    return "Code-as-Policies at the top, VLA primitives, VoxPoser for tricky placements"
```
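The function above reads six attributes from `project` without saying what type carries them. One possible shape, offered as an assumption rather than a prescribed interface, is a small dataclass; the same tree can then be written as an ordered rule list, which some teams may find easier to review and extend. `ProjectProfile`, `RULES`, and `choose_planner_from_rules` are hypothetical names, the default field values are placeholders, and the recommendation strings are shortened.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ProjectProfile:
    """Hypothetical container for the attributes the decision tree reads."""
    tasks_are_safety_critical: bool = False
    novel_objects_per_week: int = 0
    precision_required_mm: float = 50.0
    task_horizon_steps: int = 5
    single_task_family: bool = False
    has_demo_data: bool = False


Rule = Tuple[Callable[[ProjectProfile], bool], str]

# Ordered rules: the first predicate that matches wins, mirroring the tree.
RULES: List[Rule] = [
    (lambda p: p.tasks_are_safety_critical, "SayCan-style skill ranking"),
    (lambda p: p.novel_objects_per_week > 10 and p.precision_required_mm < 10,
     "VoxPoser"),
    (lambda p: p.novel_objects_per_week > 10, "Code-as-Policies with VLA primitives"),
    (lambda p: p.task_horizon_steps > 20, "Code-as-Policies"),
    (lambda p: p.single_task_family and p.has_demo_data, "End-to-end VLA"),
]
DEFAULT = "Hybrid: Code-as-Policies on top, VLA primitives, VoxPoser for placements"


def choose_planner_from_rules(profile: ProjectProfile) -> str:
    """Return the recommendation of the first matching rule, else the default."""
    for predicate, recommendation in RULES:
        if predicate(profile):
            return recommendation
    return DEFAULT


home_robot = ProjectProfile(novel_objects_per_week=30, precision_required_mm=5.0)
fixed_cell = ProjectProfile(single_task_family=True, has_demo_data=True)
print(choose_planner_from_rules(home_robot))  # VoxPoser
print(choose_planner_from_rules(fixed_cell))  # End-to-end VLA
```

Because rule order encodes priority, the safety-critical check still overrides everything else, exactly as in the nested version.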
24.12.4 Latency and Throughput Comparison
Beyond capability, the four paradigms differ in their inference cost. The numbers below assume a typical manipulation task on a single robot served by a single A100 inference server.
| Paradigm | Planning latency | Per-step cost | Replanning rate |
|---|---|---|---|
| SayCan | 1-3 s (LLM ranking, ~N skills) | ~3 LLM calls per step | Per skill (every 5-20 s) |
| Code-as-Policies | 2-5 s (LLM program emission) | 1 LLM call for full plan | On exception only |
| VoxPoser | 3-10 s (LLM + perception + opt.) | 1 LLM call + optimizer | Per motion segment |
| End-to-end VLA | 30-200 ms (per action) | 1 forward pass | Continuous (5-25 Hz) |
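A short calculation makes the table's consequences concrete. The sketch below takes the latency ranges from the table and assumes a task of ten skill-level steps (the step count is an assumption, as is the simplification that SayCan pays its planning latency once per skill while Code-as-Policies pays it once per task when no exception forces a replan).

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (low, high)

# Planning latency in seconds per LLM-driven decision, copied from the table.
PLANNING_LATENCY_S: Dict[str, Range] = {
    "SayCan": (1.0, 3.0),            # paid again at every skill boundary
    "Code-as-Policies": (2.0, 5.0),  # paid once if no exception forces a replan
}


def planning_overhead(paradigm: str, n_steps: int) -> Range:
    """Total planning time in seconds over a task of n_steps skills."""
    low, high = PLANNING_LATENCY_S[paradigm]
    calls = n_steps if paradigm == "SayCan" else 1
    return (low * calls, high * calls)


def control_rate_hz(latency_ms: Range) -> Range:
    """Convert a per-action latency range in milliseconds to a rate range in Hz."""
    fast_ms, slow_ms = latency_ms
    return (1000.0 / slow_ms, 1000.0 / fast_ms)


print(planning_overhead("SayCan", 10))            # (10.0, 30.0)
print(planning_overhead("Code-as-Policies", 10))  # (2.0, 5.0)
print(control_rate_hz((30.0, 200.0)))             # about (5.0, 33.3)
```

Under these assumptions a ten-step task spends 10-30 s planning with SayCan against 2-5 s with Code-as-Policies, and a 30-200 ms forward pass corresponds to roughly 5-33 Hz of raw inference, the same order as the 5-25 Hz replanning rate in the table.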
A 2025 warehouse-robotics deployment chose Code-as-Policies for the high-level task ("pick the items in this order") combined with VLA primitives for the low-level grasping. SayCan was ruled out because adding skills for new product SKUs (which happens daily) was too expensive. VoxPoser was ruled out because the planning latency was too high for the bin-picking throughput targets. Pure end-to-end VLA was ruled out because the warehouse management system needed structured task IDs and progress tracking that an end-to-end model does not natively expose. The hybrid was the working answer.
24.12.5 The Interpretability Spectrum
A second axis cuts across the four paradigms: how interpretable is the plan? SayCan is the most interpretable (each step is a named skill that humans can audit); Code-as-Policies is nearly as interpretable (a Python program is human-readable); VoxPoser is partially interpretable (you can visualize the cost field but not necessarily the optimizer's trajectory); end-to-end VLAs are nearly opaque (the action stream comes out without a structured intermediate representation). Interpretability matters less in research demos and more in production deployments where audit logs are required.
| Plan inspectability question | SayCan | CaP | VoxPoser | VLA |
|---|---|---|---|---|
| Can you see the high-level plan? | Yes (skill list) | Yes (Python) | Partial (cost field) | No |
| Can you re-run with a different LLM? | Yes | Yes | Yes | No (entangled) |
| Can you human-edit the plan? | Yes (rerank skills) | Yes (edit Python) | Partial (edit grid) | No |
| Can you formally verify the plan? | Yes (small skill set) | Partial (typed AST) | Partial (constraint check) | No |
| Can you replay the plan offline? | Yes | Yes | Yes | Yes (with same observation) |
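The "Partial" entry for verifying a Code-as-Policies plan can be made concrete with Python's standard `ast` module: parse the emitted program without running it and reject any call that is not on an allow-list. This is a minimal sketch of one such check, not a sandbox; the allow-list and the skill names `pick`, `place`, and `move_to` are a hypothetical API chosen for illustration.

```python
import ast
from typing import List, Set

ALLOWED_CALLS: Set[str] = {"pick", "place", "move_to", "range", "len"}  # hypothetical API


def disallowed_calls(plan_source: str, allowed: Set[str] = ALLOWED_CALLS) -> List[str]:
    """Return labels for every import or call that is not on the allow-list.

    The plan is only parsed, never executed. Attribute calls such as
    os.system(...) are rejected by default and reported by attribute name.
    """
    violations: List[str] = []
    for node in ast.walk(ast.parse(plan_source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                if func.id not in allowed:
                    violations.append(func.id)
            else:
                # Method or other non-name call target; reject by default.
                violations.append(getattr(func, "attr", "<dynamic call>"))
    return violations


safe_plan = """
for obj in ["cup", "plate"]:
    pick(obj)
    place(obj, "rack")
"""

unsafe_plan = """
import os
pick("cup")
os.system("reboot")
"""

print(disallowed_calls(safe_plan))            # []
print(sorted(disallowed_calls(unsafe_plan)))  # ['import', 'system']
```

A check like this covers only which functions a plan may call. Argument ranges, loop bounds, and physical safety constraints need separate checks, which is why the table rates verification as partial rather than complete.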
By Q4 2025 the four paradigms had begun to converge on a unified architecture. The "modern planner" uses structured tool calling (CaP-style) where each tool is a learned skill (SayCan-style) implemented as a VLA conditioned on a sub-instruction (end-to-end VLA), with VoxPoser-style spatial cost fields available as a tool the LLM can invoke when fine spatial reasoning is needed. The four paradigms become four tools in a single hybrid agent. Research papers in 2026 increasingly stop arguing for a single paradigm and start contributing to specific layers of the hybrid stack.
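The converged architecture can be sketched as a tool registry plus a dispatcher that runs structured tool calls and keeps an audit trace. Everything below is an assumption for illustration: the tool names, the call format (`{"tool": ..., "args": ...}`), and the stub return values stand in for real VLA and cost-field components.

```python
from typing import Any, Callable, Dict, List


def vla_execute(instruction: str) -> Dict[str, Any]:
    """Stub for a VLA skill; a real one would stream motor commands."""
    return {"status": "ok", "detail": f"executed: {instruction}"}


def voxposer_plan_motion(instruction: str) -> Dict[str, Any]:
    """Stub for a cost-field planner; returns a placeholder waypoint count."""
    return {"status": "ok", "waypoints": 12}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "vla_execute": vla_execute,
    "voxposer_plan_motion": voxposer_plan_motion,
}


def dispatch(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run structured tool calls in order and keep an auditable trace."""
    trace: List[Dict[str, Any]] = []
    for call in tool_calls:
        name, args = call["tool"], call.get("args", {})
        if name not in TOOLS:
            trace.append({"tool": name, "status": "rejected"})
            break  # unknown tool: stop rather than guess
        result = TOOLS[name](**args)
        trace.append({"tool": name, **result})
        if result["status"] != "ok":
            break  # a failed step ends the run so the planner can replan
    return trace


calls = [
    {"tool": "vla_execute", "args": {"instruction": "pick up the mug"}},
    {"tool": "voxposer_plan_motion",
     "args": {"instruction": "place the mug 2 cm left of the plate"}},
]
for entry in dispatch(calls):
    print(entry)
```

The registry is where the discrete, auditable skill set lives, the call list is the inspectable plan, and each tool body is free to be a learned policy or an optimizer.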
24.12.6 What Does Not Fit the Comparison
Two paradigms sit outside this matrix. Inner Monologue (Huang et al., 2022) is an early SayCan extension that closes the loop in natural language: textual feedback from the environment (success detection, scene descriptions, human input) is appended to the LLM's prompt so the planner can reason about how execution is going and replan; modern instruction-tuned LLMs with tool calling provide this loop largely by default. RT-H and similar hierarchical end-to-end models embed a planning stage inside the VLA itself, so the structure of "LLM plans then VLA executes" collapses into a single network with two output heads; structurally this is an end-to-end VLA in our matrix, with extra interpretability through the intermediate output head. Both are worth knowing but do not change the four-corner trade-off space.
Key Takeaway
SayCan, Code-as-Policies, VoxPoser, and end-to-end VLA are four points on a spectrum from structured-discrete to learned-continuous. Each has a regime where it dominates: SayCan for safety-critical small-skill-set tasks, Code-as-Policies for long-horizon multi-step plans, VoxPoser for novel-object spatial reasoning, end-to-end VLA for reactive control on tasks the model was trained on. Production stacks in 2026 are hybrids that use Code-as-Policies as the high-level glue and VLA-as-tool at the bottom, with SayCan and VoxPoser slotted in where their strengths matter.
vla.execute(instruction, frames) or voxposer.plan_motion(instruction), gets back a status and optional return value, and uses it as a normal Python expression in subsequent steps.
The final section of this chapter, Section 24.13, drops down to the deployment layer: the sim-to-real gap is the dominant obstacle between a working planner and a working robot, and the patterns for closing it (domain randomization, real-world fine-tuning, hardware-in-the-loop validation) are what separate research demos from shipped products.