Section 24.12: Comparing the Planners

"Four planners, four trade-offs, one decision rubric. Pick the planner that matches your robot, not the paper."
Compass, Trade-Off-Matrix-Native AI Agent

Big Picture

Sections 24.7 through 24.11 covered four planning paradigms: SayCan's skill-ranking product, Code-as-Policies' executable-program approach, VoxPoser's spatial-cost-field optimization, and the RT-X-style end-to-end VLA-as-planner. This section puts them side by side. Each has a deployment regime where it dominates, and getting the choice wrong is expensive in both engineering time and failure rate. The comparison below is the working consensus from 2025-2026 deployment retrospectives.

Prerequisites

This section assumes familiarity with the planner architectures from Sections 24.7 through 24.11. The comparison-matrix methodology is covered in detail later in the book.

24.12.1 The Four Paradigms at a Glance

Fun Fact

SayCan (Ahn et al., 2022) was the first paper to formalize the "LLM planner plus low-level policy" decomposition, and the demonstration was Google's PaLM model planning kitchen tasks for an Everyday Robot. The original codebase used roughly 100 hand-trained imitation-learning skills; the team estimates the skill collection cost more engineering time than the LLM planning research itself, a ratio that has stayed true for SayCan-style systems ever since.

The simplest way to see the structural difference is to look at what each paradigm produces as its output.

Paradigm	LLM emits	Robot consumes	Requires skill library?
SayCan	Probability over discrete skill names	Argmax skill, runs its trained policy	Yes (large, per-skill imitation learning)
Code-as-Policies	Python program (or structured JSON)	Python interpreter executing skill API calls	Yes (smaller, exposed as Python functions)
VoxPoser	Python that fills affordance + cost grids	Classical trajectory optimizer over grids	No (skills replaced by optimizer)
End-to-end VLA	Motor tokens directly	Action detokenizer + low-level controller	No (single network does everything)

Table 24.12.1: The four planning paradigms by output type. The progression is from highly-structured (discrete skill names) to fully-continuous (motor tokens), with increasing model responsibility at each step.
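To make the first row of the table concrete, here is a minimal sketch of SayCan-style selection. It assumes the combined score is the product of the LLM's probability for a skill and an affordance estimate of whether that skill can succeed from the current state. The skill names and all numbers are invented for illustration.

```python
from typing import Dict


def saycan_select(llm_probs: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest (LLM probability x affordance) score."""
    scores = {
        skill: llm_probs[skill] * affordances.get(skill, 0.0)
        for skill in llm_probs
    }
    return max(scores, key=scores.get)


# Hypothetical values for the instruction "bring me a sponge" while the
# robot is still far from the sponge.
llm_probs = {"pick up sponge": 0.60, "go to counter": 0.30, "pick up apple": 0.10}
affordances = {"pick up sponge": 0.10, "go to counter": 0.90, "pick up apple": 0.05}

chosen = saycan_select(llm_probs, affordances)
print(chosen)
```

In this toy case the LLM alone would pick "pick up sponge", but the low affordance for grasping from the current position pushes the product toward the navigation skill first.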

24.12.2 The Trade-Off Matrix

Each paradigm trades off across eight dimensions that matter for real deployments: long-horizon planning, novel-object handling, spatial precision, safety auditability, time to first demo, production engineering cost, cross-task generalization, and real-time control. Below is the working evaluation.

Dimension	SayCan	Code-as-Policies	VoxPoser	End-to-end VLA
Long-horizon planning (10+ steps)	Excellent	Excellent	Poor	Limited (context)
Novel objects (zero-shot)	Poor	Good	Excellent	Good
Spatial precision (sub-cm)	Poor	OK (with skills)	Excellent	OK
Safety auditability	Excellent	Excellent	Good	Poor
Time to first demo	Weeks (skill collection)	Days (API design)	Days	Hours (off-the-shelf)
Production engineering cost	High (per-skill data)	Medium (API + sandbox)	Medium (perception + opt.)	Low (one model)
Cross-task generalization	Poor (closed skill set)	Good	Excellent	Excellent (if trained on it)
Real-time control (above 10 Hz)	n/a (it is the planner)	n/a	n/a	Excellent (with optimization)

Table 24.12.2: Trade-off matrix across the four paradigms. No single column is dominant. The right architecture depends on which trade-offs your project tolerates and which it cannot.
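One way to put Table 24.12.2 to work is to turn the ratings into ordinal scores and weight the dimensions your project cares about. The sketch below does this for four of the rows. The numeric mapping (Excellent = 3 down to Poor = 0, with "Limited" treated like "OK") and the example weights are assumptions made for illustration, not part of the table.

```python
# Assumed ordinal mapping; the table itself only gives labels.
RATING = {"Excellent": 3, "Good": 2, "OK": 1, "Limited": 1, "Poor": 0}

# Four rows copied from Table 24.12.2.
MATRIX = {
    "long_horizon": {"SayCan": "Excellent", "Code-as-Policies": "Excellent",
                     "VoxPoser": "Poor", "End-to-end VLA": "Limited"},
    "novel_objects": {"SayCan": "Poor", "Code-as-Policies": "Good",
                      "VoxPoser": "Excellent", "End-to-end VLA": "Good"},
    "spatial_precision": {"SayCan": "Poor", "Code-as-Policies": "OK",
                          "VoxPoser": "Excellent", "End-to-end VLA": "OK"},
    "safety_auditability": {"SayCan": "Excellent", "Code-as-Policies": "Excellent",
                            "VoxPoser": "Good", "End-to-end VLA": "Poor"},
}


def rank_paradigms(weights):
    """Return (paradigm, weighted score) pairs, best first."""
    totals = {}
    for dimension, weight in weights.items():
        for paradigm, label in MATRIX[dimension].items():
            totals[paradigm] = totals.get(paradigm, 0) + weight * RATING[label]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical project: precise placement of unfamiliar objects, short tasks.
ranking = rank_paradigms({"spatial_precision": 2, "novel_objects": 2, "long_horizon": 1})
print(ranking)
```

A weighted sum like this is only a first filter. It cannot express hard constraints such as "must be auditable", which is why the decision tree in the next subsection checks those conditions first.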

Key Insight: The answer is rarely "pick one"

Production stacks in 2026 combine paradigms. The dominant pattern is "Code-as-Policies at the top, VLA at the bottom": an LLM emits a Python plan whose primitives are themselves VLA sub-instructions, and the VLA carries each sub-instruction to motor commands. SayCan and VoxPoser slot in as specialized layers within this hybrid: SayCan for safety-critical reranking, VoxPoser for spatially-precise placement tasks. The pure-paradigm question "is your robot SayCan or Code-as-Policies" is a false dichotomy that academic papers have created but production rarely faces.
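The "Code-as-Policies at the top, VLA at the bottom" pattern can be sketched with a stub in place of the real policy. Everything here is hypothetical: `StubVLA`, its `execute` method, and the hand-written `emitted_plan` stand in for a real VLA and for a program that an LLM would generate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubVLA:
    """Stand-in for a VLA policy; a real one would stream motor commands."""
    log: List[str] = field(default_factory=list)

    def execute(self, instruction: str) -> bool:
        self.log.append(instruction)
        # Pretend every sub-instruction succeeds unless the object is unknown.
        return "unicorn" not in instruction


def emitted_plan(vla: StubVLA, items: List[str]) -> List[str]:
    """Shape of an LLM-emitted program: ordinary control flow whose
    primitives are natural-language sub-instructions for the VLA."""
    failed = []
    for item in items:
        if not vla.execute(f"pick up the {item}"):
            failed.append(item)  # skip placement, report the failure upward
            continue
        vla.execute(f"place the {item} in the bin")
    return failed


vla = StubVLA()
failed = emitted_plan(vla, ["mug", "unicorn", "sponge"])
print(failed, len(vla.log))
```

The point of the split is visible in the return value: the top layer keeps structured state (which items failed) that a pure end-to-end policy would not expose.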

24.12.3 The Decision Tree

The practical question is not "what is the best planner" but "which planner do I start with for my specific deployment". The decision tree below is the working heuristic from 2025-2026 deployment interviews.

from dataclasses import dataclass


@dataclass
class Project:
    """Deployment profile read by the tree. Defaults are illustrative only."""
    tasks_are_safety_critical: bool = False
    novel_objects_per_week: int = 0
    precision_required_mm: float = 50.0
    task_horizon_steps: int = 1
    single_task_family: bool = False
    has_demo_data: bool = False


def choose_planner(project):
    """Working decision tree from 2025-2026 deployment retrospectives."""
    if project.tasks_are_safety_critical:
        # Medical, surgical, industrial automation
        return "SayCan-style discrete skill ranking with formal verification"

    if project.novel_objects_per_week > 10:
        # Home robot, lab assistant, open environment
        if project.precision_required_mm < 10:
            return "VoxPoser for spatial precision over open vocabulary"
        else:
            return "Code-as-Policies with VLA skill primitives"

    if project.task_horizon_steps > 20:
        # Multi-step household, warehouse picking, long workflows
        return "Code-as-Policies (control flow handles long horizons)"

    if project.single_task_family and project.has_demo_data:
        # Specialized industrial cell, fixed task
        return "End-to-end VLA finetuned on your task family"

    # The default: hybrid stack (Code-as-Policies, VLA, and VoxPoser layers)
    return "Code-as-Policies at the top, VLA primitives, VoxPoser for tricky placements"

Code Fragment 24.12.1a: The planner-selection decision tree as a Python function. The default branch (hybrid stack) is the answer in most non-specialized cases. The four explicit branches identify the regimes where a single paradigm clearly dominates.

24.12.4 Latency and Throughput Comparison

Beyond capability, the four paradigms differ in their inference cost. The numbers below assume a manipulation task on a single robot served by a single A100 inference server.

Paradigm	Planning latency	Per-step cost	Replanning rate
SayCan	1-3 s (LLM ranking, ~N skills)	~3 LLM calls per step	Per skill (every 5-20 s)
Code-as-Policies	2-5 s (LLM program emission)	1 LLM call for full plan	On exception only
VoxPoser	3-10 s (LLM + perception + opt.)	1 LLM call + optimizer	Per motion segment
End-to-end VLA	30-200 ms (per action)	1 forward pass	Continuous (5-25 Hz)

Table 24.12.3: Latency profiles. The VLA dominates on reactive control rates; the LLM-based planners dominate on plan quality and interpretability. The hybrid pattern uses the VLA for control and an LLM-based planner for the high-level structure, which is what production stacks settle on.
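The latency ranges in the table above translate into total planning overhead once a task length is fixed. The worked arithmetic below assumes an 8-step task, one VoxPoser motion segment per step, and a Code-as-Policies plan that never hits an exception. All three are simplifying assumptions, and execution time is ignored.

```python
# (low, high) planning latency in seconds, copied from the table above.
PLANNING_LATENCY_S = {
    "SayCan": (1.0, 3.0),            # paid once per skill step
    "Code-as-Policies": (2.0, 5.0),  # paid once per plan
    "VoxPoser": (3.0, 10.0),         # paid once per motion segment
}


def planning_overhead(paradigm, n_steps):
    """Total (low, high) seconds spent planning over a task of n_steps."""
    low, high = PLANNING_LATENCY_S[paradigm]
    calls = 1 if paradigm == "Code-as-Policies" else n_steps
    return (calls * low, calls * high)


def max_control_rate_hz(latency_s):
    """Upper bound on control rate if each action takes latency_s seconds."""
    return 1.0 / latency_s


for name in PLANNING_LATENCY_S:
    print(name, planning_overhead(name, n_steps=8))

# End-to-end VLA: 30-200 ms per action bounds the achievable control rate.
print(max_control_rate_hz(0.200), max_control_rate_hz(0.030))
```

Under these assumptions the per-step planners accumulate tens of seconds of planning time over the task, while the single-plan approach pays its cost once. The reciprocal of the VLA's per-action latency gives a ceiling of roughly 5 to 33 Hz, which brackets the 5-25 Hz replanning rate listed in the table.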

Real-World Scenario: "Warehouse picking" case study

A 2025 warehouse-robotics deployment chose Code-as-Policies for the high-level task ("pick the items in this order") combined with VLA primitives for the low-level grasping. SayCan was ruled out because adding skills for new product SKUs (which happens daily) was too expensive. VoxPoser was ruled out because the planning latency was too high for the bin-picking throughput targets. Pure end-to-end VLA was ruled out because the warehouse management system needed structured task IDs and progress tracking that an end-to-end model does not natively expose. The hybrid was the working answer.

24.12.5 The Interpretability Spectrum

A second axis cuts across the four paradigms: how interpretable is the plan? SayCan is the most interpretable (each step is a named skill that humans can audit); Code-as-Policies is nearly as interpretable (a Python program is human-readable); VoxPoser is partially interpretable (you can visualize the cost field but not necessarily the optimizer's trajectory); end-to-end VLAs are nearly opaque (the action stream comes out without a structured intermediate representation). Interpretability matters less in research demos and more in production deployments where audit logs are required.

Plan inspectability question	SayCan	CaP	VoxPoser	VLA
Can you see the high-level plan?	Yes (skill list)	Yes (Python)	Partial (cost field)	No
Can you re-run with a different LLM?	Yes	Yes	Yes	No (entangled)
Can you human-edit the plan?	Yes (rerank skills)	Yes (edit Python)	Partial (edit grid)	No
Can you formally verify the plan?	Yes (small skill set)	Partial (typed AST)	Partial (constraint check)	No
Can you replay the plan offline?	Yes	Yes	Yes	Yes (with same observation)

Table 24.12.4: Plan inspectability across the four paradigms. The trade-off mirrors a familiar pattern: more learned end-to-end means higher capability and lower auditability. For safety-critical settings, the inspectability is often what locks in the choice of paradigm regardless of capability differences.
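The "Partial (typed AST)" entry for Code-as-Policies can be illustrated with Python's standard `ast` module: before running an LLM-emitted program, walk its syntax tree and reject any call that is not on an allowlist. The allowlist and the `robot.pick` / `robot.place` API names below are hypothetical. This is a static screening step, not formal verification, and it requires Python 3.9 or later for `ast.unparse`.

```python
import ast

# Hypothetical skill API plus a couple of harmless builtins.
ALLOWED_CALLS = {"robot.pick", "robot.place", "range", "len"}


def disallowed_calls(source: str) -> list:
    """Return the sorted names of calls and imports not on the allowlist."""
    bad = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bad.add("import")  # generated plans should not import anything
        elif isinstance(node, ast.Call):
            name = ast.unparse(node.func)  # e.g. "robot.pick" or "os.system"
            if name not in ALLOWED_CALLS:
                bad.add(name)
    return sorted(bad)


safe_plan = "for i in range(3):\n    robot.pick('cup')\n    robot.place('shelf')\n"
unsafe_plan = "import os\nrobot.pick('cup')\nos.system('rm -rf /')\n"

print(disallowed_calls(safe_plan))
print(disallowed_calls(unsafe_plan))
```

The check is "partial" in the table's sense because it constrains which functions a plan may call but says nothing about whether the arguments or the ordering of calls are safe.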

Research Frontier: The convergence in late 2025

By Q4 2025 the four paradigms had begun to converge on a unified architecture. The "modern planner" uses structured tool calling (CaP-style) where each tool is a learned skill (SayCan-style) implemented as a VLA conditioned on a sub-instruction (end-to-end VLA), with VoxPoser-style spatial cost fields available as a tool the LLM can invoke when fine spatial reasoning is needed. The four paradigms become four tools in a single hybrid agent. Research papers in 2026 increasingly stop arguing for a single paradigm and start contributing to specific layers of the hybrid stack.

24.12.6 What Does Not Fit the Comparison

Two approaches sit outside this matrix. Inner Monologue (Huang et al., 2022) is an early SayCan extension that closes the loop in natural language: textual feedback from execution (success detection, scene descriptions, human responses) is fed back into the LLM's prompt so it can reflect on the plan and replan; modern instruction-tuned LLMs make this redundant by default. RT-H and similar hierarchical end-to-end models embed a planning stage inside the VLA itself, so the structure of "LLM plans then VLA executes" collapses into a single network with two output heads; this is structurally an end-to-end VLA in our matrix but with extra interpretability through the intermediate output head. Both are worth knowing but do not change the four-corner trade-off space.

Key Takeaway

SayCan, Code-as-Policies, VoxPoser, and end-to-end VLA are four points on a spectrum from structured-discrete to learned-continuous. Each has a regime where it dominates: SayCan for safety-critical small-skill-set tasks, Code-as-Policies for long-horizon multi-step plans, VoxPoser for novel-object spatial reasoning, end-to-end VLA for reactive control on tasks the model was trained on. Production stacks in 2026 are hybrids that use Code-as-Policies as the high-level glue and VLA-as-tool at the bottom, with SayCan and VoxPoser slotted in where their strengths matter.

Self-Check

Q1: You are building a robot that performs medical equipment sterilization in hospitals. Which planner paradigm dominates, and why? Tie your answer to the trade-off matrix in Table 24.12.2.

Show Answer

SayCan dominates this task. The skill vocabulary is small and well-defined (lift instrument, place in autoclave, set cycle, retrieve), and the safety profile demands an explicit, auditable plan rather than a learned end-to-end controller. SayCan's structured skill ranking gives you a discrete plan you can show to a hospital quality engineer and a regulator, and the worst-case behavior is bounded by the affordance function. VLAs lose here because their behavior is opaque and difficult to certify under FDA Quality System Regulation. Code-as-Policies wins on long-horizon tasks but the hospital task is short-horizon; VoxPoser wins on novel-object spatial reasoning but the sterilization skill set is fixed. The trade-off matrix in Table 24.12.2 places SayCan at "high interpretability, low novelty," which matches this regime exactly.

Q2: Sketch the hybrid stack described in this section: which paradigm sits at the top, which provides the primitives, and how do they communicate?

Show Answer

Code-as-Policies sits at the top as the long-horizon glue: an LLM emits a Python program whose body is a sequence of high-level skill calls. The bottom layer is a VLA acting as the executor for each skill call, taking image and instruction in and producing motor commands at 6 Hz. VoxPoser may slot in between when a skill requires spatial grounding that the VLA was not trained on (e.g., "place the cup left of the plate" on novel objects); SayCan slots in when the skill's safety case requires an explicit affordance-ranked sub-plan. Communication is by typed function calls: the Code-as-Policies layer invokes vla.execute(instruction, frames) or voxposer.plan_motion(instruction), gets back a status and optional return value, and uses it as a normal Python expression in subsequent steps.

Q3: For each of the four paradigms, identify one task that exposes its biggest weakness. (Hint: the weakness is always the inverse of the strength in Table 24.12.2.)

Show Answer

SayCan: a free-form home-organization task ("clean my desk"), where the skill vocabulary is too small to capture the long tail of placements and the affordance function gives the planner nothing useful to rank. Code-as-Policies: a reactive task with sub-second control loops (catching a falling glass), where the LLM-emit-code-then-execute latency is fatal. VoxPoser: an articulation-heavy task (opening a refrigerator with a stubborn seal), where the voxel-grid abstraction has no way to represent the joint structure of the door or the required force profile. End-to-end VLA: a task on completely novel objects the model never trained on (a brand of cup unseen in Open X-Embodiment), where the learned policy fails silently and confidently because it has no internal mechanism to recognize the out-of-distribution input.

What's Next

Continue to Section 24.13: Sim-to-Real Gap.

The final section of this chapter, Section 24.13, drops down to the deployment layer: the sim-to-real gap is the dominant obstacle between a working planner and a working robot, and the patterns for closing it (domain randomization, real-world fine-tuning, hardware-in-the-loop validation) are what separate research demos from shipped products.

Further Reading

Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). CoRL 2022, arXiv:2204.01691.

Liang, J., et al. (2023). Code as Policies: Language Model Programs for Embodied Control. ICRA 2023, arXiv:2209.07753.

Huang, W., et al. (2023). VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. CoRL 2023, arXiv:2307.05973.

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Huang, W., et al. (2022). Inner Monologue: Embodied Reasoning through Planning with Language Models. CoRL 2022, arXiv:2207.05608.

Belkhale, S., et al. (2024). RT-H: Action Hierarchies Using Language. RSS 2024, arXiv:2403.01823.