Section 24.7: SayCan: Grounding LLM Plans

"Say what you want, and a value function will tell you whether you can. The robot does the rest."
Reward, Value-Function-Whisperer AI Agent

Big Picture

SayCan (Ahn et al., 2022, arXiv:2204.01691) was the first credible answer to "how do you get an LLM to plan for a real robot?" Its insight was to combine two probabilities that each fail on their own. The LLM scores "what should we try next?" but does not know what the robot can physically do; the robot's value function scores "what can we do here?" but does not know which actions advance the goal. SayCan multiplies the two. The product is an affordance-grounded distribution over next skills, and it became a template for many of the LLM-planner approaches that followed.

Prerequisites

This section assumes the affordance vocabulary from classical robotics (covered briefly in the intro to this chapter) and the policy-gradient value-function basics from Section 0.5. The LLM-as-planner pattern is developed in detail later in the book.

24.7.1 The SayCan Equation

Fun Fact

The flagship SayCan demo at Google in 2022 was a robot fetching a sponge from a kitchen counter, narrated in real time by an LLM. The video went viral, then critics noticed the robot was carefully avoiding any task that required dexterity, novelty, or a non-cooperative human. "Bring me a sponge" became shorthand in robotics circles for the gap between a perfect demo loop and a real Tuesday morning. The acronym SayCan itself splits the credit cleanly: the LLM says what to do, the affordance model says what the robot can actually do, and the magic is in the multiplication.

Given a high-level instruction i (say, "bring me a Coke from the kitchen") and a library of low-level skills S = {s_1, s_2, ..., s_N} (pick up Coke, walk to kitchen, open fridge, etc.), SayCan selects the next skill by scoring each candidate under the joint distribution:

p(s_k | i, history, state) ∝ p_LLM(s_k | i, history) * p_aff(s_k | state)

Code Fragment 24.7.1: The SayCan scoring equation. The probability of picking skill s_k factors into the LLM's "say" score (how plausible the skill description is given the instruction and history) times the affordance model's "can" score (how feasible the skill is from the current state). Multiplying the two filters out plans the LLM prefers but the robot cannot execute.

The left factor p_LLM is the language-model likelihood that skill s_k's natural-language description is the right next step in completing i, given the history of previously chosen skills. The right factor p_aff is the robot's learned value function evaluated at the current state, scoring whether s_k is executable here. Multiplying the two filters out two kinds of bad choice: skills the LLM would prefer but the robot cannot perform (for example, "open the fridge" when the robot is in the bedroom), and skills the robot can perform but the LLM does not favor (easily executable actions that do not advance the goal).
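To make the filtering concrete, here is a minimal numeric sketch of one scoring step. The three skill names and both score vectors are made-up illustrative values, not numbers from the paper; the only thing taken from the text is the rule "multiply, then pick the largest".

```python
# Hypothetical candidates for the instruction "bring me a Coke",
# with the robot currently standing in the bedroom.
skills = ["open the fridge", "go to the kitchen", "pick up the pillow"]
p_llm = [0.60, 0.35, 0.05]   # "say": how relevant the LLM finds each skill
p_aff = [0.02, 0.90, 0.95]   # "can": how feasible each skill is from here

# Unnormalized SayCan score, then normalize over the candidate set.
joint = [say * can for say, can in zip(p_llm, p_aff)]
total = sum(joint)
posterior = [j / total for j in joint]

llm_only = skills[p_llm.index(max(p_llm))]
aff_only = skills[p_aff.index(max(p_aff))]
best = skills[posterior.index(max(posterior))]

for name, p in zip(skills, posterior):
    print(f"{name:20s} {p:.3f}")
print("LLM alone:", llm_only, "| affordance alone:", aff_only, "| product:", best)
```

With these values the LLM alone would choose "open the fridge" (not feasible from the bedroom), the affordance model alone would choose "pick up the pillow" (feasible but irrelevant), and only the product selects "go to the kitchen".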

Key Insight: The "Say" and the "Can"

The mnemonic in the name does the explanation for you. Say is the LLM's prior over what one might do next, derived purely from the instruction and a generic prior over reasonable plans. Can is the robot's affordance: what is physically possible from the current state. SayCan picks the action that the LLM says to do and that the robot can do, by literally multiplying the two scores. The simplicity is the point; everything later in this chapter is a refinement of this product.

24.7.2 The Skill Library: The Hidden Engineering Cost

SayCan presupposes a library of low-level skills, each with two artifacts: a natural-language description that the LLM can score, and a learned value function that can be evaluated on the current observation. In the original paper, the skill library had 551 entries spanning navigation, pick-and-place, and door manipulation on a mobile manipulator from Everyday Robots. Each skill was trained separately with imitation learning (behavioral cloning) on teleoperated demonstrations, roughly 150 to 700 per skill in the breakdown of Table 24.7.1a. The total demonstration cost for the SayCan skill library was on the order of 100,000 trajectories.

Skill category	Number of skills	Demos per skill	Total demos
Pick (per object class)	~75 (one per object)	~500	~37,500
Place (per location)	~120 (per surface/container)	~300	~36,000
Navigation (per location)	~150 (per named room)	~150	~22,500
Door manipulation	~12 (open, close per door type)	~700	~8,400
Total (SayCan original)	~551 (the four rows above sum to ~357)		~100,000+

Table 24.7.1a: The SayCan skill library at Google's Everyday Robot project, 2022. The skill-by-skill imitation-learning pipeline is what made the system work, but also what made it expensive to extend. Adding a new object class meant collecting another 500 demos and retraining the corresponding value function.
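The arithmetic behind the table can be checked in a few lines. The sketch below copies the approximate per-category figures from Table 24.7.1a as given; the helper `marginal_demos` is a hypothetical name introduced here, and its default of 500 demonstrations per new object class comes from the caption above.

```python
# (category, number of skills, demos per skill), copied from Table 24.7.1a.
rows = [
    ("Pick", 75, 500),
    ("Place", 120, 300),
    ("Navigation", 150, 150),
    ("Door manipulation", 12, 700),
]

listed_skills = sum(n for _, n, _ in rows)
listed_demos = sum(n * d for _, n, d in rows)
print(f"skills in listed categories: {listed_skills}")
print(f"demonstrations in listed categories: {listed_demos:,}")


def marginal_demos(new_object_classes: int, demos_per_pick: int = 500) -> int:
    """Extra pick demonstrations needed to support new object classes."""
    return new_object_classes * demos_per_pick


print(f"adding 10 object classes: {marginal_demos(10):,} more pick demos")
```

The listed categories account for about 357 skills and about 104,400 demonstrations, which matches the "~100,000+" total; the gap between 357 and the 551 skills quoted for the full library is not broken down in the table. The linear growth in `marginal_demos` is the bottleneck in miniature: every new object class costs another full round of data collection.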

Warning: The skill-library bottleneck

The most-cited critique of SayCan in 2026 is that the skill library does not scale gracefully. Every new task family (cooking, cleaning, assembly) requires another round of per-skill imitation learning, with all the data-collection cost that implies. This is the bottleneck that VLAs from Chapter 24 were designed to dissolve: a single VLA replaces the entire library with one model that emits motor commands from natural language, eliminating per-skill training. SayCan-style architectures remain important when you need explicit planning over discrete skills (industrial settings, multi-robot coordination), but for single-robot manipulation the modern stack collapses the skill layer into the VLA.

24.7.3 The Reference Implementation

The original SayCan implementation selects skills greedily rather than searching over whole plans. At each step, every skill in the library is scored by the SayCan product; the top-scoring skill is executed and appended to the history, and the loop repeats until a termination skill ("done") is selected. The LLM (originally PaLM-540B; in 2026 reimplementations typically GPT-4o or Claude Sonnet) is prompted with the instruction, a list of available skills, and the history of already-chosen skills. The value function for each candidate skill is evaluated against the current observation.

import numpy as np

class SayCanPlanner:
    def __init__(self, llm, skills):
        self.llm = llm                # needs a logprob(text, prompt=...) method
        self.skills = skills          # list of (description, value_fn, policy) tuples

    def score(self, skill, instruction, history, observation):
        desc, value_fn, _ = skill
        history_str = "\n".join(f"{i+1}. {h}" for i, h in enumerate(history))
        prompt = f"Goal: {instruction}\nPlan so far:\n{history_str}\nNext step:"
        say = self.llm.logprob(desc, prompt=prompt)     # LLM "say" score (log-prob)
        can = np.log(value_fn(observation) + 1e-9)      # affordance "can" score; eps avoids log(0)
        return say + can                                # sum of logs = log of the product

    def plan(self, instruction, observation_fn, max_steps=10):
        history = []
        for _ in range(max_steps):
            obs = observation_fn()
            scores = [(self.score(s, instruction, history, obs), s)
                      for s in self.skills]
            best = max(scores, key=lambda x: x[0])[1]
            if best[0] == "DONE":
                return history
            desc, _, policy = best
            policy.execute()
            history.append(desc)
        return history

Code Fragment 24.7.2: A reference SayCan planner in ~30 lines. The LLM scores skill descriptions against the instruction and history; the affordance function scores them against the observation; we pick the max of the sum (which is the log of the product). The skill library is the input; the entire conceptual content of SayCan is in the score method.
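To watch the greedy loop run end to end without a robot or an LLM, the sketch below wires up a toy world. Everything in it is a hypothetical stand-in: `ScriptedLLM` returns fixed preferences instead of real log-probabilities, `Effect` mutates a dictionary instead of moving a robot, and the affordance values are invented. The planner is re-implemented as a small function with the same interface as the fragment above so that the block runs on its own.

```python
import math


class ScriptedLLM:
    """Hypothetical stand-in for a real LLM; returns fixed preferences."""
    PRIOR = {"pick up the Coke": 0.5, "go to the kitchen": 0.3,
             "bring it to the user": 0.15, "DONE": 0.05}

    def logprob(self, text, prompt):
        done_so_far = prompt.split("Plan so far:")[1]
        if text == "DONE" and "bring it to the user" in done_so_far:
            return math.log(0.9)      # goal reached, prefer stopping
        if text in done_so_far:
            return math.log(0.01)     # discourage repeating a step
        return math.log(self.PRIOR[text])


class Effect:
    """Toy policy: applies a state update instead of moving a robot."""
    def __init__(self, state, **changes):
        self.state, self.changes = state, changes

    def execute(self):
        self.state.update(self.changes)


state = {"location": "bedroom", "holding": False}
skills = [  # (description, value_fn, policy) tuples, as in the fragment above
    ("go to the kitchen",
     lambda s: 0.9 if s["location"] != "kitchen" else 0.05,
     Effect(state, location="kitchen")),
    ("pick up the Coke",
     lambda s: 0.9 if s["location"] == "kitchen" and not s["holding"] else 0.01,
     Effect(state, holding=True)),
    ("bring it to the user",
     lambda s: 0.9 if s["holding"] else 0.01,
     Effect(state, location="user", holding=False)),
    ("DONE", lambda s: 1.0, Effect(state)),
]


def greedy_saycan(llm, skills, instruction, observe, max_steps=10):
    history = []
    for _ in range(max_steps):
        obs = observe()
        steps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(history))
        prompt = f"Goal: {instruction}\nPlan so far:\n{steps}\nNext step:"

        def score(skill):
            desc, value_fn, _ = skill
            return llm.logprob(desc, prompt=prompt) + math.log(value_fn(obs) + 1e-9)

        desc, _, policy = max(skills, key=score)   # greedy: best skill only
        if desc == "DONE":
            break
        policy.execute()
        history.append(desc)
    return history


plan = greedy_saycan(ScriptedLLM(), skills,
                     "bring me a Coke from the kitchen", lambda: state)
print(plan)
```

Note that the scripted LLM always prefers "pick up the Coke" at the start, yet the planner goes to the kitchen first because the pick skill's affordance is near zero in the bedroom. That reordering is the affordance factor doing its job.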

24.7.4 Prompt Engineering for the LLM Side

The original SayCan paper used few-shot examples to teach PaLM what "scoring a skill description" looked like. With instruction-tuned models in 2026 (GPT-4o, Claude Sonnet 4, Gemini Ultra 2), the prompt can be much shorter, but the structure remains the same: provide the goal, the history of executed steps, and a clear request for the LLM to rank the candidate skills. Scores that the model writes out are self-reported estimates rather than token log-likelihoods, so treat them as a ranking signal, not as calibrated probabilities. A prompt that works well in practice is:

SAYCAN_PROMPT = """You are a robot planning assistant. The robot has a fixed library of skills.
Given a goal and the steps already taken, score each candidate skill's likelihood of being the right next step.

Goal: {instruction}

Steps so far:
{history}

Candidate next skills:
{candidate_descriptions}

For each candidate, output a number from 0.0 to 1.0 representing how likely it is to be the right next step.
Output strictly as a JSON object mapping skill name to score, no other text.
"""

Code Fragment 24.7.3: A prompt template that works well with instruction-tuned LLMs in 2026. The JSON-only output constraint plus a fixed skill-name vocabulary makes the LLM side debuggable and lets you swap LLM providers without rewriting the planner.
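A JSON-only reply still has to be parsed defensively before it can be combined with affordance scores. The following is a minimal sketch of that step, assuming the reply is a JSON object mapping skill name to a number; the function name `parse_say_scores` and the `floor` parameter are hypothetical choices made for this illustration, not part of any library.

```python
import json
import math


def parse_say_scores(raw, skill_names, floor=1e-6):
    """Turn the LLM's JSON reply into log 'say' scores for known skills."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping skill name to score")

    scores = {}
    for name in skill_names:               # iterate over OUR vocabulary only,
        value = data.get(name, 0.0)        # so invented skill names are ignored
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            value = 0.0                    # non-numeric entries count as 0
        value = min(max(float(value), 0.0), 1.0)     # clamp to [0, 1]
        scores[name] = math.log(max(value, floor))   # floor avoids log(0)
    return scores


names = ["go to the kitchen", "pick up the Coke", "open the fridge"]
raw = '{"go to the kitchen": 0.8, "pick up the Coke": 0.15, "fly to the moon": 0.9}'
say = parse_say_scores(raw, names)
for name, value in say.items():
    print(f"{name:20s} {value:8.3f}")
```

Iterating over the planner's own skill list, rather than over the keys of the reply, is the design choice that keeps the vocabulary closed: a skill the model invents ("fly to the moon") never reaches the planner, and a skill the model omits ("open the fridge") gets the floor score instead of raising a KeyError.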

Real-World Scenario

The "Coke from the kitchen" canonical example

Consider the SayCan paper's flagship example. Instruction: "I just spilled my coffee, please bring me a Coke and a sponge." A naive LLM expands this into "1. find sponge, 2. wipe spill, 3. find Coke, 4. bring Coke." A naive affordance-only planner picks the closest executable skill each step, which is "walk forward" because that is the highest-value action everywhere. SayCan multiplies the two and gets "1. find Coke (LLM likes it, robot can do it from this room), 2. bring Coke (LLM likes it, robot can do it now that Coke is grasped), 3. find sponge, 4. wipe spill." The plan succeeds because every step is both goal-relevant and executable. Neither factor alone produces this plan; the product does.

24.7.5 What SayCan Got Right and Wrong

What SayCan got right was the separation of concerns. The LLM does world knowledge and goal decomposition, which it is good at; the robot's value functions do executability, which only the robot can know. The product is a clean two-factor model that exposes both factors for inspection: when the planner fails, you can ask "did the LLM rank the right skill highly?" and separately "did the affordance function correctly score the state?" Each is debuggable in isolation. This separation is the conceptual contribution that survived into modern stacks.

What SayCan got wrong was the skill-library assumption. The system needs a closed library of skills with hand-trained value functions, which is exactly the bottleneck VLAs are designed to remove. A 2026 SayCan-style planner uses a much smaller library (10-20 skills, where each is a "VLA conditioned on a sub-instruction"), and the LLM is responsible for decomposing the goal into sub-instructions rather than picking from a discrete list. This is the "SayCan with a VLA backbone" pattern that dominates production multi-step robot deployments today.

Key Insight: Modern SayCan = LLM planner + VLA executor

In 2026, "SayCan" rarely means the original paper's architecture. It means the conceptual pattern: an LLM produces a sub-instruction, a VLA executes it, the robot returns a new observation, the LLM decides the next sub-instruction. The "value function" piece is either absorbed into the VLA (which simply fails the sub-instruction if it cannot execute) or replaced by an affordance prediction from a separate vision model (CLIP-based or VLM-based). The conceptual product p_LLM * p_aff survives; the implementation collapses.
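The closed loop described above can be sketched as a short function. The callables `propose`, `execute`, and `observe` are hypothetical interfaces standing in for an LLM planner, a VLA executor, and a camera; no real model API is assumed, and the stubs at the bottom exist only to exercise the control flow.

```python
from typing import Callable, List


def plan_and_execute(
    goal: str,
    propose: Callable[[str, List[str], object], str],  # LLM planner (hypothetical)
    execute: Callable[[str], bool],                    # VLA executor (hypothetical)
    observe: Callable[[], object],
    max_steps: int = 10,
) -> List[str]:
    """Closed loop: the LLM proposes a sub-instruction, the VLA attempts it."""
    history: List[str] = []
    for _ in range(max_steps):
        sub = propose(goal, history, observe())
        if sub == "DONE":
            break
        ok = execute(sub)
        # The executor's success flag plays the role of the affordance check:
        # failures are written into the history so the planner can react.
        history.append(sub if ok else f"FAILED: {sub}")
    return history


# Stubs: a scripted planner, and an executor whose second attempt fails.
script = iter(["go to the kitchen", "pick up the Coke", "pick up the Coke", "DONE"])
attempts = {"n": 0}


def fake_execute(sub: str) -> bool:
    attempts["n"] += 1
    return attempts["n"] != 2


trace = plan_and_execute("bring me a Coke",
                         lambda goal, hist, obs: next(script),
                         fake_execute,
                         lambda: None)
print(trace)
```

In this variant the affordance signal arrives after the attempt rather than before it, which is the trade-off of absorbing the value function into the executor: there is no score to multiply, only a failure to recover from.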

24.7.6 Where SayCan-Style Planners Still Win

SayCan-style hierarchical planners remain competitive against pure end-to-end VLAs in three regimes. First, long-horizon tasks (more than ~10 sub-steps) where a flat VLA's context window cannot hold the full plan. Second, multi-robot coordination (covered in Section 24.10) where the planner must reason about which robot does which step. Third, safety-critical settings where each sub-step needs to be inspectable and pre-approved before execution; this is the operating regime for warehouse, manufacturing, and surgical robotics in 2026.

Research Frontier: The "plan = code" alternative

Section 24.8 covers Code-as-Policies, the competing approach where the LLM emits executable Python rather than ranking a fixed skill library. The two paradigms had largely converged by late 2025: modern planners (Inner Monologue, AutoRT, Stanford ALFWorld-2) use a hybrid where the LLM emits a structured plan that looks like Python (assignment, control flow) but compiles into calls against a skill library. SayCan's "rank a discrete skill" formulation is the special case where the plan has no control flow; Code-as-Policies' "emit arbitrary Python" formulation is the general case. The product structure from this section survives in both.

Key Takeaway

SayCan grounds an LLM's plan in robot affordances by multiplying two probabilities: p_LLM(skill | goal, history) and p_aff(skill | state). The product gives plans that are simultaneously goal-relevant and executable. The two-factor decomposition still drives modern hierarchical planners; the original closed-skill-library assumption has largely been replaced by VLA-based executors that absorb the affordance check.

Self-Check

Q1: Write the SayCan product on the log scale and explain why it reduces to addition. Why does the log scale matter for numerical stability when many candidate skills have very low p_aff?

Show Answer

The SayCan score is $p_{LLM}(s) \cdot p_{aff}(s)$. Taking logs gives $\log p_{LLM}(s) + \log p_{aff}(s)$; the product becomes a sum because $\log(ab) = \log a + \log b$. The log scale matters because both factors can be tiny: $p_{aff}$ is often $10^{-6}$ or smaller for candidate skills that are not currently executable, and $p_{LLM}$ is itself a product of per-token probabilities, so the raw product can underflow float32. Working in log space keeps everything within the representable range, lets you compare scores by simple addition, and matches how LLM scoring APIs natively return log-probabilities. It also pairs naturally with adding a small $\epsilon$ before taking the log (as Code Fragment 24.7.2 does), which avoids $-\infty$ for skills whose value function returns exactly zero.
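The underflow is easy to reproduce. The sketch below uses Python's built-in float (float64) so that it needs no dependencies; float32 has a much narrower range and fails sooner. The per-token probability, the sequence length, and the affordance values are hypothetical and chosen only to trigger the effect.

```python
import math

token_probs = [0.02] * 200   # hypothetical per-token probabilities for one skill
p_aff_a = 1e-6               # hypothetical affordance of skill A
p_aff_b = 1e-3               # hypothetical affordance of skill B (more feasible)


def naive_score(p_aff, probs):
    """Multiply everything in probability space."""
    score = p_aff
    for p in probs:
        score *= p           # shrinks toward, then reaches, exactly 0.0
    return score


def log_score(p_aff, probs):
    """Same quantity in log space: sum of logs instead of product."""
    return math.log(p_aff) + sum(math.log(p) for p in probs)


naive_a, naive_b = naive_score(p_aff_a, token_probs), naive_score(p_aff_b, token_probs)
log_a, log_b = log_score(p_aff_a, token_probs), log_score(p_aff_b, token_probs)

print("naive:", naive_a, naive_b)   # both collapse to zero, ranking is lost
print("log:  ", log_a, log_b)       # finite, and B still ranks above A
```

In probability space both skills score exactly zero and cannot be ranked; in log space they remain finite and skill B, the more feasible one, keeps its lead.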

Q2: You have a SayCan planner with 100 skills. Estimate the per-step LLM cost in (a) the original paper's autoregressive-scoring formulation (one prompt per skill) and (b) the modern JSON-output formulation in Code Fragment 24.7.3 (one prompt total). Which dominates the wall-clock latency budget?

Show Answer

(a) Per-skill scoring: 100 separate LLM calls, each evaluating the log-probability of one skill description against the prompt. At around 300 ms per call (modern frontier API with cache), that is 30 seconds per planning step if the calls are issued sequentially, dominated by round-trip latency. (b) JSON-output formulation: one call that returns scores for all 100 skills in a single response, taking around 1 to 2 seconds wall-clock including output token generation. The JSON-output approach is roughly 15 to 30 times faster, which is the difference between a planner that re-plans every 30 seconds (unusable for closed-loop control) and one that re-plans every 2 seconds (acceptable for slow manipulation). Wall-clock latency is dominated by the LLM step in both cases, but only (b) makes that latency tolerable.
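The estimate above is a one-line calculation, shown here in integer milliseconds to avoid floating-point noise. The figures are the ones quoted in the answer (300 ms per call, 1 to 2 seconds for the single call); the sketch assumes the per-skill calls run one after another, and `step_latency_ms` is a helper name introduced for illustration.

```python
def step_latency_ms(num_calls: int, ms_per_call: int) -> int:
    """Latency of one planning step when calls run one after another."""
    return num_calls * ms_per_call


# Formulation (a): one LLM call per skill.
per_skill_ms = step_latency_ms(num_calls=100, ms_per_call=300)

# Formulation (b): one call total, with a low and a high latency estimate.
single_call_ms = (1000, 2000)

speedup_low = per_skill_ms / single_call_ms[1]
speedup_high = per_skill_ms / single_call_ms[0]

print(f"(a) {per_skill_ms / 1000:.0f} s per planning step")
print(f"(b) is {speedup_low:.0f}x to {speedup_high:.0f}x faster")
```

Issuing the per-skill calls concurrently would shrink the gap, so the sequential assumption is what makes (a) the worst case.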

Q3: Sketch a failure case where SayCan's product score still picks the wrong skill. Tie your failure case to either the LLM side or the affordance side, and propose a third factor that would catch it.

Show Answer

Failure case: the user asks "bring me the Coke" in a kitchen with both a Coke can and a similarly-shaped iced-tea can on the counter. The LLM ranks "pick up the Coke can" very highly. The affordance function for that skill is high because the object is in reach. SayCan picks "pick up the Coke can", but the perception module misidentifies the iced-tea can as a Coke can, and the robot brings the wrong drink. The failure is on neither the LLM side (it correctly wants Coke) nor the affordance side (the skill is executable); it is in the perception-to-skill grounding. A third factor that would catch this is a verification score from a separate VLM that takes the candidate target object plus the instruction and asks "is this really a Coke?". Multiplying all three (LLM x affordance x verification) introduces an explicit grounding check that catches misidentifications the affordance function silently swallows.
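The three-factor idea can be sketched with the same log-space arithmetic as the planner. All numbers below are hypothetical: two candidate grasps get nearly identical "say" and "can" scores, and a made-up verification score (standing in for a VLM's answer to "is this really a Coke?") is what separates them.

```python
import math


def log_score(*factors, eps=1e-9):
    """Sum of logs of any number of probability-like factors."""
    return sum(math.log(f + eps) for f in factors)


# Hypothetical scores per candidate: the left can is actually iced tea.
candidates = {
    "pick up the can on the left":  {"say": 0.9, "can": 0.9, "verify": 0.05},
    "pick up the can on the right": {"say": 0.9, "can": 0.8, "verify": 0.95},
}

two_factor = max(
    candidates,
    key=lambda k: log_score(candidates[k]["say"], candidates[k]["can"]),
)
three_factor = max(
    candidates,
    key=lambda k: log_score(*candidates[k].values()),
)

print("say x can:         ", two_factor)
print("say x can x verify:", three_factor)
```

With two factors the slightly more reachable left can wins; adding the verification factor flips the choice to the right can, which is the behavior the answer argues for.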

What's Next

Continue to Section 24.8: Code-as-Policies.

Section 24.8 moves to the second major LLM-planning paradigm: Code-as-Policies. Instead of ranking a fixed skill list, the LLM writes the plan as Python code that uses the skills as function calls. The generalization is significant; the failure modes are different.

Further Reading

Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). CoRL 2022, arXiv:2204.01691.

Huang, W., et al. (2022). Inner Monologue: Embodied Reasoning through Planning with Language Models. CoRL 2022, arXiv:2207.05608.

Liang, J., et al. (2023). Code as Policies: Language Model Programs for Embodied Control. ICRA 2023, arXiv:2209.07753.

Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. ICML 2023, arXiv:2303.03378.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Ahn, M., et al. (2024). AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. arXiv:2401.12963.