"Say what you want, and a value function will tell you whether you can. The robot does the rest."
Reward, Value-Function-Whisperer AI Agent
SayCan (Ahn et al., 2022, arXiv:2204.01691) was the first credible answer to "how do you get an LLM to plan for a real robot?" Its insight was to combine two probabilities that, separately, each fail. The LLM scores "what should we try next?" but does not know what the robot can physically do; the robot's value function scores "what can we do here?" but does not know which actions advance the goal. SayCan multiplies the two. The product is the affordance-grounded plan distribution that became the template for every LLM-planner approach since.
Prerequisites
This section assumes the affordance vocabulary from classical robotics (covered briefly in the intro to this chapter) and the policy-gradient value-function basics from Section 0.5. The LLM-as-planner pattern is developed in detail later in the book.
24.7.1 The SayCan Equation
The flagship SayCan demo at Google in 2022 was a robot fetching a sponge from a kitchen counter, narrated in real time by an LLM. The video went viral, then critics noticed the robot was carefully avoiding any task that required dexterity, novelty, or a non-cooperative human. "Bring me a sponge" became shorthand in robotics circles for the gap between a perfect demo loop and a real Tuesday morning. The acronym SayCan itself splits the credit cleanly: the LLM says what to do, the affordance model says what the robot can actually do, and the magic is in the multiplication.
Given a high-level instruction i (say, "bring me a Coke from the kitchen") and a library of low-level skills S = {s_1, s_2, ..., s_N} (pick up Coke, walk to kitchen, open fridge, etc.), SayCan selects the next skill by scoring each candidate under the joint distribution:
p(s_k | i, state) ~ p_LLM(s_k | i, history) * p_aff(s_k | state)
The left factor p_LLM is the language-model likelihood that skill s_k's natural-language description is the right next step toward completing i, given the history of previously chosen skills. The right factor p_aff is the robot's learned value function evaluated at the current state, scoring whether s_k is likely to succeed here. Multiplying the two filters out two kinds of bad choices: skills the LLM prefers but the robot cannot perform (for example, "open the fridge" when the robot is in the bedroom), and skills the robot can perform but that do not advance the goal (greedily executable actions to which the LLM assigns low probability).
The mnemonic in the name does the explanation for you. Say is the LLM's distribution over what one might do next, derived from the instruction, the plan so far, and the model's general knowledge of reasonable plans. Can is the robot's affordance: what is physically possible from the current state. SayCan picks the action that the LLM says to do and that the robot can do, by literally multiplying the two scores. The simplicity is the point; everything later in this chapter is a refinement of this product.
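To make the product concrete, here is a minimal sketch of one decision step. The three skills and all of the probabilities are invented for illustration (a robot standing in the bedroom with a sponge within reach); they are not taken from the paper.

```python
# Hypothetical scores for a single decision step; the numbers are illustrative only.
p_llm = {"open the fridge": 0.60, "go to the kitchen": 0.30, "pick up the sponge": 0.10}
p_aff = {"open the fridge": 0.02, "go to the kitchen": 0.80, "pick up the sponge": 0.95}

def saycan_scores(p_llm, p_aff):
    """Return the normalized product p_llm * p_aff for every skill."""
    joint = {skill: p_llm[skill] * p_aff[skill] for skill in p_llm}
    total = sum(joint.values())
    return {skill: value / total for skill, value in joint.items()}

scores = saycan_scores(p_llm, p_aff)
say_only = max(p_llm, key=p_llm.get)    # what the LLM alone would choose
can_only = max(p_aff, key=p_aff.get)    # what the affordance model alone would choose
best = max(scores, key=scores.get)      # what the product chooses

print(say_only, "|", can_only, "|", best)
```

With these numbers the LLM alone prefers "open the fridge" (not feasible from the bedroom), the affordance model alone prefers "pick up the sponge" (feasible but not the most useful step), and the product selects "go to the kitchen".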
24.7.2 The Skill Library: The Hidden Engineering Cost
SayCan presupposes a library of low-level skills, each with two artifacts: a natural-language description that the LLM can score, and a learned value function that can be evaluated on the current observation. In the original paper, the skill library had 551 entries spanning navigation, pick-and-place, and door manipulation on a mobile Everyday Robot. Each skill was trained separately with behavioral cloning (a form of imitation learning) on roughly 700 to 7,000 teleoperated demonstrations. The total demonstration cost for the SayCan skill library was on the order of 100,000 trajectories.
| Skill category | Number of skills | Demos per skill | Total demos |
|---|---|---|---|
| Pick (per object class) | ~75 (one per object) | ~500 | ~37,500 |
| Place (per location) | ~120 (per surface/container) | ~300 | ~36,000 |
| Navigation (per location) | ~150 (per named room) | ~150 | ~22,500 |
| Door manipulation | ~12 (open, close per door type) | ~700 | ~8,400 |
| Total (SayCan original) | ~551 | n/a | ~100,000+ |
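The "Total demos" column is just skills times demos per skill. A short sketch of that arithmetic, using the approximate figures from the four category rows (the dictionary name is an illustrative choice):

```python
# (number of skills, demos per skill) copied from the category rows of the table;
# every figure is approximate.
skill_table = {
    "pick": (75, 500),
    "place": (120, 300),
    "navigation": (150, 150),
    "door manipulation": (12, 700),
}

demos_per_category = {
    name: n_skills * demos for name, (n_skills, demos) in skill_table.items()
}
total_demos = sum(demos_per_category.values())

for name, demos in demos_per_category.items():
    print(f"{name}: {demos}")
print("total:", total_demos)
```

The four category rows add up to roughly 104,000 demonstrations, in line with the "~100,000+" figure in the total row.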
The most-cited critique of SayCan in 2026 is that the skill library does not scale gracefully. Every new task family (cooking, cleaning, assembly) requires another round of per-skill imitation learning, with all the data-collection cost that implies. This is the bottleneck that VLAs from Chapter 24 were designed to dissolve: a single VLA replaces the entire library with one model that emits motor commands from natural language, eliminating per-skill training. SayCan-style architectures remain important when you need explicit planning over discrete skills (industrial settings, multi-robot coordination), but for single-robot manipulation the modern stack collapses the skill layer into the VLA.
24.7.3 The Reference Implementation
The original SayCan implementation is a greedy loop rather than a beam search: at every step each candidate skill is scored by the SayCan product, the top-scoring skill is executed, and its description is appended to the history. The LLM (originally PaLM-540B; in 2026 reimplementations typically GPT-4o or Claude Sonnet) is prompted with the instruction, a list of available skills, and the history of already-chosen skills. The value function for each candidate skill is evaluated against the current observation. Planning stops when the termination skill scores highest or the step budget runs out. A beam-search variant would instead carry the top-K continuations under the joint score forward to the next iteration.
```python
import numpy as np


class SayCanPlanner:
    """Greedy SayCan planner: score every skill, execute the best one, repeat."""

    def __init__(self, llm, skills):
        self.llm = llm        # anything with a logprob(text, prompt=...) method
        self.skills = skills  # list of (description, value_fn, policy) tuples

    def score(self, skill, instruction, history, observation):
        desc, value_fn, _ = skill
        history_str = "\n".join(f"{i+1}. {h}" for i, h in enumerate(history))
        prompt = f"Goal: {instruction}\nPlan so far:\n{history_str}\nNext step:"
        say = self.llm.logprob(desc, prompt=prompt)  # LLM "say" score (log-probability)
        can = np.log(value_fn(observation) + 1e-9)   # affordance "can" score (log of value)
        return say + can                             # sum of logs = log of the product

    def plan(self, instruction, observation_fn, max_steps=10):
        history = []
        for _ in range(max_steps):
            obs = observation_fn()
            scores = [(self.score(s, instruction, history, obs), s)
                      for s in self.skills]
            best = max(scores, key=lambda x: x[0])[1]
            if best[0] == "DONE":   # the termination skill won: stop planning
                return history
            desc, _, policy = best
            policy.execute()        # run the low-level policy on the robot
            history.append(desc)
        return history
```
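To see the loop terminate end to end without a robot or an LLM, the following self-contained sketch replaces both factors with toy stand-ins. Everything here is hypothetical: `toy_llm_prob` fakes the language model with a fixed plan, `toy_affordance` fakes the value functions with hand-written rules over a two-field world state, and treating "done" as always executable is an assumption of this sketch.

```python
import math

SKILLS = ["go to the kitchen", "pick up the coke", "bring it to the user", "done"]

def toy_llm_prob(skill, instruction, history):
    """Stand-in for the LLM: favors the first step of a fixed plan not yet taken."""
    remaining = [s for s in SKILLS if s not in history]
    return 0.85 if remaining and skill == remaining[0] else 0.05

def toy_affordance(skill, state):
    """Stand-in for the value functions: hand-written success probabilities."""
    if skill == "go to the kitchen":
        return 0.9 if state["room"] != "kitchen" else 0.1
    if skill == "pick up the coke":
        return 0.9 if state["room"] == "kitchen" and not state["holding"] else 0.01
    if skill == "bring it to the user":
        return 0.9 if state["holding"] else 0.01
    return 1.0  # assumption: the termination skill is always executable

def execute(skill, state):
    """Toy world dynamics: each skill deterministically updates the state."""
    if skill == "go to the kitchen":
        state["room"] = "kitchen"
    elif skill == "pick up the coke":
        state["holding"] = True
    elif skill == "bring it to the user":
        state["room"] = "living room"
        state["holding"] = False

def greedy_saycan(instruction, state, max_steps=10):
    """Pick the skill with the highest log(say) + log(can) until 'done' wins."""
    history = []
    for _ in range(max_steps):
        scored = {
            s: math.log(toy_llm_prob(s, instruction, history))
               + math.log(toy_affordance(s, state))
            for s in SKILLS
        }
        best = max(scored, key=scored.get)
        if best == "done":
            break
        execute(best, state)
        history.append(best)
    return history

state = {"room": "living room", "holding": False}
plan = greedy_saycan("bring me a Coke from the kitchen", state)
print(plan)
```

The affordance scores change as the state changes (picking up the Coke only becomes feasible once the robot is in the kitchen), and the LLM scores change as the history grows, which is why the same two functions yield a different winner at every step.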
24.7.4 Prompt Engineering for the LLM Side
The original SayCan paper used few-shot examples to teach PaLM what "scoring a skill description" looked like. With instruction-tuned models in 2026 (GPT-4o, Claude Sonnet 4, Gemini Ultra 2), the prompt can be much shorter, but the structure remains the same: provide the goal, the history of executed steps, and a clear request for the LLM to score the candidate skills. A prompt that works well in practice is:
```python
SAYCAN_PROMPT = """You are a robot planning assistant. The robot has a fixed library of skills.
Given a goal and the steps already taken, score each candidate skill's likelihood of being the right next step.

Goal: {instruction}

Steps so far:
{history}

Candidate next skills:
{candidate_descriptions}

For each candidate, output a number from 0.0 to 1.0 representing how likely it is to be the right next step.
Output strictly as a JSON object mapping skill name to score, no other text.
"""
```
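A prompt like this returns self-reported scores as JSON rather than token log-probabilities, so the reply has to be parsed and validated before it is multiplied by the affordances. The sketch below shows one way to do that; the function name, the example reply, and the affordance values are all hypothetical, and the clamping and the handling of unknown skills are design assumptions rather than anything prescribed by the paper.

```python
import json

def combine_with_affordances(llm_reply, affordances):
    """Parse the LLM's JSON scores and multiply them by the affordance values.

    Skills missing from the reply get a "say" score of 0; names in the reply
    that are not in the skill library are ignored.
    """
    raw = json.loads(llm_reply)
    joint = {}
    for skill, can in affordances.items():
        say = float(raw.get(skill, 0.0))
        say = min(max(say, 0.0), 1.0)   # clamp to the range the prompt asked for
        joint[skill] = say * can
    return joint

# Hypothetical model reply: note the hallucinated skill that is not in the library.
reply = '{"find a coke": 0.7, "find a sponge": 0.2, "fly to the moon": 0.9}'
affordances = {"find a coke": 0.8, "find a sponge": 0.9, "done": 1.0}

joint = combine_with_affordances(reply, affordances)
best = max(joint, key=joint.get)
print(best, joint)
```

Iterating over the affordance table rather than over the reply is what keeps hallucinated skill names out of the candidate set. Self-reported numbers are not calibrated probabilities, so scores obtained this way are best treated as a ranking signal.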
Consider the SayCan paper's flagship example. Instruction: "I just spilled my coffee, please bring me a Coke and a sponge." A naive LLM expands this into "1. find sponge, 2. wipe spill, 3. find Coke, 4. bring Coke." A naive affordance-only planner picks the closest executable skill each step, which is "walk forward" because that is the highest-value action everywhere. SayCan multiplies the two and gets "1. find Coke (LLM likes it, robot can do it from this room), 2. bring Coke (LLM likes it, robot can do it now that Coke is grasped), 3. find sponge, 4. wipe spill." The plan succeeds because every step is both goal-relevant and executable. Neither factor alone produces this plan; the product does.
24.7.5 What SayCan Got Right and Wrong
What SayCan got right was the separation of concerns. The LLM does world knowledge and goal decomposition, which it is good at; the robot's value functions do executability, which only the robot can know. The product is a clean two-factor model that exposes both factors for inspection: when the planner fails, you can ask "did the LLM rank the right skill highly?" and separately "did the affordance function correctly score the state?" Each is debuggable in isolation. This separation is the conceptual contribution that survived into modern stacks.
What SayCan got wrong was the skill-library assumption. The system needs a closed library of skills with hand-trained value functions, which is exactly the bottleneck VLAs are designed to remove. A 2026 SayCan-style planner uses a much smaller library (10-20 skills, where each is a "VLA conditioned on a sub-instruction"), and the LLM is responsible for decomposing the goal into sub-instructions rather than picking from a discrete list. This is the "SayCan with a VLA backbone" pattern that dominates production multi-step robot deployments today.
In 2026, "SayCan" rarely means the original paper's architecture. It means the conceptual pattern: an LLM produces a sub-instruction, a VLA executes it, the robot returns a new observation, the LLM decides the next sub-instruction. The "value function" piece is either absorbed into the VLA (which simply fails the sub-instruction if it cannot execute) or replaced by an affordance prediction from a separate vision model (CLIP-based or VLM-based). The conceptual product p_LLM * p_aff survives; the implementation collapses.
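The loop described above can be sketched in a few lines. The `propose`, `execute`, and `observe` callables are hypothetical interfaces standing in for the LLM, the VLA, and the robot's sensors; the stubs below only simulate them, including one failed attempt, to show how a failure report takes the place of an explicit affordance check.

```python
def hierarchical_loop(goal, propose, execute, observe, max_steps=10):
    """LLM proposes a sub-instruction, the executor runs it, and the loop repeats."""
    history = []                      # list of (sub_instruction, succeeded) pairs
    obs = observe()
    for _ in range(max_steps):
        sub = propose(goal, history, obs)
        if sub is None:               # planner signals that the goal is complete
            break
        succeeded = execute(sub)      # the executor reports failure instead of a value
        history.append((sub, succeeded))
        obs = observe()
    return history

# --- toy stand-ins (assumptions for illustration only) ---
STEPS = ["go to the counter", "pick up the sponge", "wipe the spill"]
attempts = {"n": 0}

def propose(goal, history, obs):
    """Fake LLM: return the first step that has not yet succeeded."""
    done = [s for s, ok in history if ok]
    remaining = [s for s in STEPS if s not in done]
    return remaining[0] if remaining else None

def execute(sub):
    """Fake VLA: every call succeeds except the second one."""
    attempts["n"] += 1
    return attempts["n"] != 2

def observe():
    return {"attempts": attempts["n"]}

trace = hierarchical_loop("clean up the spill", propose, execute, observe)
for sub, ok in trace:
    print(("ok  " if ok else "FAIL"), sub)
```

In this trace the failed "pick up the sponge" attempt is simply retried, because the planner sees the failure in the history on its next call.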
24.7.6 Where SayCan-Style Planners Still Win
SayCan-style hierarchical planners remain competitive against pure end-to-end VLAs in three regimes. First, long-horizon tasks (more than ~10 sub-steps) where a flat VLA's context window cannot hold the full plan. Second, multi-robot coordination (covered in Section 24.10) where the planner must reason about which robot does which step. Third, safety-critical settings where each sub-step needs to be inspectable and pre-approved before execution; this is the operating regime for warehouse, manufacturing, and surgical robotics in 2026.
Section 24.8 covers Code-as-Policies, the competing approach where the LLM emits executable Python rather than ranking a fixed skill library. The two paradigms largely converged in late 2025: modern planners (Inner Monologue, AutoRT, Stanford ALFWorld-2) use a hybrid where the LLM emits a structured plan that looks like Python (assignment, control flow) but compiles into calls against a skill library. SayCan's "rank a discrete skill" formulation is the special case where the plan has no control flow; Code-as-Policies' "emit arbitrary Python" formulation is the general case. The product structure from this section survives in both.
Key Takeaway
SayCan grounds an LLM's plan in robot affordances by multiplying two probabilities: p_LLM(skill | goal, history) and p_aff(skill | state). The product gives plans that are simultaneously goal-relevant and executable. The conceptual factorization still drives modern hierarchical planners; the original closed-skill-library assumption has been replaced by VLA-based executors that absorb the affordance check.
Section 24.8 moves to the second major LLM-planning paradigm: Code-as-Policies. Instead of ranking a fixed skill list, the LLM writes the plan as Python code that uses the skills as function calls. The generalization is significant; the failure modes are different.