Code-as-Policies

Section 24.8

"When the planner is the programmer, every robot is a chat-completion away from a new behavior."

TensorTensor, Compile-And-Run AI Agent

Big Picture

Code-as-Policies (Liang et al., 2023, arXiv:2209.07753) generalized SayCan by replacing "rank a skill from a fixed list" with "write Python code that uses skills as function calls". The LLM emits an executable program; the robot runtime executes it line by line, with the LLM's program calling into a library of perception and motor primitives. This unlocks loops, conditionals, recursion, and arbitrary composition that SayCan's flat skill ranking cannot express. As of 2026 it is the dominant paradigm for high-level planning in research robotics and an increasingly common pattern in production.

Prerequisites

This section assumes the LLM-as-planner pattern from Section 24.7 and basic Python control-flow fluency. Tool-use and code-generation patterns are covered in detail later in the book.

24.8.1 The Paradigm Shift: Plan as Program

Fun Fact

Code-as-Policies (Liang et al., 2023) was developed at Google Brain Robotics and treated robot programs as Python. The team reportedly debated whether to use a domain-specific language and settled on plain Python because "LLMs already know Python"; this turned out to be the correct call by a wide margin. The same team later folded into Google DeepMind's broader robotics agent work, which is why Gemini Robotics in 2026 still feels structurally like a Code-as-Policies descendant.

SayCan's plan is a list of skill names: ["go to kitchen", "pick up Coke", "go to user", "place Coke"]. Code-as-Policies' plan is a Python program:

# LLM-generated plan for "bring me a Coke and clean up the spill"
go_to("kitchen")
coke = find_object("Coke can")
pick_up(coke)
sponge = find_object("sponge")
pick_up(sponge)
go_to("user")
place(coke, "table")
spill = find_region("spill on table")
wipe(sponge, spill)
Code Fragment 24.8.1: A Code-as-Policies plan for "bring me a Coke and clean up the spill". The LLM emits Python that calls into a robot skill API (go_to, find_object, find_region, pick_up, place, wipe). Unlike a SayCan list of skill names, this is executable source that supports variables, references, and (later in the section) loops and conditionals.

The shift looks small but is structurally profound. The LLM is no longer producing natural-language steps that a separate module must interpret. It is producing source code that runs against an API. The runtime executes the code; the LLM's job is the same as a programmer's: pick the right function calls, in the right order, with the right arguments.

Key Insight: The robot is now an API client

Code-as-Policies makes the robot indistinguishable from any other API-based system. The LLM's job is to write a Python script against a fixed library; the library happens to actuate motors and read cameras instead of querying a database. Every tool-use pattern from Part VI (function calling, MCP servers, ReAct) applies unchanged. This is why the team that built Code-as-Policies (Google Brain Robotics) later folded into the broader function-calling agent community: the technical problem is the same. The robot is an agent whose tools are physical actions.

24.8.2 The Skill Library as an API Surface

The skill library in Code-as-Policies is a Python API. Each skill is a function the LLM can call. The API is documented in the LLM's prompt as docstrings, so the LLM knows what functions exist and what arguments they take. Designing this API is the load-bearing engineering work: the API is the bridge between language and motor commands.

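# Placeholder handle types and error classes so the stubs below import cleanly.
class ObjectHandle: ...
class RegionHandle: ...
class LocationNotFoundError(Exception): ...
class ObjectNotFoundError(Exception): ...
class GraspFailedError(Exception): ...
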
def go_to(location: str) -> None:
    """Navigate the robot to a named location.

    Valid locations include: kitchen, living_room, bedroom, hallway, charging_dock.
    Raises LocationNotFoundError if the location is unknown.
    """

def find_object(description: str, timeout: float = 5.0) -> ObjectHandle:
    """Visually locate the object matching the natural-language description.

    Returns a handle that can be passed to pick_up(), place(), etc.
    Raises ObjectNotFoundError if no matching object is visible within timeout.
    """

def pick_up(obj: ObjectHandle) -> None:
    """Grasp the object and lift it off its current support surface.

    Raises GraspFailedError on failure.
    """

def place(obj: ObjectHandle, target: str) -> None:
    """Place the currently-grasped object at a named target (table, shelf, etc.)."""

def wipe(tool: ObjectHandle, region: RegionHandle) -> None:
    """Wipe a 2D region on a flat surface using the held tool (e.g. a sponge)."""
Code Fragment 24.8.2: A representative Code-as-Policies skill API. The docstrings double as LLM documentation, the type annotations enable runtime checking, and the exception classes give the LLM something concrete to handle. This is the same API design that goes into any well-built tool-use schema.
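The caption's point that type annotations enable runtime checking can be made concrete with a small decorator. The following is a minimal sketch, not the original system's implementation: the `checked` decorator and the stand-in `ObjectHandle` class are assumptions introduced here for illustration. The check is deliberately strict (an int passed where a float is annotated would be rejected) and only covers plain class annotations.

```python
import functools
import inspect

class ObjectHandle:
    """Hypothetical stand-in for the perception system's object handle."""
    def __init__(self, label: str):
        self.label = label

def checked(fn):
    """Reject calls whose arguments do not match the function's class annotations."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)  # raises TypeError on wrong arity
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # unannotated parameter: nothing to check
            # Only plain classes are checked; generics such as list[str] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{fn.__name__}() argument '{name}' must be "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return fn(*args, **kwargs)
    return wrapper

@checked
def pick_up(obj: ObjectHandle) -> None:
    """Grasp the object (stub: a real skill would command the gripper)."""
    return None
```

With this wrapper in place, a generated call such as `pick_up("Coke can")` fails with a readable TypeError before any motor command is issued, and that message is exactly the kind of feedback a retry loop can hand back to the LLM.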

24.8.3 Runtime: LLM as a Just-in-Time Programmer

The Code-as-Policies runtime gives the LLM the instruction, the skill API (as a stub-only module), a small number of in-context examples, and asks it to emit a Python program. The runtime then executes the program in a sandboxed environment that has the real skill implementations available. Crucially, when a skill raises an exception (object not found, grasp failed), control returns to the LLM, which can examine the exception and emit a corrective program. This is the same retry-with-error-feedback pattern from Chapter 27 on tool use, applied to robotics.

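import inspect

class CodeAsPolicyRuntime:
    def __init__(self, llm, skill_module, max_retries=3):
        self.llm = llm
        self.skills = skill_module
        self.max_retries = max_retries

    def run(self, instruction: str, scene_description: str) -> bool:
        prompt = self._build_prompt(instruction, scene_description)
        for attempt in range(self.max_retries):
            program = self.llm.complete(prompt)
            try:
                # In-process exec keeps the example short; it is not a sandbox.
                exec(program, dict(vars(self.skills)))
                return True
            except Exception as e:
                # Feed the error back so the next attempt can correct it.
                prompt += (
                    f"\n# Attempt {attempt + 1} raised: {type(e).__name__}: {e}"
                    "\n# Please write a corrected program:\n"
                )
        return False

    def _skill_docs(self) -> str:
        # Render each public function in the skill module as a signature plus docstring.
        stubs = []
        for name, fn in inspect.getmembers(self.skills, inspect.isfunction):
            if not name.startswith("_"):
                stubs.append(f'def {name}{inspect.signature(fn)}:\n    """{inspect.getdoc(fn)}"""')
        return "\n\n".join(stubs)

    def _build_prompt(self, instruction: str, scene: str) -> str:
        # Plain concatenation: textwrap.dedent would mis-indent the multi-line skill docs.
        return (
            "You are programming a household robot. Available functions:\n\n"
            f"{self._skill_docs()}\n\n"
            f"Current scene: {scene}\n"
            f"Instruction: {instruction}\n\n"
            "Write a Python program that completes the instruction. "
            "Use only the functions listed above.\n"
            "Output the program only, no explanation.\n"
        )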
Code Fragment 24.8.3: The Code-as-Policies runtime. On a skill exception, the runtime appends the exception type and message to the prompt and asks the LLM to write a corrected program. This retry-with-error loop is the same pattern that coding agents such as Claude Code and Cursor rely on; the only difference is that the "compile errors" come from a robot rather than from a Python interpreter on a laptop.

Warning: Sandboxing is non-optional

Executing LLM-generated code on the same Python interpreter that controls a 30 kg robot is a security and safety nightmare. Production Code-as-Policies deployments run the LLM-generated program in a tightly restricted sandbox: no network, no filesystem access, only the skill API in scope, no exec / eval / __import__ nested calls. The sandbox is typically a separate process with the skill API exposed over gRPC. Mistakes here have already caused real incidents in academic-lab deployments; do not skip this layer.
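One inexpensive layer that can sit in front of the sandbox is a static check of the generated program before it runs. The sketch below uses Python's standard `ast` module to reject imports, dunder attribute access, and calls to anything outside an allowlist. The `validate_program` function and the `ALLOWED_CALLS` set are assumptions made for illustration; a static filter like this complements process isolation and does not replace it.

```python
import ast

# Assumed allowlist: the skill names used earlier in this section.
ALLOWED_CALLS = {"go_to", "find_object", "pick_up", "place", "wipe"}

def validate_program(source: str, allowed=ALLOWED_CALLS) -> list[str]:
    """Return a list of policy violations; an empty list means the program passed."""
    problems = []
    tree = ast.parse(source)  # raises SyntaxError on malformed programs
    # Helper functions the program defines itself are allowed call targets.
    local_defs = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append(f"line {node.lineno}: imports are not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"line {node.lineno}: dunder attribute access")
        elif isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name):
                problems.append(f"line {node.lineno}: only direct calls by name are allowed")
            elif node.func.id not in allowed | local_defs:
                problems.append(f"line {node.lineno}: unknown function {node.func.id}()")
    return problems
```

The policy here is intentionally strict: builtins such as `len` or `sorted` and method calls such as `blocks.sort()` are rejected unless the allowlist is extended. The returned messages carry line numbers, so they can be fed back to the LLM in the same way as a runtime exception.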

24.8.4 What Control Flow Buys You

The leverage Code-as-Policies has over SayCan comes from Python's control structures. Three patterns matter:

Loops over object collections. "Put all the red blocks in the bin" becomes:

blocks = find_all_objects("red block")
for b in blocks:
    pick_up(b)
    place(b, "bin")

which a SayCan ranker would have to expand into ad-hoc replanning at each step.

Conditional branches. "If the door is open, walk through; otherwise, open it first" becomes:

if not is_door_open("front door"):
    open_door("front door")
go_to("outside")

Helper functions. Complex tasks can be decomposed into helper functions the LLM defines on the fly:

def stack_on(block, target):
    pick_up(block)
    place(block, target)

blocks = find_all_objects("colored block")
tower_base = blocks[0]
for b in blocks[1:]:
    stack_on(b, tower_base)
    tower_base = b

Real-World Scenario: "Sort the dishes by size"

The canonical Code-as-Policies demo. Given a scene with five dishes of varying sizes, the LLM emits a program that calls find_all_objects("dish"), sorts the list by an estimated bounding-box size, and iterates through the sorted list to place each dish in a vertical stack. A flat SayCan ranker cannot express "sort by size and iterate"; it would have to enumerate every permutation, scoring each. Code-as-Policies expresses the algorithm in three lines, and the LLM emits exactly the program a competent intern would write.
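The dish-sorting scenario can be sketched as follows. Everything outside the last four lines is scaffolding invented for this example so the program runs without a robot: the `Dish` handle, its `bbox_area` field, the stubbed skills, and the "dish stack" target name are all assumptions, not part of the published system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dish:
    """Hypothetical object handle; bbox_area stands in for the perception estimate."""
    name: str
    bbox_area: float  # estimated bounding-box area in cm^2

actions = []  # records what the stubbed skills were asked to do

def find_all_objects(description: str) -> list[Dish]:
    # Stub perception: a fixed scene with five dishes of varying sizes.
    return [Dish("saucer", 180.0), Dish("platter", 900.0), Dish("bowl", 320.0),
            Dish("dinner plate", 600.0), Dish("side plate", 400.0)]

def pick_up(obj: Dish) -> None:
    actions.append(("pick_up", obj.name))

def place(obj: Dish, target: str) -> None:
    actions.append(("place", obj.name, target))

# --- the part the LLM would write ---
dishes = sorted(find_all_objects("dish"), key=lambda d: d.bbox_area, reverse=True)
for dish in dishes:
    pick_up(dish)
    place(dish, "dish stack")  # largest first, so each dish lands on a bigger one
```

Sorting in descending order is a design choice of this sketch: placing the largest dish first keeps the stack stable, which is the kind of physical common sense the prompt examples would need to convey.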

24.8.5 The Failure Modes

Code-as-Policies has predictable failure modes that SayCan does not. The LLM can call functions that do not exist, pass arguments in the wrong order, or write programs that loop infinitely. The runtime catches these by wrapping the executor in exception handlers, timeouts, and a hard call budget. Five failure categories dominate in 2026 deployments:

| Failure category | Example | Mitigation |
|---|---|---|
| Hallucinated function | Calls vacuum_floor() when no such skill exists | Static check before exec; error feedback retry |
| Wrong argument type | Passes a str where ObjectHandle is expected | Python typing + runtime isinstance check |
| Infinite loop | "While not done: do_something()" without progress | Wall-clock timeout, max-iteration budget |
| Unsafe skill composition | Drops object while still moving, no place() call | Skill-level preconditions; safety wrapper from Section 39.6 |
| Misinterpretation | "Clean the kitchen" emits code that throws away clean dishes | Confirmation prompt; intent verification by a second LLM |
Figure 24.8.1a: Code-as-Policies failure categories, ranked by frequency in 2025 production deployments. Hallucinated functions are the most common but easiest to catch; misinterpretation is the rarest but most dangerous and is the open research problem.
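The "hard call budget" mitigation for runaway loops can be sketched by wrapping every skill in a shared counter before handing it to the generated program. The `CallBudget` and `CallBudgetExceeded` names are hypothetical, and the in-process `exec` is used only to keep the example short.

```python
import functools

class CallBudgetExceeded(RuntimeError):
    """Raised when a generated program makes more skill calls than allowed."""

class CallBudget:
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def wrap(self, skill):
        """Return a version of the skill that counts against the shared budget."""
        @functools.wraps(skill)
        def limited(*args, **kwargs):
            if self.used >= self.max_calls:
                raise CallBudgetExceeded(f"budget of {self.max_calls} skill calls exhausted")
            self.used += 1
            return skill(*args, **kwargs)
        return limited

def do_something():
    """Stub skill that never makes progress."""
    return False

budget = CallBudget(max_calls=50)
namespace = {"do_something": budget.wrap(do_something)}

# A generated program that would otherwise spin forever.
runaway = "while not do_something():\n    pass\n"
try:
    exec(runaway, namespace)
    outcome = "finished"
except CallBudgetExceeded as exc:
    outcome = f"stopped: {exc}"
```

A call budget only catches loops that keep invoking skills. A loop with no skill calls at all (for example `while True: pass`) still needs the wall-clock timeout listed in the table, which is most reliably enforced from outside the process running the generated code.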

24.8.6 The 2026 Evolution: Structured Output and Tool Calling

The pure "LLM emits Python" pattern has matured into a structured-output pattern in 2026. Instead of free-form code, the LLM emits a JSON tool-call sequence that is interpreted by the runtime. The JSON form is easier to validate (every call has a schema), easier to log (the trace is structured), and easier to checkpoint (you can resume after a crash by replaying the JSON log). For routine plans the expressive power is comparable, because the JSON sequence supports conditionals and loops via nested structure or via an explicit "control-flow" command.

# The 2026 structured-output successor to Code-as-Policies.
plan = [
    {"call": "go_to", "args": {"location": "kitchen"}},
    {"call": "find_all_objects", "args": {"description": "red block"}, "bind": "blocks"},
    {
        "for_each": "blocks", "as": "b",
        "body": [
            {"call": "pick_up", "args": {"obj": "$b"}},
            {"call": "place", "args": {"obj": "$b", "target": "bin"}},
        ],
    },
]
Code Fragment 24.8.4: The red-block loop from Section 24.8.4, preceded by a go_to call and expressed as a structured tool-call sequence. The JSON form is what modern LLM APIs (OpenAI function calling, Anthropic tool use, Gemini tool definitions) natively emit. The semantics match the Python version; the validation and logging are much easier.
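To make "interpreted by the runtime" concrete, here is a minimal sketch of an interpreter for the step format used in Code Fragment 24.8.4. The keys (call, args, bind, for_each, as, body) come from that fragment; the `run_plan` function, the rule that a "$name" string refers to a bound variable, and the lambda skills are assumptions introduced for illustration.

```python
def run_plan(plan, skills, env=None):
    """Execute a list of tool-call steps against a dict of skill callables."""
    env = {} if env is None else env

    def resolve(value):
        # "$name" refers to a previously bound variable; anything else is a literal.
        if isinstance(value, str) and value.startswith("$"):
            return env[value[1:]]
        return value

    for step in plan:
        if "call" in step:
            if step["call"] not in skills:
                raise KeyError(f"unknown skill: {step['call']}")
            kwargs = {k: resolve(v) for k, v in step.get("args", {}).items()}
            result = skills[step["call"]](**kwargs)
            if "bind" in step:
                env[step["bind"]] = result
        elif "for_each" in step:
            for item in env[step["for_each"]]:
                env[step["as"]] = item
                run_plan(step["body"], skills, env)
        else:
            raise ValueError(f"unrecognized step: {step}")
    return env

# Stub skills that log instead of moving a robot.
log = []
skills = {
    "go_to": lambda location: log.append(f"go_to {location}"),
    "find_all_objects": lambda description: ["block_1", "block_2"],
    "pick_up": lambda obj: log.append(f"pick_up {obj}"),
    "place": lambda obj, target: log.append(f"place {obj} -> {target}"),
}
plan = [
    {"call": "go_to", "args": {"location": "kitchen"}},
    {"call": "find_all_objects", "args": {"description": "red block"}, "bind": "blocks"},
    {"for_each": "blocks", "as": "b", "body": [
        {"call": "pick_up", "args": {"obj": "$b"}},
        {"call": "place", "args": {"obj": "$b", "target": "bin"}},
    ]},
]
run_plan(plan, skills)
```

Because every step is a plain dictionary, an unknown skill name is rejected before anything executes for that step, and the list of executed steps doubles as a structured trace.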
Research Frontier: When code wins over JSON

Even in 2026, free-form Python still beats structured JSON on three task categories: (a) tasks needing rich arithmetic on intermediate quantities ("place the block 5 cm to the left of the rightmost cup"); (b) tasks needing list comprehensions or other functional patterns; and (c) tasks needing user-defined helper functions that get reused later. The dominant production pattern is "JSON for routine calls, Python escape hatch for the genuinely-programmatic tasks", with the LLM choosing between the two modes based on complexity. The OpenAI Code Interpreter and Anthropic's code execution tool for Claude are concrete examples of this hybrid in chat-LLM products; the same pattern is reaching robotics in 2026.

Key Takeaway

Code-as-Policies has the LLM emit Python (or structured JSON) that calls into a robot skill API. The result is a planner with loops, conditionals, and helper functions, which expresses tasks SayCan's flat skill ranking cannot. The runtime is a sandboxed Python interpreter with the skill module in scope plus an exception-feedback retry loop. The 2026 successor pattern uses structured JSON tool-call sequences instead of free-form code, with a Python escape hatch for the genuinely-programmatic minority of tasks.

Self-Check
Q1: Express "stack three red blocks on top of the blue block" as both (a) a SayCan-style skill ranking sequence and (b) a Code-as-Policies Python program. Which representation is shorter? Which is easier to debug?
Answer:
SayCan would enumerate roughly nine skill invocations: detect blue block, pick red block 1, place on blue block, detect stack top, pick red block 2, place on stack top, repeat for red block 3, then terminate. Code-as-Policies collapses this into a three-line Python loop: for red in get_red_blocks()[:3]: pick(red); place_on(get_stack_top(blue_block)). The code form is shorter because it expresses the repetition as a loop rather than unrolling it. Debugging favors Code-as-Policies because exceptions carry stack traces, line numbers, and variable values that you can replay; SayCan's flat skill list only tells you which step failed, not why the planner chose it.
Q2: An exception-feedback retry loop like the one in Code Fragment 24.8.3 would run forever without its retry cap, and even with the cap it can waste every attempt if the LLM keeps emitting the same buggy program. Sketch one mitigation that does not require manual intervention.
Answer:
Cap the loop at a small retry budget (typically three to five attempts), then escalate to a different strategy rather than abort. Two automated escalations work in practice: (a) fall back to a more capable model on the final retry (e.g., switch from Haiku to Opus), since the cheap model may have a blind spot the expensive one resolves; (b) decompose the failing task into two simpler sub-tasks and re-issue each independently, which reduces the chance that the planner repeats the same structural error. Either path bounds total compute and drops the agent into a safe idle state if the retries are exhausted.
Q3: Why is sandboxing the LLM-generated code more critical for robotics than for a coding assistant like Claude Code? Identify two failure modes that are catastrophic in robotics but merely annoying in a developer tool.
Answer:
A coding assistant's worst-case generated code damages files on disk, which is recoverable from version control. A robot's worst-case generated code drives a kilogram-scale actuator into a person or breaks a costly end-effector against a fixed obstacle. Failure mode one: an unbounded velocity command in a buggy loop (a swapped sign in set_velocity(v)); in software you see an infinite loop, in robotics you see a hardware crash. Failure mode two: file-system or shell-escape that lets the model bypass the action allowlist (e.g., calling os.system or a raw ROS topic publish), bypassing the joint-limit and workspace-bound guards that the skill API enforces. Sandboxing for robotics therefore restricts the call graph to a typed skill module, blocks all OS and network access, and enforces wall-clock and command-rate limits at the interpreter boundary.
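As a companion to the answer to Q2, the model-escalation idea can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `plan_with_escalation`, the callable model interface, and the fake models and executor are all hypothetical stand-ins.

```python
def plan_with_escalation(models, prompt, execute, retries_per_model=2):
    """Try each model in order (cheapest first); return the program that succeeded, or None.

    models:  list of callables mapping prompt text to program text (hypothetical LLM clients)
    execute: callable that runs a program and raises an exception on failure
    """
    seen = set()
    for model in models:
        feedback = prompt
        for _ in range(retries_per_model):
            program = model(feedback)
            if program in seen:
                # The model repeated a known-bad program; stop spending retries on it.
                break
            seen.add(program)
            try:
                execute(program)
                return program
            except Exception as exc:
                feedback += f"\n# Previous attempt raised {type(exc).__name__}: {exc}\n"
    return None  # caller should put the robot into a safe idle state

def cheap_model(prompt):
    return "vacuum_floor()"  # always emits the same hallucinated call

def strong_model(prompt):
    return "go_to('kitchen')"

def execute(program):
    # Fake executor: fails on the hallucinated skill, succeeds otherwise.
    if "vacuum_floor" in program:
        raise NameError("name 'vacuum_floor' is not defined")

result = plan_with_escalation(
    [cheap_model, strong_model], "Instruction: go to the kitchen", execute
)
```

Tracking previously seen programs lets the loop detect a stuck model after a single repeat instead of burning the rest of that model's retries, and the total number of LLM calls stays bounded by the number of models times the per-model budget.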
What's Next

Continue to Section 24.9: VoxPoser: Language as Spatial Cost Field.

Section 24.9 covers VoxPoser, which keeps the LLM-as-planner pattern but changes the output: instead of a skill list or Python program, the LLM emits a 3D cost field that a classical optimizer turns into a trajectory. The shift gives a fundamentally different way to ground language in space.

Further Reading
Liang, J., et al. (2023). Code as Policies: Language Model Programs for Embodied Control. ICRA 2023, arXiv:2209.07753.
Singh, I., et al. (2023). ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. ICRA 2023, arXiv:2209.11302.
Wang, G., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. NeurIPS 2023 Workshop, arXiv:2305.16291.
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023, arXiv:2302.04761.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023, arXiv:2210.03629.
Anthropic. (2024). Tool Use with the Claude API. Anthropic Documentation.