"I can reason about the world with words. Now give me a body, and I will reason about it with actions."
Deploy, Freshly Embodied AI Agent
LLMs are extending from digital text into the physical world and the frontiers of science. In robotics, LLMs serve as high-level planners that translate natural language instructions into sequences of robot actions. In web and OS automation, they operate as agents that navigate interfaces, fill forms, and complete tasks on behalf of users. In scientific discovery, they mine literature, generate hypotheses, prove theorems, and design experiments. These applications represent the cutting edge of what LLMs can do when connected to real-world actuators and scientific knowledge. The tool use and agent planning patterns from Section 22.2 and Section 22.3 are the core enabling capabilities.
Prerequisites
This section requires familiarity with the application patterns from Section 28.1 through Section 28.6. Understanding evaluation basics from Section 29.1 will help with assessing the production readiness of LLM applications.
1. LLMs as Robot Planners
The key insight behind using LLMs for robotics is that language models possess extensive world knowledge about objects, their properties, and how they relate to each other.
When Google first tested SayCan, the robot correctly understood "I spilled my drink, can you help?" and fetched a sponge. It also tried to "clean up" a coworker's lunch. World knowledge, it turns out, needs better boundaries.
A human saying "make me a sandwich" implies a sequence of actions (get bread, get ingredients, assemble, plate) that an LLM can decompose into steps. The challenge is grounding these steps in the robot's actual physical capabilities and environment.
SayCan: Grounding Language in Robot Actions
Google's SayCan combines an LLM's knowledge of what makes sense to do with a robot's learned affordances (what it can physically do). The LLM proposes candidate next actions, and a value function scores each action based on whether the robot can actually execute it in the current state. This product of "what should I do" (LLM) and "what can I do" (affordance model) produces grounded action plans that are both semantically correct and physically feasible. Figure 28.7.1 shows the SayCan architecture.
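The product of the two scores can be illustrated with a minimal sketch. This is not the SayCan implementation, only the selection rule it describes: the candidate scores here are made-up numbers standing in for LLM token likelihoods and a learned value function.

```python
# Illustrative sketch of SayCan-style action selection (not the original
# implementation): combine an LLM's task-relevance score with a learned
# affordance score by taking their product.

def select_action(candidates, llm_score, affordance_score):
    """Pick the candidate maximizing P(useful | instruction) * P(feasible | state)."""
    scored = {a: llm_score(a) * affordance_score(a) for a in candidates}
    return max(scored, key=scored.get)

# Toy scores for "I spilled my drink, can you help?": picking up the sponge
# is both semantically sensible AND physically feasible, so it wins.
candidates = ["pick up the sponge", "pick up the drink", "go to the table"]
llm = {"pick up the sponge": 0.7, "pick up the drink": 0.2, "go to the table": 0.1}
aff = {"pick up the sponge": 0.9, "pick up the drink": 0.1, "go to the table": 0.8}

best = select_action(candidates, llm.get, aff.get)
print(best)  # pick up the sponge
```

Note that "go to the table" has a high affordance score (it is easy to do) but a low semantic score, while "pick up the drink" is semantically plausible but infeasible; only the product surfaces the grounded choice.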
RT-2: Vision-Language-Action Models
Google's RT-2 (Robotics Transformer 2) takes grounding further by training a single vision-language model that directly outputs robot actions. The model processes camera images and language instructions and outputs discretized action tokens (arm positions, gripper states). By co-training on both internet-scale vision-language data and robot demonstration data, RT-2 acquires emergent reasoning capabilities: it can follow instructions involving concepts never seen during robot training (like "move the object to the picture of a country" by recognizing flags). Code Fragment 28.7.2 below puts this into practice.
# Conceptual: LLM as robot task planner
from openai import OpenAI
import json

client = OpenAI()

def plan_robot_actions(
    instruction: str,
    available_actions: list,
    scene_description: str,
) -> list:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a robot task planner.
Available primitive actions: {json.dumps(available_actions)}
Current scene: {scene_description}
Decompose the instruction into a sequence of available actions.
Return a JSON object with a 'steps' array; each step has
'action' and 'target' keys."""},
            {"role": "user", "content": instruction},
        ],
        # JSON mode guarantees a JSON *object*, so we ask for the array
        # under a 'steps' key rather than a bare top-level array.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

plan = plan_robot_actions(
    instruction="Put the dirty dishes in the dishwasher",
    available_actions=["pick", "place", "open", "close", "navigate"],
    scene_description="Kitchen counter with 3 plates and 2 cups. Dishwasher closed.",
)
print(json.dumps(plan, indent=2))
2. Vision-Language-Action Models: The VLA Revolution
The progression from SayCan to modern VLA models traces a clear arc toward tighter integration between language understanding and physical control. SayCan kept the LLM and the robot's affordance model as separate modules, with the LLM proposing actions and a learned value function filtering them. RT-2 collapsed this into a single vision-language model that outputs discretized action tokens directly. But the field has moved considerably further since then.
When experimenting with VLA models for robotics, always start in simulation (Isaac Sim, MuJoCo) before going to real hardware. A single bug in an action prediction can physically damage a robot arm or its environment. Build a sim-to-real validation pipeline: if the model's predicted trajectory diverges more than a threshold from the sim-validated trajectory, halt execution and log the discrepancy for review.
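The halt-on-divergence check described above can be sketched in a few lines. The 5 cm threshold and the waypoint-list trajectory format are illustrative assumptions, not values from any particular system:

```python
# Illustrative sim-to-real guard (assumed threshold and trajectory format):
# halt execution when the model's predicted trajectory diverges from the
# sim-validated one by more than a per-waypoint distance threshold.
import math

def max_divergence(predicted, validated):
    """Largest Euclidean distance between corresponding 3D waypoints."""
    return max(math.dist(p, v) for p, v in zip(predicted, validated))

def safe_to_execute(predicted, validated, threshold_m=0.05):
    d = max_divergence(predicted, validated)
    if d > threshold_m:
        # In a real pipeline this would also log the discrepancy for review.
        print(f"HALT: divergence {d:.3f} m exceeds {threshold_m} m")
        return False
    return True

pred = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.2, 0.0, 0.4)]
sim  = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.21), (0.2, 0.0, 0.55)]
print(safe_to_execute(pred, sim))  # False: last waypoint diverges by 0.15 m
```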
OpenVLA: Open-Source VLA at Scale
OpenVLA (2024) demonstrated that a 7-billion-parameter open-source VLA can outperform the 55-billion-parameter RT-2-X on generalized robot manipulation benchmarks. Built on a Llama 2 backbone with a SigLIP vision encoder, OpenVLA was trained on the Open X-Embodiment dataset (described below) and released with fully open weights. This is significant because it shows that careful data curation and architecture design can compensate for raw parameter count, a pattern familiar from the efficient model discussions in Section 16.1.
pi0 and pi0.5: Flow Matching for Continuous Control
Physical Intelligence's pi0 (2024) and its successor pi0.5 (2025) take a fundamentally different approach to action generation. Instead of discretizing robot actions into tokens (as RT-2 and OpenVLA do), pi0 uses flow matching to produce continuous action trajectories. Flow matching learns a velocity field that transports a noise distribution to the target action distribution along straight paths, similar to how flow-matching image generators work (see Section 27.1). This enables smoother, more precise motor control, particularly for dexterous manipulation tasks like folding laundry or assembling objects. pi0.5 extends this with a hierarchical architecture: a high-level VLM plans task steps in language, while a low-level flow-matching policy generates the actual motor commands.
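The inference side of flow matching can be made concrete with a toy one-dimensional example. This is a deliberately simplified sketch, not pi0's model: the "learned" velocity field is replaced by the closed-form straight-line field toward a fixed target, and we integrate it from noise with Euler steps.

```python
# Toy sketch of flow-matching inference: given a velocity field v(x, t),
# integrate from Gaussian noise at t=0 to an action sample at t=1.
import random

def velocity_field(x, t, target=0.8):
    # Stand-in for a learned network. For a straight-line flow toward a
    # fixed target, the velocity along the path is (target - x) / (1 - t).
    return (target - x) / (1.0 - t) if t < 1.0 else 0.0

def sample_action(n_steps=100):
    x = random.gauss(0.0, 1.0)           # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x += velocity_field(x, t) * dt   # Euler step along the flow
    return x

random.seed(0)
print(round(sample_action(), 3))  # 0.8 — any noise sample is transported to the target
```

In a real policy the target is not fixed: the velocity field is a network conditioned on the image, instruction, and time, and `x` is a full multi-joint action chunk rather than a scalar.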
GR00T N1: Open Humanoid Foundation Model
NVIDIA's GR00T N1 (2025) is an open foundation model specifically designed for humanoid robots. It combines a vision-language backbone for scene understanding with a diffusion-based action head for whole-body control, including locomotion, manipulation, and coordination of dozens of joints simultaneously. GR00T N1 was trained in NVIDIA's Isaac Sim environment and supports zero-shot transfer to multiple humanoid platforms. Its release as an open model follows the pattern established by OpenVLA: making robot foundation models accessible to the broader research community rather than keeping them behind proprietary walls.
VLA Model Comparison
| Model | Architecture | Action Representation | Training Data | Open Source |
|---|---|---|---|---|
| SayCan | Separate LLM + affordance model | Discrete primitives (pick, place) | Robot demos + LLM pretraining | No |
| RT-2 | End-to-end VLM with action tokens | Discretized joint positions (256 bins) | Web data + robot episodes | No |
| OpenVLA | Llama 2 + SigLIP (7B params) | Discretized action tokens | Open X-Embodiment (970K episodes) | Yes |
| pi0 / pi0.5 | VLM + flow-matching action head | Continuous trajectories via flow matching | Proprietary multi-task demonstrations | No |
| GR00T N1 | Vision-language + diffusion action head | Continuous whole-body joint commands | Isaac Sim + real demonstrations | Yes |
3. Foundation Models and Cross-Embodiment Transfer
A central challenge in robot learning is that data is scarce and expensive to collect. Every robot lab has its own hardware, environment, and task set. The Open X-Embodiment project (a collaboration of 33 research labs) addresses this by pooling over one million robot episodes from 22 different robot types into a single dataset. The motivation is directly analogous to LLM pretraining: just as GPT-3 benefits from training on diverse web text, a robot foundation model benefits from training on diverse robot experiences. The same principles of scale and diversity that drive language model performance (see Section 06.1) apply to robot data.
Generalist Policies: Octo and CrossFormer
Octo (2024) is a generalist robot policy trained on the Open X-Embodiment dataset. It uses a transformer architecture that processes both visual observations and language instructions, outputting actions for any robot represented in the training set. Crucially, Octo is designed for fine-tuning: you can take the pretrained policy and adapt it to a new robot with as few as 50 demonstrations. This mirrors the pretrain-then-fine-tune paradigm that dominates NLP (see Section 14.1 for fine-tuning fundamentals and Section 15.1 for parameter-efficient approaches).
CrossFormer extends this idea by using a modular architecture where robot-specific tokenizers handle the differences in observation and action spaces across embodiments, while a shared transformer backbone captures the common structure of manipulation tasks. This separation of embodiment-specific and embodiment-general components is conceptually similar to adapter-based fine-tuning in NLP, where a shared backbone is combined with small task-specific modules (as described in Section 15.2).
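The separation of embodiment-specific tokenizers from a shared backbone can be sketched as follows. The interfaces and observation formats here are illustrative assumptions, not the CrossFormer codebase:

```python
# Illustrative sketch: embodiment-specific tokenizers map heterogeneous
# observation spaces into a shared token space consumed by one backbone.

class ArmTokenizer:
    def encode(self, obs):            # 7-DoF arm: joint angles
        return [("joint", q) for q in obs["joints"]]

class QuadrupedTokenizer:
    def encode(self, obs):            # quadruped: 12 joints + gravity vector
        return ([("joint", q) for q in obs["joints"]]
                + [("gravity", g) for g in obs["gravity"]])

class SharedBackbone:
    """Stand-in for the shared transformer: here it just counts tokens."""
    def forward(self, tokens):
        return f"processing {len(tokens)} tokens"

TOKENIZERS = {"arm": ArmTokenizer(), "quadruped": QuadrupedTokenizer()}
backbone = SharedBackbone()

def policy(embodiment, obs):
    tokens = TOKENIZERS[embodiment].encode(obs)   # embodiment-specific
    return backbone.forward(tokens)               # embodiment-general

print(policy("arm", {"joints": [0.1] * 7}))
print(policy("quadruped", {"joints": [0.0] * 12, "gravity": [0, 0, -1]}))
```

Adding a new robot means writing only a new tokenizer entry; the shared backbone, and everything it learned about manipulation, is reused unchanged.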
The analogy between robot foundation models and LLM pretraining runs deep. Both rely on large, diverse datasets to build general capabilities. Both use a pretrain-then-fine-tune workflow. Both benefit from scale, with larger models and more data yielding better generalization. The key difference is that robot data involves physical actions in 3D space rather than token sequences, which makes collection orders of magnitude more expensive per sample.
4. Sim-to-Real: LLMs in the Training Loop
Collecting real-world robot data is slow, expensive, and potentially dangerous. Simulation offers a way to generate virtually unlimited training data, but bridging the gap between simulated and real-world performance (the "sim-to-real" problem) has historically been a major bottleneck. LLMs are now being used to address multiple aspects of this challenge.
Eureka and DrEureka: LLM-Generated Reward Functions
Reward design is one of the hardest parts of robot reinforcement learning. Human experts spend weeks crafting reward functions for each new task, and small errors in the reward can lead to unexpected or unsafe behaviors. NVIDIA's Eureka (2023) uses an LLM to automatically generate reward functions from task descriptions. Given a text description like "make the robot hand rotate a pen," Eureka prompts GPT-4 to write Python reward functions, evaluates them in simulation, and iteratively refines the best candidates using the LLM's own analysis of the training curves. Eureka-generated rewards outperformed expert human-designed rewards on 83% of tasks tested, including achieving a new state-of-the-art for dexterous pen spinning. DrEureka extends this approach to handle domain randomization parameters, automatically tuning the simulation's physics to improve real-world transfer.
Eureka reveals something profound about LLMs: they can design reward functions that outperform human experts because they can iterate at machine speed. A human reward engineer might try 5 to 10 variations over a week, relying on intuition about what will produce the desired behavior. Eureka generates hundreds of candidates, evaluates them in simulation, and uses the training curves as feedback to refine the next batch. This is a meta-level application of the same generate-evaluate-refine loop that powers agentic reasoning (Section 22.3), but applied to the reward design process itself rather than to a single task.
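The generate-evaluate-refine loop has a simple schematic form. The helper callables below are stand-ins: in the real system, `generate_rewards` prompts GPT-4 for Python reward code, `evaluate` trains an RL policy in simulation, and `refine_prompt` summarizes the training curves into LLM feedback.

```python
# Schematic of an Eureka-style reward search loop (assumed helper
# functions, injected so the loop itself is testable).

def eureka_loop(task_description, generate_rewards, evaluate, refine_prompt,
                n_iterations=3, batch_size=4):
    best_reward_fn, best_score = None, float("-inf")
    feedback = ""
    for _ in range(n_iterations):
        # 1. Generate: ask the LLM for a batch of candidate reward functions.
        candidates = generate_rewards(task_description, feedback, batch_size)
        # 2. Evaluate: train a policy under each candidate, record task success.
        scores = [evaluate(fn) for fn in candidates]
        top = max(range(len(candidates)), key=lambda i: scores[i])
        if scores[top] > best_score:
            best_reward_fn, best_score = candidates[top], scores[top]
        # 3. Refine: turn the best candidate's results into next-round feedback.
        feedback = refine_prompt(candidates[top], scores[top])
    return best_reward_fn, best_score
```

The machine-speed advantage lives in step 2: because evaluation is simulated, `batch_size` can be in the hundreds, where a human engineer manages a handful of iterations per week.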
Synthetic Training Data at Scale
NVIDIA's combination of Isaac Sim (a GPU-accelerated physics simulator) with Cosmos (a world foundation model) enables LLM-guided generation of massive synthetic training datasets. By using language prompts to specify scenarios ("a robot arm picking up a red cup from a cluttered desk"), the system can generate 780,000 training trajectories in just 11 hours. This approach directly applies the synthetic data generation principles from Chapter 13 to the robotics domain, where the stakes of data scarcity are even higher because real-world data collection involves physical hardware, safety constraints, and wall-clock time.
World Models: Text to Interactive 3D Environments
Genie 3 and similar world models take simulation generation one step further by creating interactive 3D environments from text or image prompts. Rather than rendering predefined simulation scenes, these models generate entirely new environments that a robot policy can interact with, providing a training distribution that is broader and more diverse than any hand-designed simulator. This is analogous to how LLMs generate diverse text for training data (see Section 13.2), but extended to 3D physical scenes with interactive dynamics.
5. Safety and Grounding in Physical Systems
When an LLM hallucinates in a text conversation, the consequence is a wrong answer. When an LLM hallucinates while controlling a robot, the consequence can be a broken object, a damaged machine, or an injured person. This fundamental difference makes safety not just important but essential in embodied AI systems. The safety principles from Chapter 32 apply with amplified urgency in physical settings.
The Symbol Grounding Problem
To understand why embodied AI is fundamentally harder than text-based AI, consider what happens when you tell a robot to "gently place the egg on the counter." The word "gently" carries enormous physical specificity that is completely absent from its token representation. For a human, "gently" activates a lifetime of sensorimotor experience: the feeling of fragile objects, the muscle memory of controlled deceleration, the visual feedback of watching something land softly. For an LLM, "gently" is a statistical relationship to other tokens. Bridging this gap, translating linguistic concepts into precise physical parameters, is the central challenge of embodied AI and the reason why robot foundation models need physical demonstration data in addition to text pretraining.
LLMs operate on symbols (words, tokens), but the physical world operates on continuous quantities (forces, positions, velocities). The symbol grounding problem asks: how do we ensure that the LLM's internal representations of concepts like "fragile," "heavy," or "sharp" correspond to the actual physical properties that matter for safe robot operation? An LLM might know that eggs are fragile (from text), but translating "fragile" into the correct grip force requires a grounding mechanism that connects linguistic knowledge to physical parameters. Current VLA models learn this grounding implicitly from demonstration data, but failures still occur when encountering objects or situations outside the training distribution.
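A minimal grounding mechanism might look like the sketch below. The force values and label set are hypothetical; the point is the structure: linguistic labels index into a table distilled from physical demonstrations, and a hard clamp ensures an ungrounded guess can never exceed hardware limits.

```python
# Illustrative grounding sketch (assumed force values): map an LLM's
# linguistic property label to a grip-force parameter, clamped to safe limits.

# Hypothetical lookup distilled from demonstrations, not from text alone.
FRAGILITY_TO_FORCE_N = {"fragile": 5.0, "normal": 15.0, "rigid": 30.0}
MAX_SAFE_FORCE_N = 35.0

def grip_force(llm_label, llm_force_hint=None):
    """Ground a linguistic label in newtons, clamped to hardware limits."""
    force = FRAGILITY_TO_FORCE_N.get(llm_label, 15.0)   # default to "normal"
    if llm_force_hint is not None:
        # Trust the LLM's numeric hint only up to the labeled band.
        force = min(llm_force_hint, force)
    return min(force, MAX_SAFE_FORCE_N)

print(grip_force("fragile"))        # 5.0 — an egg gets a gentle grip
print(grip_force("rigid", 50.0))    # 30.0 — hint exceeds the band, clamped
```

Out-of-distribution failures show up here as the `.get` default: an object whose label the table has never seen falls back to a conservative middle value rather than an arbitrary one.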
RoboGuard: Formal Safety Constraints
RoboGuard (2024) addresses robot safety by combining LLM planning with formal temporal logic constraints. Instead of relying solely on the LLM's judgment about what is safe, RoboGuard translates safety requirements ("never move the arm above the table while carrying a glass of water," "always verify grip force before lifting") into signal temporal logic (STL) formulas. The robot's planned trajectory is then checked against these formal constraints before execution. If the plan violates any safety specification, it is rejected and the LLM is asked to re-plan. This approach provides mathematically verifiable safety guarantees rather than relying on the LLM's probabilistic understanding of safety, which can fail in subtle ways.
LLM hallucination in physical systems is not merely an accuracy problem; it is a safety problem. A robot that confidently executes a hallucinated plan can cause real damage. Always combine LLM-based planning with independent safety verification: affordance checking, force/torque limits, workspace boundaries, and formal constraint verification. The principle of defense in depth (multiple independent safety layers, described in Section 32.3) is especially critical when LLMs have physical agency.
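Defense in depth reduces to a simple invariant: a plan executes only if every independent layer passes. A minimal sketch, with assumed action sets, force limits, and workspace bounds:

```python
# Defense-in-depth sketch (assumed limits): any single failing layer
# rejects the LLM's plan before execution.

def affordance_ok(plan):
    return all(step["action"] in {"pick", "place", "open"} for step in plan)

def force_ok(plan):
    return all(step.get("force_n", 0) <= 35.0 for step in plan)

def workspace_ok(plan):
    return all(abs(step.get("x", 0)) <= 0.8 for step in plan)

SAFETY_LAYERS = [affordance_ok, force_ok, workspace_ok]

def verify_plan(plan):
    """Independent checks: execute only if every layer passes."""
    return all(layer(plan) for layer in SAFETY_LAYERS)

plan = [{"action": "pick", "force_n": 12.0, "x": 0.3},
        {"action": "place", "force_n": 8.0, "x": 1.2}]   # x out of bounds
print(verify_plan(plan))  # False — the workspace layer rejects the plan
```

Crucially, none of these layers consults the LLM: each check is computed from independent state, so a hallucinated plan cannot talk its way past them.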
6. Web Automation and Browser Agents
Web automation agents use LLMs to navigate websites, fill forms, click buttons, and complete tasks that normally require human interaction. These agents observe the page (through screenshots, accessibility trees, or DOM parsing), decide what action to take, execute it, and observe the result. This is the same agentic loop from Chapter 22, applied to browser environments. Code Fragment 28.7.5 below puts this into practice.
# Conceptual: web automation agent using browser tools
from openai import OpenAI
import json

client = OpenAI()

def web_agent_step(task: str, page_state: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a web automation agent.
Given the current page state, decide the next action to complete
the task. Actions: click(selector), type(selector, text),
navigate(url), scroll(direction), wait(), done(result).
Return JSON with 'thought', 'action', and 'params'."""},
            {"role": "user", "content": f"""Task: {task}
Page title: {page_state['title']}
Interactive elements: {json.dumps(page_state['elements'])}"""},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
OS-Level Agents and Computer Use
OS-level agents extend web automation to the entire desktop. Anthropic's Computer Use API, for example, lets Claude interact with a computer through screenshots and mouse/keyboard actions. The agent observes the screen, reasons about what it sees, and executes actions like clicking buttons, typing text, or switching between applications. This capability enables automation of tasks that span multiple applications (like copying data from a spreadsheet to an email) without requiring application-specific APIs.
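The observe-think-act loop these agents share can be sketched independently of any vendor API. The `observe`, `decide`, and `act` callables below are stand-ins: in a real system, `observe` takes a screenshot, `decide` calls a vision-capable model, and `act` drives the mouse and keyboard.

```python
# Schematic observe-think-act loop for an OS-level agent, with the
# environment interactions injected as callables so the control flow
# is testable without a desktop.

def run_agent(task, observe, decide, act, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = observe()                      # e.g., screenshot of current state
        step = decide(task, screen, history)    # model picks the next action
        if step["action"] == "done":
            return step.get("result")
        act(step)                               # e.g., click / type / scroll
        history.append(step)
    return None   # step budget exhausted without completing the task
```

The `max_steps` budget matters in practice: agents that misread the screen can loop indefinitely, so production deployments bound both step count and wall-clock time.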
| Agent Type | Environment | Observation | Actions | Example |
|---|---|---|---|---|
| Robot planner | Physical world | Camera, sensors | Pick, place, navigate | SayCan, RT-2 |
| Web agent | Browser | DOM, screenshots | Click, type, navigate | WebArena, BrowserGym |
| OS agent | Desktop | Screenshots | Mouse, keyboard | Computer Use, OSWorld |
| Code agent | IDE / terminal | Files, outputs | Read, write, execute | Claude Code, Devin |
Evaluating agents that interact with real environments requires specialized benchmarks. WebArena tests web agents on realistic tasks (managing e-commerce sites, forums). OSWorld benchmarks OS-level agents on desktop tasks across operating systems. SQA (Situated Question Answering) tests robot understanding of physical environments. These benchmarks reveal that current agents succeed at simple, well-defined tasks but struggle with multi-step sequences, error recovery, and tasks requiring spatial reasoning or common sense about the physical world.
7. AI for Mathematics and Theorem Proving
LLMs are making significant inroads in mathematical reasoning and formal theorem proving. Google's AlphaProof and AlphaGeometry achieved silver-medal-equivalent performance at the 2024 International Mathematical Olympiad, solving 4 of 6 problems by combining LLMs for informal reasoning with formal verification systems for proof checking. These systems represent a new paradigm where AI augments mathematical discovery rather than just calculation. Code Fragment 28.7.3 below puts this into practice.
# Using an LLM for mathematical reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a mathematical reasoning assistant.
Work through problems step by step. Show all reasoning.
When uncertain, explore multiple approaches. Verify your
answer by checking boundary cases and special values."""},
        {"role": "user", "content": """Prove that for any positive integer n,
the sum 1 + 2 + ... + n = n(n+1)/2.
Use mathematical induction."""},
    ],
)
print(response.choices[0].message.content)
8. Scientific Literature Mining and Hypothesis Generation
The scientific literature grows by millions of papers per year, making it impossible for any researcher to stay current even in a narrow field. LLMs can mine this literature to identify connections between findings, generate novel hypotheses, and suggest experimental designs. Systems like Semantic Scholar's AI-powered features and specialized scientific LLMs (Galactica, SciBERT) demonstrate how language models can accelerate the scientific discovery process. Figure 28.7.2 illustrates the AI-assisted scientific discovery pipeline.
The common thread across robotics, web automation, and scientific discovery is the LLM's role as a reasoning and planning layer that sits above domain-specific execution systems. In robotics, the LLM plans while robot controllers execute. In web automation, the LLM decides while browser APIs act. In science, the LLM hypothesizes while experiments validate. This separation of reasoning (LLM strength) from execution (domain-specific strength) is a powerful architectural pattern that enables LLMs to extend into virtually any domain with appropriate grounding.
When LLMs control physical systems (robots, industrial equipment) or have broad computer access (OS agents), the consequences of errors become physical and potentially dangerous. A misplanned robot action can break objects or injure people. An OS agent with unrestricted access can delete files or send unauthorized messages. Robust safety measures are essential: action validation before execution, restricted action spaces, human approval for irreversible actions, sandbox environments for testing, and continuous monitoring of agent behavior.
Who: Quality assurance team at a SaaS company with a complex web application
Situation: The application had 400+ pages and 2,000+ interactive elements. The QA team maintained 1,200 Selenium tests, but tests broke frequently after UI changes, and writing tests for new features took 2 to 3 days per feature.
Problem: Maintaining brittle CSS-selector-based tests consumed 40% of QA engineering time. Tests that relied on specific element IDs broke whenever the frontend framework was updated.
Decision: The team supplemented traditional tests with LLM-powered browser agents using Playwright and Claude's computer use capability. Instead of CSS selectors, agents navigated by visual understanding and natural language descriptions of UI elements.
How: Test cases were written as natural language scenarios: "Navigate to the billing page, change the payment method to a Visa ending in 4242, confirm the change, and verify the success message appears." The agent took screenshots at each step, used vision capabilities to identify UI elements, and executed actions through Playwright. A verification step compared the final state against expected outcomes described in natural language.
Result: Test maintenance time dropped by 70% because natural language test descriptions survived UI redesigns. New feature test coverage improved from 60% to 90% because writing natural language test cases was faster than coding Selenium scripts. The approach also caught visual regressions (misaligned elements, color changes) that traditional tests missed entirely.
Lesson: LLM-powered browser agents are most valuable when UI changes frequently. Natural language test specifications are more resilient than CSS selectors, and vision capabilities enable testing visual properties that traditional automation frameworks cannot assess.
Tools for building browser and OS agents (2024/2025). The browser agent ecosystem has matured significantly. Playwright MCP (Model Context Protocol server) lets LLMs control browsers through a standardized tool interface. Browser Use (open source, Python) provides a high-level agent framework that wraps Playwright with screenshot-based navigation and automatic element detection. Stagehand (by Browserbase) combines DOM parsing with vision models for more reliable element identification. For OS-level agents, Anthropic's computer use API enables Claude to take screenshots, move the mouse, click, and type, operating any desktop application. Apple's on-device models power Apple Intelligence features that understand app context and perform cross-app actions. For production deployments, always run browser agents in isolated containers (Docker, cloud VMs) to prevent unintended access to sensitive systems, and implement action allowlists that restrict which URLs the agent can visit and which form fields it can fill.
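The action allowlist mentioned above can be as small as a single gate in front of every navigation action. The domains here are hypothetical placeholders:

```python
# Minimal URL allowlist sketch (assumed domains): check every navigation
# action before the browser agent executes it.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"app.example.com", "docs.example.com"}   # hypothetical

def navigation_allowed(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

print(navigation_allowed("https://app.example.com/billing"))   # True
print(navigation_allowed("https://evil.example.net/phish"))    # False
print(navigation_allowed("http://app.example.com/"))           # False: not https
```

Parsing the URL (rather than substring matching) matters: a naive check like `"app.example.com" in url` would pass `https://evil.net/?q=app.example.com`.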
Who: Robotics engineering team at an e-commerce fulfillment center
Situation: The warehouse used robotic arms for picking items from bins, but the system failed on 15% of picks involving novel items not seen during training (new products added weekly).
Problem: Traditional vision-based grasping models required retraining whenever new product categories were introduced, creating a 2-week lag between product listing and automated fulfillment capability.
Dilemma: Retraining the grasping model weekly was expensive and risked degrading performance on existing categories. Manual picking for novel items was a bottleneck that negated the efficiency gains from automation.
Decision: The team integrated a VLM (vision-language model) into the picking pipeline using a SayCan-inspired approach: the VLM identified objects and suggested grasp strategies, while an affordance model scored which strategies were physically feasible for the robot.
How: The VLM received a camera image of the bin and a text description of the target item, then proposed grasp approaches (top, side, pinch). The affordance model filtered proposals based on the robot's physical capabilities and bin geometry. A safety layer prevented picks when confidence was below threshold, routing those items to a human picker.
Result: Novel item pick success rate improved from 85% to 96%. The 2-week retraining lag was eliminated because the VLM generalized to new products from their text descriptions. The safety threshold kept the failure rate for attempted picks below 1%.
Lesson: Grounding LLM reasoning in physical affordances (what the robot can actually do) bridges the gap between semantic understanding and reliable physical manipulation.
- SayCan grounds LLM plans in robot capabilities by combining semantic understanding with physical affordance scoring.
- RT-2 achieves end-to-end vision-language-action reasoning, generalizing to concepts not seen during robot training.
- VLA models (OpenVLA, pi0, GR00T N1) have rapidly advanced, with open-source 7B models matching or exceeding proprietary 55B systems, and flow-matching action heads enabling continuous, dexterous control.
- Cross-embodiment transfer via the Open X-Embodiment dataset and generalist policies like Octo mirrors the pretrain-then-fine-tune paradigm from NLP.
- Sim-to-real is being transformed by LLM-generated reward functions (Eureka) and LLM-guided synthetic data generation, enabling hundreds of thousands of training trajectories in hours rather than months.
- Physical safety requires defense in depth: affordance checking, formal temporal logic constraints (RoboGuard), and independent verification layers beyond the LLM's own judgment.
- Web automation agents use observe-think-act loops with DOM and screenshot observations to navigate and interact with websites.
- OS-level agents (Computer Use) extend automation beyond browsers to the full desktop, enabling cross-application workflows.
- AI for mathematics combines LLM informal reasoning with formal verification, achieving breakthrough results on competition-level problems.
- Scientific discovery benefits from LLM literature mining, hypothesis generation, and experiment design, with human researchers validating and testing AI-generated ideas.
World models for embodied AI are a major research direction. Systems like Google DeepMind's Genie 2 and Meta's V-JEPA learn predictive world models that enable robots to simulate the consequences of actions before executing them physically.
In scientific discovery, AI systems are beginning to close the loop: generating hypotheses, designing experiments, running them through automated labs, and analyzing results to refine hypotheses iteratively. The convergence of LLM reasoning with robotic manipulation, automated labs, and formal verification systems points toward increasingly autonomous scientific and engineering agents.
Exercises
Explain how an LLM can serve as a high-level planner for a robot. What is the role of the LLM versus the low-level controller, and what are the challenges of grounding language in physical actions?
Answer Sketch
The LLM takes a natural language instruction (e.g., 'make a sandwich') and decomposes it into a sequence of primitive actions (open fridge, pick up bread, etc.). The low-level controller translates each primitive into motor commands. Challenges: the LLM may generate physically impossible actions, it does not understand the robot's actual capabilities, and language is ambiguous (e.g., 'put it there' requires spatial grounding). Solutions: affordance functions that filter LLM suggestions based on what is physically possible.
Describe the architecture of a Vision-Language-Action (VLA) model like RT-2 or pi0. How does it differ from using separate vision, language, and action models?
Answer Sketch
A VLA model processes visual observations, language instructions, and action history in a single transformer. Actions are tokenized and predicted as part of the same token sequence as language. This differs from separate models because cross-modal interactions happen at every layer, allowing the model to ground language in visual observations and produce actions that are spatially and temporally coherent. Separate models require explicit translation between representations at each handoff.
Explain the sim-to-real gap in robotics and how LLMs are being used to help bridge it. What role can LLMs play in generating simulation scenarios?
Answer Sketch
The sim-to-real gap: policies trained in simulation often fail in the real world due to differences in physics, visual appearance, and sensor noise. LLMs help by: (1) generating diverse task descriptions that drive simulation scenario creation, (2) producing reward functions from natural language specifications, and (3) creating domain randomization parameters. This increases the diversity of training scenarios, making policies more robust to real-world variations.
Write a Python function that takes a research topic, searches arXiv for relevant papers, extracts key findings from each abstract, and produces a structured literature review.
Answer Sketch
Use the arXiv API to search for papers by topic. For each result, extract: title, authors, abstract, date. Send each abstract to an LLM to extract: main contribution, methodology, key results, and limitations. Group papers by theme using LLM-based clustering. Produce a structured review with sections: background, methods, findings, open questions. Include proper citations for every claim.
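A starting-point sketch for this exercise is shown below. The arXiv API returns an Atom feed from `http://export.arxiv.org/api/query`; here the query-building and extraction steps are separated so the parsing can be tested offline on a sample feed (the live fetch and the LLM summarization steps are left to the reader).

```python
# Sketch for the arXiv literature-review exercise: build a query URL and
# parse the Atom response with the standard library.
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_query_url(topic, max_results=10):
    params = urllib.parse.urlencode({
        "search_query": f"all:{topic}",
        "max_results": max_results,
    })
    return f"http://export.arxiv.org/api/query?{params}"

def parse_feed(xml_text):
    """Extract title and abstract from each Atom <entry>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": entry.findtext(f"{ATOM}title", "").strip(),
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
        }
        for entry in root.iter(f"{ATOM}entry")
    ]

# Offline sample feed so the parser can be checked without network access.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>OpenVLA</title><summary>An open VLA model.</summary></entry>
</feed>"""
print(parse_feed(sample))
```

Each parsed abstract would then be sent to an LLM to extract contribution, methodology, results, and limitations, as the answer sketch describes.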
Discuss the current capabilities and limitations of AI systems for mathematical reasoning. Can LLMs prove theorems, and if so, how do their approaches differ from human mathematicians?
Answer Sketch
Current capabilities: LLMs can solve competition-level math problems (IMO 2024: AlphaProof solved 4 of 6 problems), generate and verify proofs in formal languages (Lean, Isabelle), and discover new patterns. Limitations: LLMs still struggle with multi-step logical reasoning, can produce plausible but incorrect proofs, and lack the creative intuition that guides human mathematicians toward interesting conjectures. The most effective approaches combine LLMs with formal verification systems that guarantee correctness.
Lab: Speech-to-Text Pipeline with LLM Summarization
Objective
Build an end-to-end audio processing pipeline that transcribes speech using OpenAI Whisper, then chains the transcript into an LLM for summarization. You will first implement a manual transcription-to-summary pipeline (the "right tool" approach), then streamline it with the transformers pipeline API.
What You'll Practice
- Loading and preprocessing audio files for Whisper transcription
- Running speech-to-text inference with the Whisper model
- Chaining transcription output into an LLM summarization step
- Comparing manual pipeline construction with the transformers pipeline shortcut
Setup
Install the required packages. Whisper runs on CPU but benefits greatly from GPU acceleration.
pip install openai-whisper transformers torch torchaudio
Steps
Step 1: Prepare an audio sample
Load a sample audio file. You can use any WAV or MP3 file, or generate a short test clip with text-to-speech.
import torch
import torchaudio
import urllib.request
import os
# Download a sample audio clip (LibriSpeech test sample)
audio_url = "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav"
audio_path = "sample_speech.wav"
if not os.path.exists(audio_path):
    urllib.request.urlretrieve(audio_url, audio_path)
waveform, sample_rate = torchaudio.load(audio_path)
duration = waveform.shape[1] / sample_rate
print(f"Audio loaded: {duration:.1f}s at {sample_rate} Hz")
print(f"Channels: {waveform.shape[0]}, Samples: {waveform.shape[1]}")
Step 2: Transcribe with Whisper (from-scratch approach)
Load the Whisper model directly and run transcription. This manual approach gives you full control over model size, language detection, and decoding parameters.
import whisper
# Load a small model (choose "base", "small", "medium", or "large")
model = whisper.load_model("base")
# Transcribe the audio file
result = model.transcribe(audio_path, fp16=torch.cuda.is_available())
transcript = result["text"].strip()
language = result.get("language", "unknown")
print(f"Detected language: {language}")
print(f"Transcript ({len(transcript)} chars):")
print(f" {transcript[:500]}")
# Inspect segment-level timestamps
for seg in result["segments"][:5]:
    start, end = seg["start"], seg["end"]
    print(f" [{start:.1f}s - {end:.1f}s] {seg['text'].strip()}")
Hint
The base model is fast but less accurate. For production use, small or medium offer a better accuracy-to-speed tradeoff. The fp16 flag enables half-precision on GPU for faster inference.
Step 3: Summarize the transcript with an LLM
Chain the Whisper output into a text summarization model. This demonstrates the common pattern of using a specialized model (Whisper) for perception, then an LLM for reasoning.
from transformers import pipeline
# Load a summarization model
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1,
)
# Summarize the transcript
# BART-CNN has a 1024-token limit; truncate if needed
max_chars = 3000
input_text = transcript[:max_chars]
summary = summarizer(
    input_text,
    max_length=150,
    min_length=30,
    do_sample=False,
)
print("Original transcript length:", len(transcript), "chars")
print("Summary:")
print(f" {summary[0]['summary_text']}")
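The truncation above simply drops everything past 3,000 characters. A common alternative for long transcripts is a map-reduce pattern: split the text into overlapping chunks, summarize each, then summarize the concatenated chunk summaries. A minimal chunking helper (the summarizer calls, commented out below, reuse the `summarizer` pipeline loaded earlier in this step):

```python
def chunk_text(text: str, max_chars: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, breaking at word boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so we never split a word
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks

# Map step: summarize each chunk; reduce step: summarize the summaries.
# chunk_summaries = [summarizer(c, max_length=80, min_length=20,
#                               do_sample=False)[0]["summary_text"]
#                    for c in chunk_text(transcript)]
# final = summarizer(" ".join(chunk_summaries), max_length=150,
#                    min_length=30, do_sample=False)[0]["summary_text"]
```

The overlap preserves context across chunk boundaries so that a sentence split near a boundary still appears whole in at least one chunk.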
Step 4: Streamline with the transformers pipeline
Now use the transformers automatic speech recognition pipeline as a shortcut, comparing it with the manual Whisper approach above.
from transformers import pipeline
# One-liner ASR pipeline using Whisper
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    device=0 if torch.cuda.is_available() else -1,
)
# Transcribe with the pipeline API
asr_result = asr_pipeline(audio_path)
pipeline_transcript = asr_result["text"].strip()
# Compare outputs
print("Manual Whisper transcript:")
print(f" {transcript[:200]}...")
print()
print("Pipeline transcript:")
print(f" {pipeline_transcript[:200]}...")
print()
# Check similarity
from difflib import SequenceMatcher
similarity = SequenceMatcher(
    None, transcript.lower(), pipeline_transcript.lower()
).ratio()
print(f"Transcript similarity: {similarity:.2%}")
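SequenceMatcher gives a character-level similarity, but ASR systems are conventionally compared with word error rate (WER): the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal implementation you could apply to the two transcripts above:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

# One substitution ("the" -> "a") out of six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is an error rate rather than a bounded similarity score.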
Extensions
- Add speaker diarization using pyannote-audio to identify who is speaking in multi-speaker recordings.
- Replace the BART summarizer with an instruction-tuned LLM (e.g., Flan-T5) and compare summary quality using ROUGE scores.
- Build a real-time streaming version that transcribes audio chunks as they arrive, using Whisper's segment-based decoding.
What Comes Next
In the next chapter, Chapter 29: Evaluation, Experiment Design & Observability, we turn to the practices that ensure LLM applications work reliably in production.
Bibliography
Ahn, M., Brohan, A., Brown, N., et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan)." arXiv:2204.01691
Brohan, A., Brown, N., Carbajal, J., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818
Kim, M.J., Pertsch, K., Karamcheti, S., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246
Black, K., Brown, N., Driess, D., et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." Physical Intelligence Technical Report
NVIDIA. (2025). "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." NVIDIA Developer Blog
Open X-Embodiment Collaboration. (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864
Octo Model Team. (2024). "Octo: An Open-Source Generalist Robot Policy." arXiv:2405.12213
Ma, Y.J., Liang, W., Wang, G., et al. (2023). "Eureka: Human-Level Reward Design via Coding Large Language Models." arXiv:2310.12931
Chi, C., Feng, S., Du, Y., et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." arXiv:2303.04137
Tong, A., Zheng, Z., Bauer, J., et al. (2024). "RoboGuard: A Guardrail Architecture for Safety-Critical Robotic Systems." arXiv:2403.02847
Zheng, S., Feng, X., Liao, Z., et al. (2024). "GPT-4V(ision) is a Generalist Web Agent, if Grounded." arXiv:2401.01614
Trinh, T.H. & Le, Q.V. (2024). "Solving Olympiad Geometry without Human Demonstrations (AlphaGeometry)." Nature, 625, 476-482
Si, C., Yang, D., & Hashimoto, T. (2024). "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers." arXiv:2409.04109
