"I can reason about the world with words. Now give me a body, and I will reason about it with actions."
Deploy, Freshly Embodied AI Agent
LLMs are extending from digital text into the physical world and the frontiers of science. In robotics, LLMs serve as high-level planners that translate natural language instructions into sequences of robot actions. In web and OS automation, they operate as agents that navigate interfaces, fill forms, and complete tasks on behalf of users. In scientific discovery, they mine literature, generate hypotheses, prove theorems, and design experiments. These applications represent the cutting edge of what LLMs can do when connected to real-world actuators and scientific knowledge. The tool use and agent planning patterns from Section 22.2 and Section 22.3 are the core enabling capabilities.
Prerequisites
This section requires familiarity with the application patterns from Section 28.1 through Section 28.6. Understanding evaluation basics from Section 29.1 will help with assessing the production readiness of LLM applications.
1. LLMs as Robot Planners
The key insight behind using LLMs for robotics is that language models possess extensive world knowledge about objects, their properties, and how they relate to each other.
When Google first tested SayCan, the robot correctly understood "I spilled my drink, can you help?" and fetched a sponge. It also tried to "clean up" a coworker's lunch. World knowledge, it turns out, needs better boundaries.
A human saying "make me a sandwich" implies a sequence of actions (get bread, get ingredients, assemble, plate) that an LLM can decompose into steps. The challenge is grounding these steps in the robot's actual physical capabilities and environment.
SayCan: Grounding Language in Robot Actions
Google's SayCan combines an LLM's knowledge of what makes sense to do with a robot's learned affordances (what it can physically do). The LLM proposes candidate next actions, and a value function scores each action based on whether the robot can actually execute it in the current state. This product of "what should I do" (LLM) and "what can I do" (affordance model) produces grounded action plans that are both semantically correct and physically feasible. Figure 28.7.1 shows the SayCan architecture.
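The product of the two scores can be illustrated with a minimal sketch. This is not the SayCan implementation, only the selection rule it describes: the candidate scores here are made-up numbers standing in for LLM token likelihoods and a learned value function.

```python
# Illustrative sketch of SayCan-style action selection (not the original
# implementation): combine an LLM's task-relevance score with a learned
# affordance score by taking their product.

def select_action(candidates, llm_score, affordance_score):
    """Pick the candidate maximizing P(useful | instruction) * P(feasible | state)."""
    scored = {a: llm_score(a) * affordance_score(a) for a in candidates}
    return max(scored, key=scored.get)

# Toy scores for "I spilled my drink, can you help?": picking up the sponge
# is both semantically sensible AND physically feasible, so it wins.
candidates = ["pick up the sponge", "pick up the drink", "go to the table"]
llm = {"pick up the sponge": 0.7, "pick up the drink": 0.2, "go to the table": 0.1}
aff = {"pick up the sponge": 0.9, "pick up the drink": 0.1, "go to the table": 0.8}

best = select_action(candidates, llm.get, aff.get)
print(best)  # pick up the sponge
```

Note that "go to the table" has a high affordance score (it is easy to do) but a low semantic score, while "pick up the drink" is semantically plausible but infeasible; only the product surfaces the grounded choice.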
RT-2: Vision-Language-Action Models
Google's RT-2 (Robotics Transformer 2) takes grounding further by training a single vision-language model that directly outputs robot actions. The model processes camera images and language instructions and outputs discretized action tokens (arm positions, gripper states). By co-training on both internet-scale vision-language data and robot demonstration data, RT-2 acquires emergent reasoning capabilities: it can follow instructions involving concepts never seen during robot training (like "move the object to the picture of a country" by recognizing flags). Code Fragment 28.7.2 below puts this into practice.
# Conceptual: LLM as robot task planner
from openai import OpenAI
import json

client = OpenAI()

def plan_robot_actions(
    instruction: str,
    available_actions: list,
    scene_description: str,
) -> list:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a robot task planner.
Available primitive actions: {json.dumps(available_actions)}
Current scene: {scene_description}
Decompose the instruction into a sequence of available actions.
Return a JSON object with a 'steps' array; each step has
'action' and 'target' keys."""},
            {"role": "user", "content": instruction},
        ],
        # JSON mode guarantees a JSON *object*, so we ask for the array
        # under a 'steps' key rather than a bare top-level array.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

plan = plan_robot_actions(
    instruction="Put the dirty dishes in the dishwasher",
    available_actions=["pick", "place", "open", "close", "navigate"],
    scene_description="Kitchen counter with 3 plates and 2 cups. Dishwasher closed.",
)
print(json.dumps(plan, indent=2))
2. Vision-Language-Action Models: The VLA Revolution
The progression from SayCan to modern VLA models traces a clear arc toward tighter integration between language understanding and physical control. SayCan kept the LLM and the robot's affordance model as separate modules, with the LLM proposing actions and a learned value function filtering them. RT-2 collapsed this into a single vision-language model that outputs discretized action tokens directly. But the field has moved considerably further since then.
When experimenting with VLA models for robotics, always start in simulation (Isaac Sim, MuJoCo) before going to real hardware. A single bug in an action prediction can physically damage a robot arm or its environment. Build a sim-to-real validation pipeline: if the model's predicted trajectory diverges more than a threshold from the sim-validated trajectory, halt execution and log the discrepancy for review.
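The halt-on-divergence check described above can be sketched in a few lines. The 5 cm threshold and the waypoint-list trajectory format are illustrative assumptions, not values from any particular system:

```python
# Illustrative sim-to-real guard (assumed threshold and trajectory format):
# halt execution when the model's predicted trajectory diverges from the
# sim-validated one by more than a per-waypoint distance threshold.
import math

def max_divergence(predicted, validated):
    """Largest Euclidean distance between corresponding 3D waypoints."""
    return max(math.dist(p, v) for p, v in zip(predicted, validated))

def safe_to_execute(predicted, validated, threshold_m=0.05):
    d = max_divergence(predicted, validated)
    if d > threshold_m:
        # In a real pipeline this would also log the discrepancy for review.
        print(f"HALT: divergence {d:.3f} m exceeds {threshold_m} m")
        return False
    return True

pred = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.2, 0.0, 0.4)]
sim  = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.21), (0.2, 0.0, 0.55)]
print(safe_to_execute(pred, sim))  # False: last waypoint diverges by 0.15 m
```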
OpenVLA: Open-Source VLA at Scale
OpenVLA (2024) demonstrated that a 7-billion-parameter open-source VLA can outperform the 55-billion-parameter RT-2-X on generalized robot manipulation benchmarks. Built on a Llama 2 backbone with a SigLIP vision encoder, OpenVLA was trained on the Open X-Embodiment dataset (described below) and released with fully open weights. This is significant because it shows that careful data curation and architecture design can compensate for raw parameter count, a pattern familiar from the efficient model discussions in Section 16.1.
pi0 and pi0.5: Flow Matching for Continuous Control
Physical Intelligence's pi0 (2024) and its successor pi0.5 (2025) take a fundamentally different approach to action generation. Instead of discretizing robot actions into tokens (as RT-2 and OpenVLA do), pi0 uses flow matching to produce continuous action trajectories. Flow matching learns a velocity field that transports a noise distribution to the target action distribution along straight paths, similar to how flow-matching image generators work (see Section 27.1). This enables smoother, more precise motor control, particularly for dexterous manipulation tasks like folding laundry or assembling objects. pi0.5 extends this with a hierarchical architecture: a high-level VLM plans task steps in language, while a low-level flow-matching policy generates the actual motor commands.
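The inference side of flow matching can be made concrete with a toy one-dimensional example. This is a deliberately simplified sketch, not pi0's model: the "learned" velocity field is replaced by the closed-form straight-line field toward a fixed target, and we integrate it from noise with Euler steps.

```python
# Toy sketch of flow-matching inference: given a velocity field v(x, t),
# integrate from Gaussian noise at t=0 to an action sample at t=1.
import random

def velocity_field(x, t, target=0.8):
    # Stand-in for a learned network. For a straight-line flow toward a
    # fixed target, the velocity along the path is (target - x) / (1 - t).
    return (target - x) / (1.0 - t) if t < 1.0 else 0.0

def sample_action(n_steps=100):
    x = random.gauss(0.0, 1.0)           # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x += velocity_field(x, t) * dt   # Euler step along the flow
    return x

random.seed(0)
print(round(sample_action(), 3))  # 0.8 — any noise sample is transported to the target
```

In a real policy the target is not fixed: the velocity field is a network conditioned on the image, instruction, and time, and `x` is a full multi-joint action chunk rather than a scalar.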
GR00T N1: Open Humanoid Foundation Model
NVIDIA's GR00T N1 (2025) is an open foundation model specifically designed for humanoid robots. It combines a vision-language backbone for scene understanding with a diffusion-based action head for whole-body control, including locomotion, manipulation, and coordination of dozens of joints simultaneously. GR00T N1 was trained in NVIDIA's Isaac Sim environment and supports zero-shot transfer to multiple humanoid platforms. Its release as an open model follows the pattern established by OpenVLA: making robot foundation models accessible to the broader research community rather than keeping them behind proprietary walls.
VLA Model Comparison
| Model | Architecture | Action Representation | Training Data | Open Source |
|---|---|---|---|---|
| SayCan | Separate LLM + affordance model | Discrete primitives (pick, place) | Robot demos + LLM pretraining | No |
| RT-2 | End-to-end VLM with action tokens | Discretized joint positions (256 bins) | Web data + robot episodes | No |
| OpenVLA | Llama 2 + SigLIP (7B params) | Discretized action tokens | Open X-Embodiment (970K episodes) | Yes |
| pi0 / pi0.5 | VLM + flow-matching action head | Continuous trajectories via flow matching | Proprietary multi-task demonstrations | No |
| GR00T N1 | Vision-language + diffusion action head | Continuous whole-body joint commands | Isaac Sim + real demonstrations | Yes |
3. Foundation Models and Cross-Embodiment Transfer
A central challenge in robot learning is that data is scarce and expensive to collect. Every robot lab has its own hardware, environment, and task set. The Open X-Embodiment project (a collaboration of 33 research labs) addresses this by pooling over one million robot episodes from 22 different robot types into a single dataset. The motivation is directly analogous to LLM pretraining: just as GPT-3 benefits from training on diverse web text, a robot foundation model benefits from training on diverse robot experiences. The same principles of scale and diversity that drive language model performance (see Section 06.1) apply to robot data.
Generalist Policies: Octo and CrossFormer
Octo (2024) is a generalist robot policy trained on the Open X-Embodiment dataset. It uses a transformer architecture that processes both visual observations and language instructions, outputting actions for any robot represented in the training set. Crucially, Octo is designed for fine-tuning: you can take the pretrained policy and adapt it to a new robot with as few as 50 demonstrations. This mirrors the pretrain-then-fine-tune paradigm that dominates NLP (see Section 14.1 for fine-tuning fundamentals and Section 15.1 for parameter-efficient approaches).
CrossFormer extends this idea by using a modular architecture where robot-specific tokenizers handle the differences in observation and action spaces across embodiments, while a shared transformer backbone captures the common structure of manipulation tasks. This separation of embodiment-specific and embodiment-general components is conceptually similar to adapter-based fine-tuning in NLP, where a shared backbone is combined with small task-specific modules (as described in Section 15.2).
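The separation of embodiment-specific tokenizers from a shared backbone can be sketched as follows. The interfaces and observation formats here are illustrative assumptions, not the CrossFormer codebase:

```python
# Illustrative sketch: embodiment-specific tokenizers map heterogeneous
# observation spaces into a shared token space consumed by one backbone.

class ArmTokenizer:
    def encode(self, obs):            # 7-DoF arm: joint angles
        return [("joint", q) for q in obs["joints"]]

class QuadrupedTokenizer:
    def encode(self, obs):            # quadruped: 12 joints + gravity vector
        return ([("joint", q) for q in obs["joints"]]
                + [("gravity", g) for g in obs["gravity"]])

class SharedBackbone:
    """Stand-in for the shared transformer: here it just counts tokens."""
    def forward(self, tokens):
        return f"processing {len(tokens)} tokens"

TOKENIZERS = {"arm": ArmTokenizer(), "quadruped": QuadrupedTokenizer()}
backbone = SharedBackbone()

def policy(embodiment, obs):
    tokens = TOKENIZERS[embodiment].encode(obs)   # embodiment-specific
    return backbone.forward(tokens)               # embodiment-general

print(policy("arm", {"joints": [0.1] * 7}))
print(policy("quadruped", {"joints": [0.0] * 12, "gravity": [0, 0, -1]}))
```

Adding a new robot means writing only a new tokenizer entry; the shared backbone, and everything it learned about manipulation, is reused unchanged.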
The analogy between robot foundation models and LLM pretraining runs deep. Both rely on large, diverse datasets to build general capabilities. Both use a pretrain-then-fine-tune workflow. Both benefit from scale, with larger models and more data yielding better generalization. The key difference is that robot data involves physical actions in 3D space rather than token sequences, which makes collection orders of magnitude more expensive per sample.
4. Sim-to-Real: LLMs in the Training Loop
Collecting real-world robot data is slow, expensive, and potentially dangerous. Simulation offers a way to generate virtually unlimited training data, but bridging the gap between simulated and real-world performance (the "sim-to-real" problem) has historically been a major bottleneck. LLMs are now being used to address multiple aspects of this challenge.
Eureka and DrEureka: LLM-Generated Reward Functions
Reward design is one of the hardest parts of robot reinforcement learning. Human experts spend weeks crafting reward functions for each new task, and small errors in the reward can lead to unexpected or unsafe behaviors. NVIDIA's Eureka (2023) uses an LLM to automatically generate reward functions from task descriptions. Given a text description like "make the robot hand rotate a pen," Eureka prompts GPT-4 to write Python reward functions, evaluates them in simulation, and iteratively refines the best candidates using the LLM's own analysis of the training curves. Eureka-generated rewards outperformed expert human-designed rewards on 83% of tasks tested, including achieving a new state-of-the-art for dexterous pen spinning. DrEureka extends this approach to handle domain randomization parameters, automatically tuning the simulation's physics to improve real-world transfer.
Eureka reveals something profound about LLMs: they can design reward functions that outperform human experts because they can iterate at machine speed. A human reward engineer might try 5 to 10 variations over a week, relying on intuition about what will produce the desired behavior. Eureka generates hundreds of candidates, evaluates them in simulation, and uses the training curves as feedback to refine the next batch. This is a meta-level application of the same generate-evaluate-refine loop that powers agentic reasoning (Section 22.3), but applied to the reward design process itself rather than to a single task.
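The generate-evaluate-refine loop has a simple schematic form. The helper callables below are stand-ins: in the real system, `generate_rewards` prompts GPT-4 for Python reward code, `evaluate` trains an RL policy in simulation, and `refine_prompt` summarizes the training curves into LLM feedback.

```python
# Schematic of an Eureka-style reward search loop (assumed helper
# functions, injected so the loop itself is testable).

def eureka_loop(task_description, generate_rewards, evaluate, refine_prompt,
                n_iterations=3, batch_size=4):
    best_reward_fn, best_score = None, float("-inf")
    feedback = ""
    for _ in range(n_iterations):
        # 1. Generate: ask the LLM for a batch of candidate reward functions.
        candidates = generate_rewards(task_description, feedback, batch_size)
        # 2. Evaluate: train a policy under each candidate, record task success.
        scores = [evaluate(fn) for fn in candidates]
        top = max(range(len(candidates)), key=lambda i: scores[i])
        if scores[top] > best_score:
            best_reward_fn, best_score = candidates[top], scores[top]
        # 3. Refine: turn the best candidate's results into next-round feedback.
        feedback = refine_prompt(candidates[top], scores[top])
    return best_reward_fn, best_score
```

The machine-speed advantage lives in step 2: because evaluation is simulated, `batch_size` can be in the hundreds, where a human engineer manages a handful of iterations per week.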
Synthetic Training Data at Scale
NVIDIA's combination of Isaac Sim (a GPU-accelerated physics simulator) with Cosmos (a world foundation model) enables LLM-guided generation of massive synthetic training datasets. By using language prompts to specify scenarios ("a robot arm picking up a red cup from a cluttered desk"), the system can generate 780,000 training trajectories in just 11 hours. This approach directly applies the synthetic data generation principles from Chapter 13 to the robotics domain, where the stakes of data scarcity are even higher because real-world data collection involves physical hardware, safety constraints, and wall-clock time.
World Models: Text to Interactive 3D Environments
Genie 3 and similar world models take simulation generation one step further by creating interactive 3D environments from text or image prompts. Rather than rendering predefined simulation scenes, these models generate entirely new environments that a robot policy can interact with, providing a training distribution that is broader and more diverse than any hand-designed simulator. This is analogous to how LLMs generate diverse text for training data (see Section 13.2), but extended to 3D physical scenes with interactive dynamics.
5. Safety and Grounding in Physical Systems
When an LLM hallucinates in a text conversation, the consequence is a wrong answer. When an LLM hallucinates while controlling a robot, the consequence can be a broken object, a damaged machine, or an injured person. This fundamental difference makes safety not just important but essential in embodied AI systems. The safety principles from Chapter 32 apply with amplified urgency in physical settings.
The Symbol Grounding Problem
To understand why embodied AI is fundamentally harder than text-based AI, consider what happens when you tell a robot to "gently place the egg on the counter." The word "gently" carries enormous physical specificity that is completely absent from its token representation. For a human, "gently" activates a lifetime of sensorimotor experience: the feeling of fragile objects, the muscle memory of controlled deceleration, the visual feedback of watching something land softly. For an LLM, "gently" is a statistical relationship to other tokens. Bridging this gap, translating linguistic concepts into precise physical parameters, is the central challenge of embodied AI and the reason why robot foundation models need physical demonstration data in addition to text pretraining.
LLMs operate on symbols (words, tokens), but the physical world operates on continuous quantities (forces, positions, velocities). The symbol grounding problem asks: how do we ensure that the LLM's internal representations of concepts like "fragile," "heavy," or "sharp" correspond to the actual physical properties that matter for safe robot operation? An LLM might know that eggs are fragile (from text), but translating "fragile" into the correct grip force requires a grounding mechanism that connects linguistic knowledge to physical parameters. Current VLA models learn this grounding implicitly from demonstration data, but failures still occur when encountering objects or situations outside the training distribution.
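A minimal grounding mechanism might look like the sketch below. The force values and label set are hypothetical; the point is the structure: linguistic labels index into a table distilled from physical demonstrations, and a hard clamp ensures an ungrounded guess can never exceed hardware limits.

```python
# Illustrative grounding sketch (assumed force values): map an LLM's
# linguistic property label to a grip-force parameter, clamped to safe limits.

# Hypothetical lookup distilled from demonstrations, not from text alone.
FRAGILITY_TO_FORCE_N = {"fragile": 5.0, "normal": 15.0, "rigid": 30.0}
MAX_SAFE_FORCE_N = 35.0

def grip_force(llm_label, llm_force_hint=None):
    """Ground a linguistic label in newtons, clamped to hardware limits."""
    force = FRAGILITY_TO_FORCE_N.get(llm_label, 15.0)   # default to "normal"
    if llm_force_hint is not None:
        # Trust the LLM's numeric hint only up to the labeled band.
        force = min(llm_force_hint, force)
    return min(force, MAX_SAFE_FORCE_N)

print(grip_force("fragile"))        # 5.0 — an egg gets a gentle grip
print(grip_force("rigid", 50.0))    # 30.0 — hint exceeds the band, clamped
```

Out-of-distribution failures show up here as the `.get` default: an object whose label the table has never seen falls back to a conservative middle value rather than an arbitrary one.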
RoboGuard: Formal Safety Constraints
RoboGuard (2024) addresses robot safety by combining LLM planning with formal temporal logic constraints. Instead of relying solely on the LLM's judgment about what is safe, RoboGuard translates safety requirements ("never move the arm above the table while carrying a glass of water," "always verify grip force before lifting") into signal temporal logic (STL) formulas. The robot's planned trajectory is then checked against these formal constraints before execution. If the plan violates any safety specification, it is rejected and the LLM is asked to re-plan. This approach provides mathematically verifiable safety guarantees rather than relying on the LLM's probabilistic understanding of safety, which can fail in subtle ways.
LLM hallucination in physical systems is not merely an accuracy problem; it is a safety problem. A robot that confidently executes a hallucinated plan can cause real damage. Always combine LLM-based planning with independent safety verification: affordance checking, force/torque limits, workspace boundaries, and formal constraint verification. The principle of defense in depth (multiple independent safety layers, described in Section 32.3) is especially critical when LLMs have physical agency.
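Defense in depth reduces to a simple invariant: a plan executes only if every independent layer passes. A minimal sketch, with assumed action sets, force limits, and workspace bounds:

```python
# Defense-in-depth sketch (assumed limits): any single failing layer
# rejects the LLM's plan before execution.

def affordance_ok(plan):
    return all(step["action"] in {"pick", "place", "open"} for step in plan)

def force_ok(plan):
    return all(step.get("force_n", 0) <= 35.0 for step in plan)

def workspace_ok(plan):
    return all(abs(step.get("x", 0)) <= 0.8 for step in plan)

SAFETY_LAYERS = [affordance_ok, force_ok, workspace_ok]

def verify_plan(plan):
    """Independent checks: execute only if every layer passes."""
    return all(layer(plan) for layer in SAFETY_LAYERS)

plan = [{"action": "pick", "force_n": 12.0, "x": 0.3},
        {"action": "place", "force_n": 8.0, "x": 1.2}]   # x out of bounds
print(verify_plan(plan))  # False — the workspace layer rejects the plan
```

Crucially, none of these layers consults the LLM: each check is computed from independent state, so a hallucinated plan cannot talk its way past them.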
6. Web Automation and Browser Agents
Web automation agents use LLMs to navigate websites, fill forms, click buttons, and complete tasks that normally require human interaction. These agents observe the page (through screenshots, accessibility trees, or DOM parsing), decide what action to take, execute it, and observe the result. This is the same agentic loop from Chapter 22, applied to browser environments. Code Fragment 28.7.5 below puts this into practice.
# Conceptual: web automation agent using browser tools
from openai import OpenAI
import json

client = OpenAI()

def web_agent_step(task: str, page_state: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a web automation agent.
Given the current page state, decide the next action to complete
the task. Actions: click(selector), type(selector, text),
navigate(url), scroll(direction), wait(), done(result).
Return JSON with 'thought', 'action', and 'params'."""},
            {"role": "user", "content": f"""Task: {task}
Page title: {page_state['title']}
Interactive elements: {json.dumps(page_state['elements'])}"""},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
OS-Level Agents and Computer Use
OS-level agents extend web automation to the entire desktop. Anthropic's Computer Use API, for example, lets Claude interact with a computer through screenshots and mouse/keyboard actions. The agent observes the screen, reasons about what it sees, and executes actions like clicking buttons, typing text, or switching between applications. This capability enables automation of tasks that span multiple applications (like copying data from a spreadsheet to an email) without requiring application-specific APIs.
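The observe-think-act loop these agents share can be sketched independently of any vendor API. The `observe`, `decide`, and `act` callables below are stand-ins: in a real system, `observe` takes a screenshot, `decide` calls a vision-capable model, and `act` drives the mouse and keyboard.

```python
# Schematic observe-think-act loop for an OS-level agent, with the
# environment interactions injected as callables so the control flow
# is testable without a desktop.

def run_agent(task, observe, decide, act, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = observe()                      # e.g., screenshot of current state
        step = decide(task, screen, history)    # model picks the next action
        if step["action"] == "done":
            return step.get("result")
        act(step)                               # e.g., click / type / scroll
        history.append(step)
    return None   # step budget exhausted without completing the task
```

The `max_steps` budget matters in practice: agents that misread the screen can loop indefinitely, so production deployments bound both step count and wall-clock time.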
| Agent Type | Environment | Observation | Actions | Example |
|---|---|---|---|---|
| Robot planner | Physical world | Camera, sensors | Pick, place, navigate | SayCan, RT-2 |
| Web agent | Browser | DOM, screenshots | Click, type, navigate | WebArena, BrowserGym |
| OS agent | Desktop | Screenshots | Mouse, keyboard | Computer Use, OSWorld |
| Code agent | IDE / terminal | Files, outputs | Read, write, execute | Claude Code, Devin |
Evaluating agents that interact with real environments requires specialized benchmarks. WebArena tests web agents on realistic tasks (managing e-commerce sites, forums). OSWorld benchmarks OS-level agents on desktop tasks across operating systems. SQA (Situated Question Answering) tests robot understanding of physical environments. These benchmarks reveal that current agents succeed at simple, well-defined tasks but struggle with multi-step sequences, error recovery, and tasks requiring spatial reasoning or common sense about the physical world.
7. AI for Mathematics and Theorem Proving
LLMs are making significant inroads in mathematical reasoning and formal theorem proving. Google's AlphaProof and AlphaGeometry achieved silver-medal-equivalent performance at the 2024 International Mathematical Olympiad, solving 4 of 6 problems by combining LLMs for informal reasoning with formal verification systems for proof checking. These systems represent a new paradigm where AI augments mathematical discovery rather than just calculation. Code Fragment 28.7.3 below puts this into practice.
# Using an LLM for mathematical reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a mathematical reasoning assistant.
Work through problems step by step. Show all reasoning.
When uncertain, explore multiple approaches. Verify your
answer by checking boundary cases and special values."""},
        {"role": "user", "content": """Prove that for any positive integer n,
the sum 1 + 2 + ... + n = n(n+1)/2.
Use mathematical induction."""},
    ],
)
print(response.choices[0].message.content)
8. Scientific Literature Mining and Hypothesis Generation
The scientific literature grows by millions of papers per year, making it impossible for any researcher to stay current even in a narrow field. LLMs can mine this literature to identify connections between findings, generate novel hypotheses, and suggest experimental designs. Systems like Semantic Scholar's AI-powered features and specialized scientific LLMs (Galactica, SciBERT) demonstrate how language models can accelerate the scientific discovery process. Figure 28.7.2 illustrates the AI-assisted scientific discovery pipeline.
The common thread across robotics, web automation, and scientific discovery is the LLM's role as a reasoning and planning layer that sits above domain-specific execution systems. In robotics, the LLM plans while robot controllers execute. In web automation, the LLM decides while browser APIs act. In science, the LLM hypothesizes while experiments validate. This separation of reasoning (LLM strength) from execution (domain-specific strength) is a powerful architectural pattern that enables LLMs to extend into virtually any domain with appropriate grounding.
When LLMs control physical systems (robots, industrial equipment) or have broad computer access (OS agents), the consequences of errors become physical and potentially dangerous. A misplanned robot action can break objects or injure people. An OS agent with unrestricted access can delete files or send unauthorized messages. Robust safety measures are essential: action validation before execution, restricted action spaces, human approval for irreversible actions, sandbox environments for testing, and continuous monitoring of agent behavior.
Who: Quality assurance team at a SaaS company with a complex web application
Situation: The application had 400+ pages and 2,000+ interactive elements. The QA team maintained 1,200 Selenium tests, but tests broke frequently after UI changes, and writing tests for new features took 2 to 3 days per feature.
Problem: Maintaining brittle CSS-selector-based tests consumed 40% of QA engineering time. Tests that relied on specific element IDs broke whenever the frontend framework was updated.
Decision: The team supplemented traditional tests with LLM-powered browser agents using Playwright and Claude's computer use capability. Instead of CSS selectors, agents navigated by visual understanding and natural language descriptions of UI elements.
How: Test cases were written as natural language scenarios: "Navigate to the billing page, change the payment method to a Visa ending in 4242, confirm the change, and verify the success message appears." The agent took screenshots at each step, used vision capabilities to identify UI elements, and executed actions through Playwright. A verification step compared the final state against expected outcomes described in natural language.
Result: Test maintenance time dropped by 70% because natural language test descriptions survived UI redesigns. New feature test coverage improved from 60% to 90% because writing natural language test cases was faster than coding Selenium scripts. The approach also caught visual regressions (misaligned elements, color changes) that traditional tests missed entirely.
Lesson: LLM-powered browser agents are most valuable when UI changes frequently. Natural language test specifications are more resilient than CSS selectors, and vision capabilities enable testing visual properties that traditional automation frameworks cannot assess.
Tools for building browser and OS agents (2024/2025). The browser agent ecosystem has matured significantly. Playwright MCP (Model Context Protocol server) lets LLMs control browsers through a standardized tool interface. Browser Use (open source, Python) provides a high-level agent framework that wraps Playwright with screenshot-based navigation and automatic element detection. Stagehand (by Browserbase) combines DOM parsing with vision models for more reliable element identification. For OS-level agents, Anthropic's computer use API enables Claude to take screenshots, move the mouse, click, and type, operating any desktop application. Apple's on-device models power Apple Intelligence features that understand app context and perform cross-app actions. For production deployments, always run browser agents in isolated containers (Docker, cloud VMs) to prevent unintended access to sensitive systems, and implement action allowlists that restrict which URLs the agent can visit and which form fields it can fill.
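The action allowlist mentioned above can be as small as a single gate in front of every navigation action. The domains here are hypothetical placeholders:

```python
# Minimal URL allowlist sketch (assumed domains): check every navigation
# action before the browser agent executes it.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"app.example.com", "docs.example.com"}   # hypothetical

def navigation_allowed(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

print(navigation_allowed("https://app.example.com/billing"))   # True
print(navigation_allowed("https://evil.example.net/phish"))    # False
print(navigation_allowed("http://app.example.com/"))           # False: not https
```

Parsing the URL (rather than substring matching) matters: a naive check like `"app.example.com" in url` would pass `https://evil.net/?q=app.example.com`.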
Who: Robotics engineering team at an e-commerce fulfillment center
Situation: The warehouse used robotic arms for picking items from bins, but the system failed on 15% of picks involving novel items not seen during training (new products added weekly).
Problem: Traditional vision-based grasping models required retraining whenever new product categories were introduced, creating a 2-week lag between product listing and automated fulfillment capability.
Dilemma: Retraining the grasping model weekly was expensive and risked degrading performance on existing categories. Manual picking for novel items was a bottleneck that negated the efficiency gains from automation.
Decision: The team integrated a VLM (vision-language model) into the picking pipeline using a SayCan-inspired approach: the VLM identified objects and suggested grasp strategies, while an affordance model scored which strategies were physically feasible for the robot.
How: The VLM received a camera image of the bin and a text description of the target item, then proposed grasp approaches (top, side, pinch). The affordance model filtered proposals based on the robot's physical capabilities and bin geometry. A safety layer prevented picks when confidence was below threshold, routing those items to a human picker.
Result: Novel item pick success rate improved from 85% to 96%. The 2-week retraining lag was eliminated because the VLM generalized to new products from their text descriptions. The safety threshold kept the failure rate for attempted picks below 1%.
Lesson: Grounding LLM reasoning in physical affordances (what the robot can actually do) bridges the gap between semantic understanding and reliable physical manipulation.
- SayCan grounds LLM plans in robot capabilities by combining semantic understanding with physical affordance scoring.
- RT-2 achieves end-to-end vision-language-action reasoning, generalizing to concepts not seen during robot training.
- VLA models (OpenVLA, pi0, GR00T N1) have rapidly advanced, with open-source 7B models matching or exceeding proprietary 55B systems, and flow-matching action heads enabling continuous, dexterous control.
- Cross-embodiment transfer via the Open X-Embodiment dataset and generalist policies like Octo mirrors the pretrain-then-fine-tune paradigm from NLP.
- Sim-to-real is being transformed by LLM-generated reward functions (Eureka) and LLM-guided synthetic data generation, enabling hundreds of thousands of training trajectories in hours rather than months.
- Physical safety requires defense in depth: affordance checking, formal temporal logic constraints (RoboGuard), and independent verification layers beyond the LLM's own judgment.
- Web automation agents use observe-think-act loops with DOM and screenshot observations to navigate and interact with websites.
- OS-level agents (Computer Use) extend automation beyond browsers to the full desktop, enabling cross-application workflows.
- AI for mathematics combines LLM informal reasoning with formal verification, achieving breakthrough results on competition-level problems.
- Scientific discovery benefits from LLM literature mining, hypothesis generation, and experiment design, with human researchers validating and testing AI-generated ideas.
World models for embodied AI are a major research direction. Systems like Google DeepMind's Genie 2 and Meta's V-JEPA learn predictive world models that enable robots to simulate the consequences of actions before executing them physically.
In scientific discovery, AI systems are beginning to close the loop: generating hypotheses, designing experiments, running them through automated labs, and analyzing results to refine hypotheses iteratively. The convergence of LLM reasoning with robotic manipulation, automated labs, and formal verification systems points toward increasingly autonomous scientific and engineering agents.
Exercises
Explain how an LLM can serve as a high-level planner for a robot. What is the role of the LLM versus the low-level controller, and what are the challenges of grounding language in physical actions?
Answer Sketch
The LLM takes a natural language instruction (e.g., 'make a sandwich') and decomposes it into a sequence of primitive actions (open fridge, pick up bread, etc.). The low-level controller translates each primitive into motor commands. Challenges: the LLM may generate physically impossible actions, it does not understand the robot's actual capabilities, and language is ambiguous (e.g., 'put it there' requires spatial grounding). Solutions: affordance functions that filter LLM suggestions based on what is physically possible.
Describe the architecture of a Vision-Language-Action (VLA) model like RT-2 or pi0. How does it differ from using separate vision, language, and action models?
Answer Sketch
A VLA model processes visual observations, language instructions, and action history in a single transformer. Actions are tokenized and predicted as part of the same token sequence as language. This differs from separate models because cross-modal interactions happen at every layer, allowing the model to ground language in visual observations and produce actions that are spatially and temporally coherent. Separate models require explicit translation between representations at each handoff.
Explain the sim-to-real gap in robotics and how LLMs are being used to help bridge it. What role can LLMs play in generating simulation scenarios?
Answer Sketch
The sim-to-real gap: policies trained in simulation often fail in the real world due to differences in physics, visual appearance, and sensor noise. LLMs help by: (1) generating diverse task descriptions that drive simulation scenario creation, (2) producing reward functions from natural language specifications, and (3) creating domain randomization parameters. This increases the diversity of training scenarios, making policies more robust to real-world variations.
Write a Python function that takes a research topic, searches arXiv for relevant papers, extracts key findings from each abstract, and produces a structured literature review.
Answer Sketch
Use the arXiv API to search for papers by topic. For each result, extract: title, authors, abstract, date. Send each abstract to an LLM to extract: main contribution, methodology, key results, and limitations. Group papers by theme using LLM-based clustering. Produce a structured review with sections: background, methods, findings, open questions. Include proper citations for every claim.
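A starting-point sketch for this exercise is shown below. The arXiv API returns an Atom feed from `http://export.arxiv.org/api/query`; here the query-building and extraction steps are separated so the parsing can be tested offline on a sample feed (the live fetch and the LLM summarization steps are left to the reader).

```python
# Sketch for the arXiv literature-review exercise: build a query URL and
# parse the Atom response with the standard library.
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_query_url(topic, max_results=10):
    params = urllib.parse.urlencode({
        "search_query": f"all:{topic}",
        "max_results": max_results,
    })
    return f"http://export.arxiv.org/api/query?{params}"

def parse_feed(xml_text):
    """Extract title and abstract from each Atom <entry>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": entry.findtext(f"{ATOM}title", "").strip(),
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
        }
        for entry in root.iter(f"{ATOM}entry")
    ]

# Offline sample feed so the parser can be checked without network access.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>OpenVLA</title><summary>An open VLA model.</summary></entry>
</feed>"""
print(parse_feed(sample))
```

Each parsed abstract would then be sent to an LLM to extract contribution, methodology, results, and limitations, as the answer sketch describes.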
Discuss the current capabilities and limitations of AI systems for mathematical reasoning. Can LLMs prove theorems, and if so, how do their approaches differ from human mathematicians?
Answer Sketch
Current capabilities: LLMs can solve competition-level math problems (IMO 2024: AlphaProof solved 4 of 6 problems), generate and verify proofs in formal languages (Lean, Isabelle), and discover new patterns. Limitations: LLMs still struggle with multi-step logical reasoning, can produce plausible but incorrect proofs, and lack the creative intuition that guides human mathematicians toward interesting conjectures. The most effective approaches combine LLMs with formal verification systems that guarantee correctness.
Lab: Speech-to-Text Pipeline with LLM Summarization
Objective
Build an end-to-end audio processing pipeline that transcribes speech using OpenAI Whisper, then chains the transcript into an LLM for summarization. You will first implement a manual transcription-to-summary pipeline (the "right tool" approach), then streamline it with the transformers pipeline API.
What You'll Practice
- Loading and preprocessing audio files for Whisper transcription
- Running speech-to-text inference with the Whisper model
- Chaining transcription output into an LLM summarization step
- Comparing manual pipeline construction with the transformers pipeline shortcut
Setup
Install the required packages. Whisper runs on CPU but benefits greatly from GPU acceleration.
pip install openai-whisper transformers torch torchaudio
Steps
Step 1: Prepare an audio sample
Load a sample audio file. You can use any WAV or MP3 file, or generate a short test clip with text-to-speech.
import torch
import torchaudio
import urllib.request
import os
# Download a sample audio clip (LibriSpeech test sample)
audio_url = "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav"
audio_path = "sample_speech.wav"
if not os.path.exists(audio_path):
    urllib.request.urlretrieve(audio_url, audio_path)
waveform, sample_rate = torchaudio.load(audio_path)
duration = waveform.shape[1] / sample_rate
print(f"Audio loaded: {duration:.1f}s at {sample_rate} Hz")
print(f"Channels: {waveform.shape[0]}, Samples: {waveform.shape[1]}")
Step 2: Transcribe with Whisper (from-scratch approach)
Load the Whisper model directly and run transcription. This manual approach gives you full control over model size, language detection, and decoding parameters.
import whisper
# Load a small model (choose "base", "small", "medium", or "large")
model = whisper.load_model("base")
# Transcribe the audio file
result = model.transcribe(audio_path, fp16=torch.cuda.is_available())
transcript = result["text"].strip()
language = result.get("language", "unknown")
print(f"Detected language: {language}")
print(f"Transcript ({len(transcript)} chars):")
print(f" {transcript[:500]}")
# Inspect segment-level timestamps
for seg in result["segments"][:5]:
    start, end = seg["start"], seg["end"]
    print(f" [{start:.1f}s - {end:.1f}s] {seg['text'].strip()}")
Hint
The base model is fast but less accurate. For production use, small or medium offer a better accuracy-to-speed tradeoff. The fp16 flag enables half-precision on GPU for faster inference.
Step 3: Summarize the transcript with an LLM
Chain the Whisper output into a text summarization model. This demonstrates the common pattern of using a specialized model (Whisper) for perception, then an LLM for reasoning.
from transformers import pipeline
# Load a summarization model
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1,
)
# Summarize the transcript
# BART-CNN has a 1024-token limit; truncate if needed
max_chars = 3000
input_text = transcript[:max_chars]
summary = summarizer(
    input_text,
    max_length=150,
    min_length=30,
    do_sample=False,
)
print("Original transcript length:", len(transcript), "chars")
print("Summary:")
print(f" {summary[0]['summary_text']}")
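The truncation above simply drops everything past 3,000 characters. A common alternative for long transcripts is a map-reduce pattern: split the text into overlapping chunks, summarize each, then summarize the concatenated chunk summaries. A minimal chunking helper (the summarizer calls, commented out below, reuse the `summarizer` pipeline loaded earlier in this step):

```python
def chunk_text(text: str, max_chars: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, breaking at word boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so we never split a word
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks

# Map step: summarize each chunk; reduce step: summarize the summaries.
# chunk_summaries = [summarizer(c, max_length=80, min_length=20,
#                               do_sample=False)[0]["summary_text"]
#                    for c in chunk_text(transcript)]
# final = summarizer(" ".join(chunk_summaries), max_length=150,
#                    min_length=30, do_sample=False)[0]["summary_text"]
```

The overlap preserves context across chunk boundaries so that a sentence split near a boundary still appears whole in at least one chunk.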
Step 4: Streamline with the transformers pipeline
Now use the transformers automatic speech recognition pipeline as a shortcut, comparing it with the manual Whisper approach above.
from transformers import pipeline
# One-liner ASR pipeline using Whisper
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    device=0 if torch.cuda.is_available() else -1,
)
# Transcribe with the pipeline API
asr_result = asr_pipeline(audio_path)
pipeline_transcript = asr_result["text"].strip()
# Compare outputs
print("Manual Whisper transcript:")
print(f" {transcript[:200]}...")
print()
print("Pipeline transcript:")
print(f" {pipeline_transcript[:200]}...")
print()
# Check similarity
from difflib import SequenceMatcher
similarity = SequenceMatcher(
    None, transcript.lower(), pipeline_transcript.lower()
).ratio()
print(f"Transcript similarity: {similarity:.2%}")
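SequenceMatcher gives a character-level similarity, but ASR systems are conventionally compared with word error rate (WER): the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal implementation you could apply to the two transcripts above:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

# One substitution ("the" -> "a") out of six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is an error rate rather than a bounded similarity score.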
Extensions
- Add speaker diarization using pyannote-audio to identify who is speaking in multi-speaker recordings.
- Replace the BART summarizer with an instruction-tuned LLM (e.g., Flan-T5) and compare summary quality using ROUGE scores.
- Build a real-time streaming version that transcribes audio chunks as they arrive, using Whisper's segment-based decoding.
What Comes Next
In the next chapter, Chapter 29: Evaluation, Experiment Design & Observability, we turn to the practices that ensure LLM applications work reliably in production.
Bibliography
Ahn, M., Brohan, A., Brown, N., et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan)." arXiv:2204.01691
Brohan, A., Brown, N., Carbajal, J., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818
Kim, M.J., Pertsch, K., Karamcheti, S., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246
Black, K., Brown, N., Driess, D., et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." Physical Intelligence Technical Report
NVIDIA. (2025). "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." NVIDIA Developer Blog
Open X-Embodiment Collaboration. (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864
Octo Model Team. (2024). "Octo: An Open-Source Generalist Robot Policy." arXiv:2405.12213
Ma, Y.J., Liang, W., Wang, G., et al. (2023). "Eureka: Human-Level Reward Design via Coding Large Language Models." arXiv:2310.12931
Chi, C., Feng, S., Du, Y., et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." arXiv:2303.04137
Tong, A., Zheng, Z., Bauer, J., et al. (2024). "RoboGuard: A Guardrail Architecture for Safety-Critical Robotic Systems." arXiv:2403.02847
Zheng, S., Feng, X., Liao, Z., et al. (2024). "GPT-4V(ision) is a Generalist Web Agent, if Grounded." arXiv:2401.01614
Trinh, T.H. & Le, Q.V. (2024). "Solving Olympiad Geometry without Human Demonstrations (AlphaGeometry)." Nature, 625, 476-482
Si, C., Yang, D., & Hashimoto, T. (2024). "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers." arXiv:2409.04109
