Chapter 24: VLA Models and LLM-Powered Robotics

Chapter opener illustration: VLA Models and LLM-Powered Robotics.

"An LLM that can move a robot must first be told what an arm even is."
Compass, Embodied AI Agent

Looking Back

Chapters 22 and 23 stayed inside the screen. This chapter steps out: Vision-Language-Action (VLA) models such as RT-2, OpenVLA, and pi-0 turn a vision-language model (VLM) into a policy that drives real or simulated robots. We cover the architecture, the training data, and the safety story.

Big Picture

This chapter shows how LLMs act in the physical world. The first half covers Vision-Language-Action models: OpenVLA, Physical Intelligence's pi-0, RT-2-X, the VLA design space, a side-by-side model comparison, and current limitations. The second half covers LLM-powered robotics: SayCan-style planning, Code-as-Policies, VoxPoser spatial reasoning, multi-robot dispatch, ROS 2 integration, planner comparison, and the sim-to-real gap.

Chapter Overview

Vision-Language-Action models extend a multimodal LLM's vocabulary with motor tokens, so the same softmax that picks "Paris" given "capital of France" can pick a gripper command given a visual scene. This chapter is the most concrete tour of robotics in the book: OpenVLA-7B as the reference open implementation, Physical Intelligence's pi-0 and pi-0.5 with their flow-matching action heads, RT-2-X and the data-scaling lesson, a side-by-side comparison of VLA models, and the limitations that still bound 2026 deployments. It then layers in the planning side: SayCan, Code-as-Policies, VoxPoser, multi-robot dispatch, ROS 2 integration, and the sim-to-real gap that every working deployment crosses.
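The idea of motor tokens can be made concrete with a minimal sketch of discrete action tokenization. It assumes 256 uniform bins per action dimension (the bin count reported for RT-2-style tokenizers), a hypothetical base vocabulary of 32,000 ids, and a hypothetical layout in which the action tokens occupy the last 256 ids. The names `encode_action`, `decode_action`, and the per-dimension bounds are illustrative, not the API of any particular VLA codebase.

```python
from typing import List, Sequence

N_BINS = 256                               # bins per action dimension (assumed)
VOCAB_SIZE = 32000                         # hypothetical base vocabulary size
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS   # hypothetical: last 256 ids act as motor tokens


def encode_action(action: Sequence[float],
                  low: Sequence[float],
                  high: Sequence[float]) -> List[int]:
    """Map each continuous action dimension to one token id."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)                  # clamp into the supported range
        frac = (clipped - lo) / (hi - lo)              # normalize to [0, 1]
        bin_idx = min(int(frac * N_BINS), N_BINS - 1)  # uniform bin, upper edge folded in
        tokens.append(ACTION_TOKEN_START + bin_idx)
    return tokens


def decode_action(tokens: Sequence[int],
                  low: Sequence[float],
                  high: Sequence[float]) -> List[float]:
    """Map token ids back to continuous values at the bin centers."""
    action = []
    for t, lo, hi in zip(tokens, low, high):
        bin_idx = t - ACTION_TOKEN_START
        center = (bin_idx + 0.5) / N_BINS              # center of the bin in [0, 1]
        action.append(lo + center * (hi - lo))
    return action


if __name__ == "__main__":
    # Hypothetical 7-DoF command: 6 end-effector deltas plus a gripper value.
    low, high = [-1.0] * 7, [1.0] * 7
    command = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.1]
    tokens = encode_action(command, low, high)
    print(tokens)
    print(decode_action(tokens, low, high))
```

Under these assumptions, a policy emits one token per action dimension through the same output layer it uses for text, and the round trip loses at most half a bin width per dimension. Flow-matching action heads, which the overview attributes to pi-0 and pi-0.5, avoid this quantization step by producing continuous actions directly.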

VLAs offer the most direct view of how LLMs become agents in the physical world. The chapter pairs architecture with deployment so that you finish able to read a robotics paper, evaluate a VLA stack, and reason about the planning layer above it.

Learning Objectives

Explain the VLA equation: how motor tokens enter a multimodal LLM's vocabulary.
Walk the OpenVLA-7B reference implementation end to end.
Compare discrete-token action heads with the flow-matching action experts in pi-0 and pi-0.5.
Apply the data-scaling lessons from RT-2-X to a new VLA training run.
Architect a SayCan, Code-as-Policies, or VoxPoser planning stack on top of a VLA.
Integrate a VLA agent with ROS 2 and diagnose sim-to-real gaps in deployment.

Prerequisites

Vision-language models from Chapter 22
Reinforcement-learning basics from Chapter 0
Some prior exposure to robotics or control helps but is optional

Sections

What's Next?

Next: Chapter 25: Tools of the Trade, Multimodal Stack. Chapter 25 closes Part V with the consolidated multimodal toolkit: audio codecs and TTS engines (Bark, F5-TTS, XTTS), the document-AI stack (Marker, MinerU, surya), VLM serving frameworks, 3D engines (gsplat, Splatfacto), and the VLA training rigs (Hugging Face LeRobot, OpenVLA). After that, Part VI moves from perceiving and acting in single steps to building agents that plan, use tools, and coordinate.