VLA Models and LLM-Powered Robotics

Chapter opener illustration: VLA Models and LLM-Powered Robotics.

"An LLM that can move a robot must first be told what an arm even is."

Compass, Embodied AI Agent
Looking Back

Chapters 22 and 23 stayed inside the screen. This chapter steps out: Vision-Language-Action (VLA) models such as RT-2, OpenVLA, and pi-0 turn a vision-language model (VLM) into a policy that drives real or simulated robots. We cover the architecture, the training data, and the safety story.

Big Picture

This chapter covers how LLMs act in the physical world. The first half covers Vision-Language-Action models: OpenVLA, Physical Intelligence's pi-0, RT-2-X, the VLA design space, model comparisons, and limitations. The second half covers LLM-powered robotics: SayCan-style planning, Code-as-Policies, VoxPoser spatial reasoning, multi-robot dispatch, ROS 2 integration, planner comparison, and the sim-to-real gap.

Chapter Overview

Vision-Language-Action models extend a multimodal LLM's vocabulary with motor tokens, so the same softmax that picks "Paris" given "capital of France" can pick a gripper command given a visual scene. This chapter is the most concrete tour of robotics in the book: OpenVLA-7B as the reference open implementation, Physical Intelligence's pi-0 and pi-0.5 with their flow-matching action heads, RT-2-X and the data-scaling lesson, a side-by-side comparison of VLA models, and the limitations that still bound 2026 deployments. It then layers in the planning side: SayCan, Code-as-Policies, VoxPoser, multi-robot dispatch, ROS 2 integration, and the sim-to-real gap that every working deployment crosses.
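To make the "motor tokens" idea concrete, here is a minimal sketch of one common approach: clip each continuous action dimension to a fixed range, split that range into uniform bins, and give every bin its own token id. The bin count, vocabulary size, action range, and function names below are illustrative assumptions, not values or APIs taken from any particular model. For simplicity the sketch places action ids after the text vocabulary; real systems may instead reuse existing token ids.

```python
# Minimal sketch of action tokenization by uniform binning.
# All constants are illustrative assumptions, not values from a specific model.

NUM_BINS = 256                        # assumed bins per action dimension
TEXT_VOCAB_SIZE = 32000               # assumed size of the base text vocabulary
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range


def action_to_tokens(action):
    """Map each continuous action value to a token id after the text vocabulary."""
    tokens = []
    for value in action:
        # Clip out-of-range values so they land in the first or last bin.
        clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
        fraction = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
        # The upper bound itself would index bin NUM_BINS, so cap it.
        bin_index = min(int(fraction * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokens_to_action(tokens):
    """Invert the mapping by returning the center value of each token's bin."""
    bin_width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return [ACTION_LOW + (t - TEXT_VOCAB_SIZE + 0.5) * bin_width for t in tokens]


# Hypothetical 7-dimensional action: 6 end-effector deltas plus a gripper command.
example_action = [0.1, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0]
example_tokens = action_to_tokens(example_action)
recovered_action = tokens_to_action(example_tokens)
```

Once actions are token ids, the model can emit them through the same output layer it uses for words, and decoding a predicted id back to a bin center recovers a motor command to within half a bin width. That quantization error is the price of reusing the language-model head unchanged.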

VLAs show most directly how LLMs become agents in the physical world. The chapter pairs architecture with deployment so that by the end you can read a robotics paper, evaluate a VLA stack, and reason about the planning layer above it.

Note: Learning Objectives

Prerequisites

Sections

What's Next?

Next: Chapter 25: Tools of the Trade, Multimodal Stack. Chapter 25 closes Part V with the consolidated multimodal toolkit: audio codecs and TTS engines (Bark, F5-TTS, XTTS), the document-AI stack (Marker, MinerU, surya), VLM serving frameworks, 3D engines (gsplat, Splatfacto), and the VLA training rigs (Hugging Face LeRobot, OpenVLA). After that, Part VI moves from perceiving and acting in single steps to building agents that plan, use tools, and coordinate.