"Six VLAs, eight metrics, one comparison table. Pick a column and you have picked a vendor."
Compass, Capability-Matrix-Operator AI Agent
This section consolidates the previous four into one side-by-side comparison. Five VLA families dominate the 2026 landscape: RT-2 / RT-2-X, OpenVLA, pi-0 / pi-0.5, Octo, and TinyVLA. They differ in action representation, model size, supported embodiments, inference latency, and licensing. The goal here is to give you a quick lookup matrix and a decision tree for picking among them on a real project.
Prerequisites
This section assumes the architectures in Section 24.1 through Section 24.4 and the licensing landscape from Section 10.6.
24.5.1 The Capability Matrix
BridgeData-V2, the de facto robot benchmark in this comparison table, was collected at UC Berkeley by Levine's lab using a fleet of low-cost WidowX arms purchased from Trossen Robotics for roughly $1,000 each. The dataset contains over 60,000 teleoperation episodes and was released under a permissive license in 2023; the cost of collecting it was reportedly less than the cost of one frontier model's monthly inference bill at a major cloud provider.
The single most useful artifact in this chapter is the cross-model capability table. The columns are the dimensions a practitioner cares about; the rows cover the five families (six rows, because pi-0 and pi-0.5 are listed separately). Numbers reflect publicly reported metrics as of Q1 2026, normalized where possible to the BridgeData-V2 and LIBERO benchmarks.
| Model | Backbone | Params | Action repr. | Chunk H | Inference rate (A100) | Open weights |
|---|---|---|---|---|---|---|
| RT-2-X | PaLI-X-55B | 55B | Discrete tokens (256/dim) | 1 | ~12 Hz | No |
| OpenVLA-7B | Llama-2 7B | 7.6B | Discrete tokens (256/dim) | 1 | ~6 Hz native, ~25 Hz tuned | Yes (Apache 2.0) |
| pi-0 | PaliGemma-Mix-3B | 3.3B | Flow matching (continuous) | 50 | ~20 Hz | Partial (pi-0-fast only) |
| pi-0.5 | PaliGemma-Mix-3B | 3.3B + planner | Flow matching + plan tokens | 50 | ~15 Hz | No |
| Octo | Custom ViT | 27M / 93M | Diffusion (DDIM, 50 steps) | 4-8 | ~5 Hz | Yes (MIT) |
| TinyVLA | Phi-3 mini | 1.4B | Discrete tokens (256/dim) | 1 | ~30 Hz | Yes (MIT) |
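For quick lookups it can help to hold the matrix as data rather than as a table. The following is a minimal sketch, not part of any model's tooling: the `VlaRow` record and the `shortlist` helper are hypothetical names, the rate for OpenVLA is the tuned figure, chunk ranges are stored at their upper end, and "Partial" open weights is treated as closed, all of which are assumptions you may want to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlaRow:
    model: str
    params_b: float     # parameters in billions (largest listed variant)
    action_repr: str    # "discrete", "diffusion", or "flow"
    chunk_h: int        # action-chunk horizon (upper end where a range is given)
    rate_hz: float      # approximate A100 rate from the table above
    open_weights: bool  # "Partial" is treated as False here (an assumption)


MATRIX = [
    VlaRow("RT-2-X", 55.0, "discrete", 1, 12.0, False),
    VlaRow("OpenVLA-7B", 7.6, "discrete", 1, 25.0, True),   # tuned rate
    VlaRow("pi-0", 3.3, "flow", 50, 20.0, False),
    VlaRow("pi-0.5", 3.3, "flow", 50, 15.0, False),         # planner not counted
    VlaRow("Octo", 0.093, "diffusion", 8, 5.0, True),
    VlaRow("TinyVLA", 1.4, "discrete", 1, 30.0, True),
]


def shortlist(rows, min_rate_hz=0.0, need_open_weights=False):
    """Return model names that meet a rate floor and an open-weights filter."""
    return [
        r.model
        for r in rows
        if r.rate_hz >= min_rate_hz and (r.open_weights or not need_open_weights)
    ]


print(shortlist(MATRIX, min_rate_hz=20.0, need_open_weights=True))
# prints ['OpenVLA-7B', 'TinyVLA']
```

With the filter set to at least 20 Hz and open weights required, only OpenVLA-7B (tuned) and TinyVLA remain; pi-0 meets the rate floor but not the licensing filter.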
24.5.2 Success Rates on the Public Benchmarks
BridgeData-V2 and LIBERO are the two most-cited public benchmarks in 2026. BridgeData-V2 measures real-robot success on a fixed WidowX arm across novel object placements; LIBERO is a simulator suite covering long-horizon tasks. Reporting on these two benchmarks lets you compare models that otherwise disclose different evaluation protocols.
| Model | BridgeData-V2 (real, %) | LIBERO-Goal (sim, %) | LIBERO-10 (long horizon, %) | Cross-embodiment OOD (%) |
|---|---|---|---|---|
| RT-2-X (PaLI-X-55B) | 78 | 91 | 72 | 63 |
| OpenVLA-7B | 72.5 | 88 | 67 | 58 |
| pi-0 | 81 | 93 | 78 | 66 |
| pi-0.5 | 84 | 95 | 86 | 71 |
| Octo-93M | 69 | 84 | 61 | 52 |
| TinyVLA (1.4B) | 61 | 78 | 49 | 43 |
If you plot success rate against inference rate, the Pareto frontier as of Q1 2026 has three points when each family is counted once: TinyVLA (cheap, low quality), OpenVLA tuned (mid cost, mid quality, open weights), and pi-0.5 (expensive, high quality, closed weights). Taken row by row, pi-0 also sits on the frontier, since it is faster than pi-0.5 (~20 Hz versus ~15 Hz) at slightly lower success, but it belongs to the same family. Octo is dominated by OpenVLA on the same benchmarks; RT-2-X is dominated by pi-0 except on web-knowledge tasks (where the 55B trunk's pretraining wins). For a new project the decision is essentially a three-way choice among the Pareto points, plus the practical filter of whether closed-weights pi-0.5 access is feasible for your team.
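The dominance argument can be checked mechanically. The sketch below uses the A100 rates and BridgeData-V2 success rates from the two tables (the tuned rate for OpenVLA) and treats higher as better on both axes; `dominates` and `pareto_frontier` are illustrative helper names. Run row by row, the check also keeps pi-0, because it is faster than pi-0.5; the three-point summary corresponds to counting the pi-0 / pi-0.5 family once.

```python
# (model, rate in Hz on A100, BridgeData-V2 success in %), copied from the tables.
POINTS = [
    ("RT-2-X", 12.0, 78.0),
    ("OpenVLA-7B (tuned)", 25.0, 72.5),
    ("pi-0", 20.0, 81.0),
    ("pi-0.5", 15.0, 84.0),
    ("Octo-93M", 5.0, 69.0),
    ("TinyVLA", 30.0, 61.0),
]


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    _, a_hz, a_sr = a
    _, b_hz, b_sr = b
    return a_hz >= b_hz and a_sr >= b_sr and (a_hz > b_hz or a_sr > b_sr)


def pareto_frontier(points):
    """Names of the points that no other point dominates, in input order."""
    return [p[0] for p in points if not any(dominates(q, p) for q in points)]


print(pareto_frontier(POINTS))
# prints ['OpenVLA-7B (tuned)', 'pi-0', 'pi-0.5', 'TinyVLA']
```

Swapping in the LIBERO columns or your own measured rates only requires editing `POINTS`; the frontier can shift noticeably if OpenVLA's native ~6 Hz is used instead of the tuned figure.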
24.5.3 Decision Tree: Which VLA Do You Pick
The following decision tree captures the working recommendations from the last 18 months of robotics-team interviews. It is not authoritative, but it matches what most production teams converge on.
def pick_vla(project):
    """Pragmatic decision tree for choosing a VLA in 2026."""
    # Checks run in priority order; the first matching branch wins.
    if project.requires_humanoid_or_bimanual:
        if project.has_pi_access:
            return "pi-0.5"
        else:
            return "pi-0-fast + OpenPI training stack, accept the data gap"
    # Tight memory budgets rule out the 7B-class models.
    if project.vram_budget_gb < 12:
        return "TinyVLA-1.4B, accept the ~10 pp quality drop"
    if project.requires_open_weights:
        return "OpenVLA-7B with LoRA finetune on your robot"
    # High control rates need the tuned inference stack.
    if project.latency_budget_hz >= 25:
        return "OpenVLA-7B + INT4 + TensorRT-LLM + speculative"
    if project.requires_long_horizon_planning:
        return "pi-0.5 if accessible, else OpenVLA + a separate planner LLM"
    return "OpenVLA-7B (the default working answer)"
24.5.4 The Action Vocabulary Axis
The most important architectural choice that distinguishes the families is how they represent actions. The three options in 2026 are discrete tokens (OpenVLA, RT-2-X, TinyVLA), diffusion (Octo), and flow matching (pi-0, pi-0.5). Each comes with predictable trade-offs:
| Representation | Inference cost | Smoothness | Multimodal grasps | High-DOF support | Interpretability |
|---|---|---|---|---|---|
| Discrete tokens | 1 forward / DOF | Quantization chatter at bin edges | Yes (natural) | Linear in DOF (slow) | High (inspect logits) |
| Diffusion (DDIM/DDPM) | ~50 forwards / chunk | Smooth | Yes | Constant in DOF | Medium |
| Flow matching | 5-10 forwards / chunk | Smooth | Yes | Constant in DOF | Low |
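The "quantization chatter" entry for discrete tokens is easy to see in a toy example. The sketch below assumes actions normalized to [-1, 1] and 256 uniform bins per dimension, decoded to bin centers; real models may choose the bin range differently (for example from dataset statistics), so treat the constants as assumptions rather than any specific model's tokenizer.

```python
N_BINS = 256
LOW, HIGH = -1.0, 1.0                 # assumed normalized action range
BIN_WIDTH = (HIGH - LOW) / N_BINS     # 0.0078125 for these constants


def encode(value):
    """Map a continuous action value to a bin index in [0, N_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    return min(int((clipped - LOW) / BIN_WIDTH), N_BINS - 1)


def decode(token):
    """Map a bin index back to the center of its bin."""
    return LOW + (token + 0.5) * BIN_WIDTH


# The round-trip error is bounded by half a bin width.
x = 0.3141
print(abs(decode(encode(x)) - x) <= BIN_WIDTH / 2)   # prints True

# Chatter: two targets 0.0002 apart straddle the bin edge at 0.0,
# so their decoded commands end up a full bin width apart.
a, b = -0.0001, 0.0001
print(encode(a), encode(b))           # prints 127 128
print(decode(b) - decode(a))          # prints 0.0078125
```

A policy whose intended output hovers near a bin edge can therefore flip between adjacent tokens from step to step, which shows up as small jitter in the commanded action. Diffusion and flow-matching heads emit continuous values and avoid this particular artifact.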
Each model's reported latency depends heavily on inference-stack details (quantization, FlashAttention version, batch size, KV cache management, speculative decoding configuration). The numbers in Figure 24.5.1c assume a "production-tuned" stack, which means the maintainers' best-effort serving recipe; off-the-shelf checkpoints typically run 2-3x slower. Always benchmark on your own hardware before committing; a 30 percent latency miss can be the difference between a deployable policy and a research demo.
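A benchmark on your own hardware does not need much machinery. The following is a minimal timing sketch using only the standard library; `measure_rate_hz`, `dummy_policy`, and the observation dictionary are hypothetical stand-ins for your real inference call and inputs. For GPU inference you would also need to synchronize the device before stopping the timer (for example with `torch.cuda.synchronize()` in PyTorch), otherwise asynchronous kernel launches make the numbers look better than they are.

```python
import statistics
import time


def measure_rate_hz(policy_step, make_observation, warmup=10, iters=100):
    """Time repeated policy calls; return (median rate in Hz, p95 latency in ms)."""
    for _ in range(warmup):               # warm caches and lazy init before timing
        policy_step(make_observation())
    latencies = []
    for _ in range(iters):
        obs = make_observation()          # built outside the timed region
        start = time.perf_counter()
        policy_step(obs)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    median_s = statistics.median(latencies)
    p95_s = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return 1.0 / median_s, p95_s * 1000.0


def dummy_policy(obs):
    """Stand-in policy so the sketch runs anywhere; replace with real inference."""
    time.sleep(0.002)
    return [0.0] * 7                      # a 7-DOF action placeholder


hz, p95_ms = measure_rate_hz(
    dummy_policy, lambda: {"image": None, "instruction": "pick up the cup"}
)
print(f"median rate: {hz:.1f} Hz, p95 latency: {p95_ms:.1f} ms")
```

Reporting a tail percentile alongside the median matters for control loops, since an occasional slow call can miss a deadline even when the average rate looks fine. For chunked policies, each call yields several actions, so the effective action rate is higher than the call rate measured here.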
24.5.5 Licensing and the Open-Weights Frontier
Robotics has a more polarized open-weights situation than text. The open-weights end (OpenVLA, Octo, TinyVLA) lets you finetune, redistribute, and ship in commercial products with few conditions beyond attribution and license-notice requirements (Apache 2.0 and MIT licenses dominate). The closed-weights end (pi-0.5, RT-2-X, Figure's internal models, Tesla Optimus, 1X Neo) is gated behind partnerships, research programs, or fully proprietary stacks. The gap matters because a robotics team cannot simply consume a closed-weights model through a hosted API the way a SaaS company pays for a Claude or GPT-5 endpoint; the policy has to run on the robot, which means the weights have to live on the robot, which means somebody at the closed-weights vendor has to trust your deployment.
A home-robotics startup shipping cleaning robots cannot send camera frames to a cloud API for every action; the latency budget (and the privacy implications) preclude it. The policy must run on a NUC or Jetson on the robot itself, which means the model weights must be installable on hardware the startup ships to a customer. This requirement is what forces home-robotics teams toward open-weights checkpoints, even at a quality cost. The closed-weights frontier is more deployable in industrial settings where the robot operates inside a factory and can be permanently connected to a corporate network with vendor-managed inference servers.
24.5.6 What Changed in the Last 12 Months
The state of this matrix as of Q1 2026 differs from Q1 2025 in three ways. First, pi-0.5 displaced RT-2-X as the unambiguous quality leader. Second, OpenVLA gained INT4 + speculative + TensorRT inference tooling that closed most of the latency gap to pi-0. Third, TinyVLA emerged as a viable embedded option for resource-constrained robots; the 1.4B variant runs on a Jetson AGX Orin at ~25 Hz with 5 GB VRAM, which was impossible a year ago.
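A rough weights-only estimate shows why quantization and small backbones matter for embedded targets. The sketch below multiplies the parameter counts from the capability table by standard bytes-per-parameter figures. It deliberately ignores activations, KV cache, vision-encoder buffers, and quantization metadata, so real VRAM use will be higher; `weight_footprint_gb` is an illustrative helper name.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(params_billions, dtype):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # 1e9 params * bytes per param / 1e9 bytes per GB cancels to a plain product.
    return params_billions * BYTES_PER_PARAM[dtype]


for name, params_b, dtype in [
    ("TinyVLA-1.4B", 1.4, "fp16"),
    ("OpenVLA-7B", 7.6, "fp16"),
    ("OpenVLA-7B", 7.6, "int4"),
]:
    print(f"{name} @ {dtype}: {weight_footprint_gb(params_b, dtype):.1f} GB")
# prints:
# TinyVLA-1.4B @ fp16: 2.8 GB
# OpenVLA-7B @ fp16: 15.2 GB
# OpenVLA-7B @ int4: 3.8 GB
```

Under these assumptions the 1.4B model's weights take about 2.8 GB at fp16, which leaves headroom within a 5 GB budget, while the 7B-class model at fp16 exceeds a 12 GB budget on weights alone and fits only after aggressive quantization.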
Three trends are visible in late-2025 publications and conference talks. (a) Open-weights pi-0 successors are likely; Physical Intelligence has hinted at a future "pi-1" with a permissive license. (b) Models with native multi-camera support (wrist + third-person + ego) will become standard; OpenVLA's single-frame limitation is increasingly a stumbling block. (c) The 7B-parameter sweet spot may shift to either 1-2B (driven by embedded deployment) or 30B+ (driven by scaling laws). The current 7B consensus is a stable Schelling point but probably not a long-run equilibrium.
Key Takeaway
Five families dominate the VLA landscape in 2026: RT-2-X (research artifact, closed), OpenVLA (default open-weights choice), pi-0 / pi-0.5 (quality leader, closed), Octo (cheap small option), and TinyVLA (embedded). The Pareto frontier is three points: TinyVLA, OpenVLA-tuned, and pi-0.5. The decision is dominated by open-weights requirements, latency budget, and dexterity ceiling.
Continue to Section 24.6: VLA Limitations.
Section 24.6 closes the chapter by stepping back to look at the limitations that all five families share: the sim-to-real gap, the dexterity ceiling, and the safety story that nobody in robotics has fully solved.