Section 24.5: Comparing VLA Models

"Six VLAs, eight metrics, one comparison table. Pick a column and you have picked a vendor."
Compass, Capability-Matrix-Operator AI Agent

Big Picture

This section consolidates the previous four into one side-by-side comparison. Five VLA families dominate the 2026 landscape: RT-2 / RT-2-X, OpenVLA, pi-0 / pi-0.5, Octo, and TinyVLA. They differ in action representation, model size, supported embodiments, inference latency, and licensing. The goal here is to give you a quick lookup matrix and a decision tree for picking among them on a real project.

Prerequisites

This section assumes the architectures in Section 24.1 through Section 24.4 and the licensing landscape from Section 10.6.

24.5.1 The Capability Matrix

Fun Fact

BridgeData-V2, the de facto robot-benchmark in this comparison table, was collected at UC Berkeley by Levine's lab using a fleet of low-cost WidowX arms purchased from Trossen Robotics for roughly $1,000 each. The dataset contains over 60,000 teleoperation episodes and was released under a permissive license in 2023; the cost of collecting it was reportedly less than the cost of one frontier model's monthly inference bill at a major cloud provider.

The single most useful artifact in this chapter is the cross-model capability table. The columns are the dimensions a practitioner cares about; the rows are the six models drawn from the five families (pi-0 and pi-0.5 are listed separately). Numbers reflect publicly reported metrics as of Q1 2026, normalized where possible to the BridgeData-V2 and LIBERO benchmarks.

Model	Backbone	Params	Action repr.	Chunk horizon H (steps)	Latency (A100, control rate)	Open weights
RT-2-X	PaLI-X-55B	55B	Discrete tokens (256/dim)	1	~12 Hz	No
OpenVLA-7B	Llama-2 7B	7.6B	Discrete tokens (256/dim)	1	~6 Hz native, ~25 Hz tuned	Yes (Apache 2.0)
pi-0	PaliGemma-Mix-3B	3.3B	Flow matching (continuous)	50	~20 Hz	Partial (pi-0-fast only)
pi-0.5	PaliGemma-Mix-3B	3.3B + planner	Flow matching + plan tokens	50	~15 Hz	No
Octo	Custom ViT	27M / 93M	Diffusion (DDIM, 50 steps)	4-8	~5 Hz	Yes (MIT)
TinyVLA	Phi-3 mini	1.4B	Discrete tokens (256/dim)	1	~30 Hz	Yes (MIT)

Table 24.5.1: The dominant VLA families as of Q1 2026. Latency numbers are for a single A100 80GB in production-tuned configuration; native (out-of-the-box) latency is typically 2-3x slower.

24.5.2 Success Rates on the Public Benchmarks

BridgeData-V2 and LIBERO are the two most-cited public benchmarks in 2026. BridgeData-V2 measures real-robot success on a fixed WidowX arm across novel object placements; LIBERO is a simulator suite covering long-horizon tasks. Reporting on these two benchmarks lets you compare models that otherwise disclose different evaluation protocols.

Model	BridgeData-V2 (real, %)	LIBERO-Goal (sim, %)	LIBERO-10 (long horizon, %)	Cross-embodiment OOD (%)
RT-2-X (PaLI-X-55B)	78	91	72	63
OpenVLA-7B	72.5	88	67	58
pi-0	81	93	78	66
pi-0.5	84	95	86	71
Octo-93M	69	84	61	52
TinyVLA (1.4B)	61	78	49	43

Table 24.5.2: Reported success rates across the public benchmark suite. Numbers are from each model's primary publication; small differences (under ~3 pp) are within evaluation noise and should not be over-interpreted. pi-0.5 is the current state of the art on every benchmark; TinyVLA shows that small models can still ship if you accept the quality drop.
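The "within evaluation noise" caveat can be made concrete with a binomial standard error. The sketch below is illustrative only: it assumes each reported rate comes from independent pass/fail trials, and the trial count `N_TRIALS = 100` is a hypothetical value, not a figure taken from any of the publications.

```python
import math


def success_stderr_pp(success_pct, n_trials):
    """Binomial standard error of a success rate, in percentage points."""
    p = success_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_trials)


def gap_exceeds_noise(pct_a, pct_b, n_trials, z=1.96):
    """True if the gap is larger than z standard errors of the difference."""
    se_diff = math.hypot(
        success_stderr_pp(pct_a, n_trials),
        success_stderr_pp(pct_b, n_trials),
    )
    return abs(pct_a - pct_b) > z * se_diff


N_TRIALS = 100  # hypothetical trials per model; check each paper's protocol

# BridgeData-V2 values from Table 24.5.2.
print("pi-0 vs pi-0.5:", gap_exceeds_noise(81.0, 84.0, N_TRIALS))
print("TinyVLA vs pi-0.5:", gap_exceeds_noise(61.0, 84.0, N_TRIALS))
```

Under that assumed trial count, a 3 pp gap cannot be separated from sampling noise, while a 23 pp gap can. With more trials the threshold shrinks, so the ~3 pp rule of thumb depends on how many episodes each evaluation actually ran.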

Key Insight: The Pareto frontier is small

If you plot success rate against inference latency, the Pareto frontier as of Q1 2026 has three practical points: TinyVLA (cheap, low quality), OpenVLA tuned (mid cost, mid quality, open weights), and pi-0.5 (expensive, high quality, closed weights). Octo is dominated by OpenVLA on the same benchmarks; RT-2-X is dominated by pi-0 except on web-knowledge tasks (where the 55B trunk's pretraining wins). pi-0 itself is a borderline case: it runs slightly faster than pi-0.5 (~20 Hz vs ~15 Hz in Table 24.5.1) but trails it on every benchmark in Table 24.5.2, so it rarely changes the decision. For a new project the decision is essentially a three-way choice among the Pareto points, plus the practical filter of whether closed-weights pi-0.5 access is feasible for your team.
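The dominance argument can be checked mechanically. The following is a minimal sketch that pairs the tuned control rates from Table 24.5.1 with the BridgeData-V2 column of Table 24.5.2; choosing that single benchmark column, and the helper names `dominates` and `pareto_frontier`, are assumptions made for illustration.

```python
# (control rate in Hz from Table 24.5.1, BridgeData-V2 success % from Table 24.5.2)
MODELS = {
    "RT-2-X": (12.0, 78.0),
    "OpenVLA-7B (tuned)": (25.0, 72.5),
    "pi-0": (20.0, 81.0),
    "pi-0.5": (15.0, 84.0),
    "Octo-93M": (5.0, 69.0),
    "TinyVLA": (30.0, 61.0),
}


def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_frontier(models):
    """Names of models that no other model dominates, sorted alphabetically."""
    return sorted(
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    )


frontier = pareto_frontier(MODELS)
dominated = sorted(set(MODELS) - set(frontier))
print("frontier:", frontier)
print("dominated:", dominated)
```

With these numbers Octo-93M and RT-2-X come out dominated, as the text states. Under strict dominance pi-0 also survives, because it is faster than pi-0.5 by about 5 Hz; the three-point frontier in the text reflects the judgment that this margin is too small to matter in practice.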

24.5.3 Decision Tree: Which VLA Do You Pick

The following decision tree captures the working recommendations from the last 18 months of robotics-team interviews. It is not authoritative, but it matches what most production teams converge on.

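def pick_vla(project):
    """Pragmatic decision tree for choosing a VLA in 2026.

    `project` is any object exposing the attributes read below.
    Checks run in priority order; the first match wins.
    """
    # Humanoid or bimanual work: high-DOF control favors the pi-0 family.
    if project.requires_humanoid_or_bimanual:
        if project.has_pi_access:
            return "pi-0.5"
        else:
            return "pi-0-fast + OpenPI training stack, accept the data gap"

    # Tight memory budgets rule out the 7B-class models.
    if project.vram_budget_gb < 12:
        return "TinyVLA-1.4B, accept the ~10 pp quality drop"

    # Open weights required: finetune the default open checkpoint.
    if project.requires_open_weights:
        return "OpenVLA-7B with LoRA finetune on your robot"

    # High control-rate targets need the tuned OpenVLA serving stack.
    if project.latency_budget_hz >= 25:
        return "OpenVLA-7B + INT4 + TensorRT-LLM + speculative"

    # Long-horizon tasks benefit from pi-0.5's planner when it is available.
    if project.requires_long_horizon_planning:
        return "pi-0.5 if accessible, else OpenVLA + a separate planner LLM"

    # Fallback when none of the constraints above applies.
    return "OpenVLA-7B (the default working answer)"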

Code Fragment 24.5.1a: The decision tree as a Python function. The defaults reflect 2026 practitioner consensus; revisit annually as new checkpoints land. The "OpenVLA-7B is the default" recommendation has been stable for 12 months and is unlikely to shift before pi-0.5 releases open weights.

24.5.4 The Action Vocabulary Axis

The most important architectural choice that distinguishes the families is how they represent actions. The three options in 2026 are discrete tokens (OpenVLA, RT-2-X, TinyVLA), diffusion (Octo), and flow matching (pi-0, pi-0.5). Each comes with predictable trade-offs:

Representation	Inference cost	Smoothness	Multimodal grasps	High-DOF support	Interpretability
Discrete tokens	1 forward / DOF	Quantization chatter at bin edges	Yes (natural)	Linear in DOF (slow)	High (inspect logits)
Diffusion (DDIM/DDPM)	~50 forwards / chunk	Smooth	Yes	Constant in DOF	Medium
Flow matching	5-10 forwards / chunk	Smooth	Yes	Constant in DOF	Low

Table 24.5.3: Trade-offs of the three action representations. The "interpretability" column is what makes discrete tokens hold on despite their other limitations: you can inspect the action vocabulary's logit distribution and immediately see what the policy is uncertain about. Flow matching gives smoother and faster control but is harder to debug.
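Two entries in the discrete-token row, the quantization chatter and the inspectable logits, can be illustrated in a few lines. This is a sketch under stated assumptions: it uses uniform bins over a hypothetical normalized range of [-1, 1], whereas real action tokenizers may set bin ranges differently, and the function names are illustrative.

```python
import math

N_BINS = 256  # bins per action dimension, as listed in Table 24.5.1


def discretize(value, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a continuous action in [low, high] to a bin id in [0, n_bins - 1]."""
    scaled = (value - low) / (high - low)
    return min(max(int(math.floor(scaled * n_bins)), 0), n_bins - 1)


def undiscretize(bin_id, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a bin id back to the center of its bin."""
    return low + (bin_id + 0.5) * (high - low) / n_bins


def entropy_bits(probs):
    """Shannon entropy of a per-dimension bin distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Round-trip error is bounded by half a bin width; values that straddle
# a bin edge can flip between neighboring bins from step to step.
half_bin = (1.0 - (-1.0)) / N_BINS / 2.0
action = 0.3141
print("bin:", discretize(action), "error bound:", half_bin)

# A flat distribution means the policy has no preference in that dimension.
print("uniform entropy (bits):", entropy_bits([1.0 / N_BINS] * N_BINS))
```

Entropy is only one possible summary of the logits; looking at the top few bins directly, as in a bimodal grasp choice, is often more informative when debugging.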

Warning: Latency comparisons are slippery

Each model's reported latency depends heavily on inference-stack details (quantization, FlashAttention version, batch size, KV cache management, speculative decoding configuration). The numbers in Table 24.5.1 assume a "production-tuned" stack, which means the maintainers' best-effort serving recipe; off-the-shelf checkpoints typically run 2-3x slower. Always benchmark on your own hardware before committing; a 30 percent latency miss can be the difference between a deployable policy and a research demo.
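A benchmark on your own hardware does not need much machinery. The harness below is a minimal sketch using only the Python standard library; `policy_step` stands for whatever callable wraps your model's forward pass, and `dummy_policy` is a hypothetical stand-in that sleeps for about 2 ms so the example runs anywhere.

```python
import statistics
import time


def benchmark_policy(policy_step, observation, n_warmup=10, n_runs=100):
    """Time repeated calls and report median and tail latency."""
    for _ in range(n_warmup):  # let caches and lazy initialization settle
        policy_step(observation)
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        policy_step(observation)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    p95 = latencies_ms[p95_index]
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
        "p95_rate_hz": 1000.0 / p95,
    }


def meets_budget(stats, required_hz):
    """Compare the tail-latency control rate against the required rate."""
    return stats["p95_rate_hz"] >= required_hz


def dummy_policy(observation):
    """Hypothetical stand-in for a VLA forward pass (sleeps about 2 ms)."""
    time.sleep(0.002)
    return [0.0] * 7


stats = benchmark_policy(dummy_policy, observation=None, n_warmup=5, n_runs=50)
print(stats, "meets 25 Hz:", meets_budget(stats, 25.0))
```

The budget check uses the 95th percentile rather than the median because a control loop is limited by its slow steps. On a GPU, `policy_step` must block until the result is actually ready (for example by synchronizing the device), otherwise the timings will be optimistic.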

24.5.5 Licensing and the Open-Weights Frontier

Robotics has a more polarized open-weights situation than text. The open-weights end (OpenVLA, Octo, TinyVLA) lets you finetune, redistribute, and ship in commercial products with few restrictions beyond attribution and notice requirements (Apache 2.0 and MIT licenses dominate). The closed-weights end (pi-0.5, RT-2-X, Figure's internal models, Tesla Optimus, 1X Neo) is gated behind partnerships, research programs, or fully proprietary stacks. The gap matters because a robotics team cannot simply call a vendor-hosted endpoint the way a SaaS company pays for a Claude or GPT-5 API: the policy has to run on the robot, which means the weights have to live on the robot, which means somebody at the closed-weights vendor has to trust your deployment.

Real-World Scenario

The "deploy on a robot in a customer's house" requirement

A home-robotics startup shipping cleaning robots cannot send camera frames to a cloud API for every action; the latency budget (and the privacy implications) preclude it. The policy must run on a NUC or Jetson on the robot itself, which means the model weights must be installable on hardware the startup ships to a customer. This requirement is what forces home-robotics teams toward open-weights checkpoints, even at a quality cost. The closed-weights frontier is more deployable in industrial settings where the robot operates inside a factory and can be permanently connected to a corporate network with vendor-managed inference servers.

24.5.6 What Changed in the Last 12 Months

The state of this matrix as of Q1 2026 differs from Q1 2025 in three ways. First, pi-0.5 displaced RT-2-X as the unambiguous quality leader. Second, OpenVLA gained INT4 + speculative + TensorRT inference tooling that closed most of the latency gap to pi-0. Third, TinyVLA emerged as a viable embedded option for resource-constrained robots; the 1.4B variant runs on a Jetson AGX Orin at ~25 Hz in about 5 GB of memory, which a year ago was impossible.

Research Frontier

What probably changes in the next 12 months

Three trends are visible in late-2025 publications and conference talks. (a) Open-weights pi-0 successors are likely; Physical Intelligence has hinted at a future "pi-1" with a permissive license. (b) Models with native multi-camera support (wrist + third-person + ego) will become standard; OpenVLA's single-frame limitation is increasingly a stumbling block. (c) The 7B-parameter sweet spot may shift to either 1-2B (driven by embedded deployment) or 30B+ (driven by scaling laws). The current 7B consensus is a stable Schelling point but probably not a long-run equilibrium.

Key Takeaway

Key Insight

Five families dominate VLA in 2026: RT-2-X (research artifact, closed), OpenVLA (default open-weights choice), pi-0 / pi-0.5 (quality leader, closed), Octo (cheap small option), and TinyVLA (embedded). The Pareto frontier is three points: TinyVLA, OpenVLA-tuned, and pi-0.5. The decision is dominated by open-weights requirements, latency budget, and dexterity ceiling.

Self-Check

Q1: Your team is building a home-cleaning robot that runs on a Jetson Orin (~32 GB unified memory, ~64 TOPS). Which of the five families realistically fits, and why?

Answer

Two families fit. TinyVLA (1.4B) is the most comfortable fit: it runs in around 5 GB at INT4 and sustains roughly 25 Hz on the Jetson, leaving headroom for vision, control, and ROS. OpenVLA-7B in INT4 plus FlashAttention-2 also fits in around 14 GB and runs at around 7 Hz, which is acceptable for slow cleaning maneuvers but tight for reactive tasks. pi-0 and pi-0.5 are off the table both for VRAM (the PaliGemma trunk plus the flow expert exceed the Jetson's deployable footprint) and for licensing; the weights cannot legally be shipped to a customer's house. RT-2-X is closed and far too large.
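A quick way to sanity-check footprint claims like these is a weights-only lower bound: parameter count times bytes per parameter. The sketch below uses the parameter counts from Table 24.5.1; the precision table and the function name are illustrative. It deliberately ignores activations, KV cache, the vision encoder's working memory, and runtime overhead, so a deployed footprint will sit above these numbers.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(n_params_billion, precision):
    """Weights-only memory in GB (1 GB = 1e9 bytes); a lower bound, not a total."""
    return n_params_billion * BYTES_PER_PARAM[precision]


# Parameter counts (in billions) from Table 24.5.1.
for name, params_b in [("TinyVLA", 1.4), ("OpenVLA-7B", 7.6)]:
    for precision in ("fp16", "int4"):
        gb = weight_footprint_gb(params_b, precision)
        print(f"{name:11s} {precision}: {gb:.1f} GB weights-only")
```

Comparing the lower bound with a measured footprint on the target device shows how much of the memory budget goes to everything other than weights, which is usually where embedded deployments run out of room.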

Q2: Reproduce the "Pareto frontier is small" claim from the Key Insight in Section 24.5.2: plot reported success rate against latency for the six rows and identify the dominated models.

Answer

On the success-rate vs latency axes the three practical non-dominated points are: TinyVLA (cheapest, lowest quality), OpenVLA-7B tuned (mid latency, mid quality), and pi-0.5 (slowest of the three, highest quality). Octo-93M is dominated by OpenVLA: Octo reports lower success on BridgeData-V2 and LIBERO and a lower control rate (~5 Hz vs ~25 Hz tuned), so any team would pick OpenVLA. RT-2-X is dominated by pi-0 except on web-knowledge-heavy tasks; for the standard manipulation benchmarks pi-0 wins on quality and is cheaper to run. pi-0 itself is borderline: pi-0.5 beats it on every reported benchmark, and pi-0's only edge is a modest rate advantage (~20 Hz vs ~15 Hz), so under a strict definition it stays on the frontier but rarely changes the decision. In practice the Pareto frontier is three concrete points, not six.

Q3: Why is "discrete tokens" cheaper in interpretability but more expensive in latency when the robot has 14+ DOF? Tie your answer to the action-representation trade-off table in Section 24.5.4.

Answer

Discrete tokens emit one logit distribution per DOF, so you can inspect the softmax over 256 bins and immediately read off "the policy is 60% sure of bin 173, 30% of bin 200". That bimodal distribution is a free interpretability signal that flow matching and diffusion do not provide. The latency cost is linear in DOF: a 14-DOF arm requires 14 forward passes per timestep (multiplied by the horizon if you chunk), whereas flow matching takes a fixed 5-10 forwards per chunk regardless of DOF. So discrete tokens scale poorly past 7-8 DOF; for humanoids at 26 DOF they become impractical, which is why pi-0 abandoned them.
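The scaling argument is simple enough to tabulate. The sketch below counts forward passes per action chunk using the figures from the trade-off table (one forward per DOF for discrete tokens, about 50 for diffusion, 5-10 for flow matching); the function names and the default step counts are illustrative choices within those ranges.

```python
def discrete_token_forwards(dof, horizon=1):
    """One autoregressive forward per DOF, repeated for each step in the chunk."""
    return dof * horizon


def diffusion_forwards(n_denoise_steps=50):
    """Denoising steps per chunk, independent of DOF."""
    return n_denoise_steps


def flow_matching_forwards(n_integration_steps=10):
    """Integration steps per chunk, independent of DOF."""
    return n_integration_steps


HORIZON = 50  # chunk length used by pi-0 in Table 24.5.1

for dof in (7, 14, 26):
    print(
        f"{dof:2d} DOF | discrete, 1 step: {discrete_token_forwards(dof):3d}"
        f" | discrete, {HORIZON}-step chunk: {discrete_token_forwards(dof, HORIZON):4d}"
        f" | diffusion: {diffusion_forwards()}"
        f" | flow matching: {flow_matching_forwards()}"
    )
```

Forward-pass counts are not latencies: a cached decoding step through a 7B trunk and a pass through a small action head cost very different amounts. The count still shows the shape of the problem, since only the discrete-token column grows with DOF.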

What's Next

Continue to Section 24.6: VLA Limitations.

Section 24.6 closes the chapter by stepping back to look at the limitations that all five families share: the sim-to-real gap, the dexterity ceiling, and the safety story that nobody in robotics has fully solved.

Further Reading

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Black, K., et al. (2024). pi-0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Technical Report.

Physical Intelligence. (2025). pi-0.5: Scaling Robotic Foundation Models to Household Tasks. Technical Report.

Octo Model Team. (2024). Octo: An Open-Source Generalist Robot Policy. RSS 2024, arXiv:2405.12213.

Wen, J., et al. (2024). TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation. arXiv:2409.12514.

Liu, B., et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS 2023 Datasets and Benchmarks Track, arXiv:2306.03310.