Section 24.6: VLA Limitations

"I can pick up a block. I can pour a cup of water. I cannot tie my shoelaces, and 2026 still owes me an answer."
Hallux, Honestly-Limited AI Agent

Big Picture

2026 VLAs work, but they work in a narrower range than the marketing implies. Three structural limitations cut across all five families surveyed in Section 24.5: the sim-to-real gap, the dexterity ceiling, and the unresolved safety story. This section names each limitation precisely, explains why it has not yielded to model scaling, and points toward the research directions most likely to dent it in 2026-2027.

Prerequisites

This section assumes the VLA mechanics from Section 24.1. Agent-safety considerations are covered in detail later in the book.

24.6.1 The Sim-to-Real Gap

The first structural limitation is that a policy trained entirely in simulation loses 30 to 50 percentage points (pp) of success rate when deployed on a real robot of the same nominal embodiment. The gap is dominated by visual domain shift, contact-dynamics mismatch, and per-unit robot calibration. Crucially, it shrinks by only 1 to 2 pp per order of magnitude of additional training data, far more slowly than overall success rate improves. That is enough for this limitations tour: the full anatomy of the gap, the mitigation stack (domain randomization, system identification, real-world fine-tuning, hardware-in-the-loop validation), and the 2025 to 2026 deployment playbook have their own section, Section 24.13: Sim-to-Real Gap.
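
To make the scaling claim concrete, the sketch below does the back-of-envelope arithmetic: if the gap shrinks by a fixed number of percentage points per tenfold increase in simulation data, how many tenfold increases would it take to close it? The function name and the assumption that the shrink is linear in log10(data) are illustrative choices, not a fitted model.

```python
def decades_to_close(gap_pp: float, shrink_pp_per_decade: float) -> float:
    """Orders of magnitude of extra data needed to close a gap,
    assuming the gap shrinks linearly in log10(data). Illustrative only."""
    if shrink_pp_per_decade <= 0:
        raise ValueError("shrink rate must be positive")
    return gap_pp / shrink_pp_per_decade


# Sweep the ranges quoted in the text: 30-50 pp gap, 1-2 pp per decade.
for gap in (30.0, 50.0):
    for rate in (1.0, 2.0):
        decades = decades_to_close(gap, rate)
        print(f"gap={gap:.0f} pp, rate={rate:.0f} pp/decade -> {decades:.0f} decades")
```

Even the most optimistic corner of that sweep (a 30 pp gap shrinking at 2 pp per decade) asks for 15 orders of magnitude more data, which is why the gap is treated here as something data scaling alone does not close.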

24.6.2 The Dexterity Ceiling

Illustration: a friendly cartoon robot arm gleefully stacks apples into a neat pyramid, while its gripper hovers in confusion over a single candy bar lying on the table next to the bowl.

Figure 24.6.1: Every modern VLA can pick the apple. Almost none of them can unwrap the candy bar. Dexterity is the line between "demo" and "production".

Every VLA surveyed in Section 24.5 can do "pick the apple and put it in the bowl". None of them, as of Q1 2026, can reliably do "unwrap the candy bar". The structural problem is that the public VLA datasets are dominated by single-arm, two-finger gripper tasks where the action space is 7-D and the contact patches are small. Multi-finger manipulation, in-hand reorientation, and contact-rich assembly require either much higher-dimensional action spaces (the Shadow Hand has 24 DOF) or qualitatively different control strategies (impedance control, tactile feedback) that current VLAs do not model.

Task category	Action DOF needed	Best 2026 VLA success (%)	Human baseline (%)
Pick-and-place	7 (end-effector + gripper)	~88	~99
Two-arm coordination (no contact)	14 (bimanual)	~76	~98
Two-arm contact (e.g. plate handoff)	14 + contact	~58	~95
In-hand reorientation	16-24 (multi-finger)	~12	~85
Tool use (screwdriver, scissors)	7-14 + force control	~22	~90
Fine assembly (snap-fit, threading)	7 + 6-D tactile force	<10	~80

Table 24.6.1a: The dexterity ceiling. Tasks below the "tool use" line are not solvable by any open-source VLA. The frontier (pi-0.5, Figure-02, Tesla Optimus) does meaningfully better on bimanual but still cannot do in-hand reorientation reliably.
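
One way to read the table is to rank tasks by how far the best VLA sits below the human baseline. The sketch below copies the rows of Table 24.6.1a into a list and computes that gap. The "<10" entry is stored as 10.0, so the fine-assembly gap is a lower bound; the variable and function names are illustrative.

```python
# Rows copied from Table 24.6.1a: (task, best 2026 VLA success %, human baseline %).
# "<10" for fine assembly is stored as 10.0, i.e. an upper bound on VLA success.
DEXTERITY_TABLE = [
    ("Pick-and-place", 88.0, 99.0),
    ("Two-arm coordination (no contact)", 76.0, 98.0),
    ("Two-arm contact", 58.0, 95.0),
    ("In-hand reorientation", 12.0, 85.0),
    ("Tool use", 22.0, 90.0),
    ("Fine assembly", 10.0, 80.0),
]


def human_gap(rows):
    """Return (task, human_minus_vla_pp) pairs, largest gap first."""
    gaps = [(task, human - vla) for task, vla, human in rows]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


for task, gap in human_gap(DEXTERITY_TABLE):
    print(f"{task:36s} gap = {gap:.0f} pp")
```

Sorted this way, the three multi-finger and force-controlled categories cluster at a gap of roughly 70 pp, while pick-and-place sits near 10 pp.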

Warning: Dexterity is not a data problem in the same way

The data-scaling argument from Section 24.4 applies to distribution coverage: more diverse robots and tasks in training improve generalization to new robots and tasks. Dexterity is a different problem: it requires action-space coverage, including 24-DOF hand actions and tactile feedback channels that simply do not exist in the OXE mixture. Collecting that data is harder than scaling existing teleop pipelines, because operating a multi-finger hand by teleop is itself a research problem. Dexterity will likely require new hardware (cheap dexterous hands like the LEAP hand) plus new training paradigms (RL on simulated dexterous tasks, then sim-to-real with tactile co-training) before it scales the way pick-and-place did.

24.6.3 The Safety Story Nobody Has Solved

Illustration: a sheepish-looking cartoon humanoid robot wears three nested safety vests, one inside the other, labelled COLLISION AVOIDANCE, FORCE LIMITER, and ANOMALY DETECTOR.

Figure 24.6.2: A VLA-era safety story: no single layer is enough, so production robots wear nested vests of collision avoidance, hardware force limits, and anomaly detection on the action stream.

Text-LLM safety is hard, but a misbehaving text LLM produces tokens that humans can read, ignore, or filter. A misbehaving VLA produces motor torques that move a 30 kg robot through space. The failure modes are physical, irreversible, and potentially injurious. No 2026 VLA family has a fully satisfactory safety story; the working practice is to bolt on three orthogonal safety layers:

Hard collision avoidance: a separate non-learned layer (typically a real-time OBB or SDF check) halts the robot if the predicted next pose enters a forbidden volume. This is required by every safety-certification body and is not negotiable.
Speed and force limits: a hardware-level clamp caps peak velocity and prevents the robot from exerting more than a few newtons on any contact point, on the theory that even an incorrect motion at low speed and low force will not injure a bystander.
Anomaly detection on the action stream: a classifier monitors the predicted action chunks for out-of-distribution patterns (sudden direction changes, action magnitudes outside training distribution) and triggers a soft stop.

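# Sketch of the three-layer safety stack that runs around a VLA in production.
# The policy and the three layer objects are injected. Their interfaces
# (predict, would_collide, clamp, is_anomalous) are assumed for illustration.
import numpy as np

class SafetyWrapper:
    def __init__(self, policy, collision_checker, force_limiter, anomaly_detector):
        self.policy = policy
        self.collision = collision_checker
        self.force = force_limiter
        self.anomaly = anomaly_detector

    def step(self, obs, instruction):
        raw_action = self.policy.predict(obs, instruction)  # the VLA decision

        # Layer 1: hard collision avoidance. Never negotiable.
        if self.collision.would_collide(raw_action, obs.scene_geometry):
            return self._emergency_halt(raw_action, reason="collision predicted")

        # Layer 2: force/torque clamping. Hardware safety net.
        clamped = self.force.clamp(raw_action, max_force_n=15.0)

        # Layer 3: anomaly detection on the action stream. It inspects the raw
        # action so that clamping cannot hide an out-of-range magnitude.
        if self.anomaly.is_anomalous(raw_action):
            return self._soft_stop(raw_action, reason="OOD action")

        return clamped

    def _emergency_halt(self, action, reason):
        # Assumption: an all-zero action means "hold still" for this robot.
        print(f"EMERGENCY HALT: {reason}")
        return np.zeros_like(np.asarray(action, dtype=float))

    def _soft_stop(self, action, reason):
        # Same zero-action assumption; a real stack would ramp down smoothly.
        print(f"SOFT STOP: {reason}")
        return np.zeros_like(np.asarray(action, dtype=float))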

Code Fragment 24.6.1b: The three-layer safety stack that wraps every production VLA. The wrapper is the only piece between the model's prediction and the robot's motors. None of the three layers is learned; they are classical engineering, written by hand by safety engineers and certified to the relevant standard (ISO/TS 15066 for collaborative robots, IEC 61508 for safety-critical control).
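
For readers who want something to experiment with, here is a minimal sketch of what each layer could look like in isolation. Everything in it is hypothetical: the class names, the use of spheres as forbidden volumes (a stand-in for a real OBB or SDF check), the treatment of an action as a plain list of Cartesian values, and the z-score threshold are assumptions made for illustration, not part of any certified stack.

```python
import math


class SphereCollisionChecker:
    """Hypothetical stand-in for an OBB/SDF check: forbidden volumes are spheres."""

    def __init__(self, forbidden_spheres):
        # Each entry is ((x, y, z), radius_m).
        self.forbidden_spheres = list(forbidden_spheres)

    def would_collide(self, next_position, scene_geometry=None):
        # Signed distance to a sphere is (distance to center) - radius;
        # a negative value means the predicted position is inside the volume.
        return any(
            math.dist(next_position, center) - radius < 0.0
            for center, radius in self.forbidden_spheres
        )


class ForceLimiter:
    """Scales a commanded force vector so its magnitude never exceeds the cap."""

    def clamp(self, commanded_force, max_force_n=15.0):
        magnitude = math.sqrt(sum(f * f for f in commanded_force))
        if magnitude <= max_force_n:
            return list(commanded_force)
        scale = max_force_n / magnitude
        return [f * scale for f in commanded_force]


class ZScoreAnomalyDetector:
    """Flags any action dimension far outside the training-set statistics."""

    def __init__(self, train_mean, train_std, z_threshold=4.0):
        self.train_mean = list(train_mean)
        self.train_std = list(train_std)
        self.z_threshold = z_threshold

    def is_anomalous(self, action):
        return any(
            abs(a - m) / s > self.z_threshold
            for a, m, s in zip(action, self.train_mean, self.train_std)
        )


checker = SphereCollisionChecker([((0.0, 0.0, 0.0), 0.1)])
print(checker.would_collide((0.05, 0.0, 0.0)))   # inside the forbidden sphere
print(ForceLimiter().clamp([30.0, 40.0, 0.0]))   # 50 N command scaled to 15 N
```

The force limiter preserves the direction of the command and only shrinks its magnitude, which is one reasonable design choice; a per-axis clip would be simpler but would change the direction of motion.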

Warning: "The model said do X" is not a defense

If a VLA misbehaves and someone is injured, the legal and ethical responsibility lies with the integrator who deployed it, not with the model authors. This shapes the working practice in 2026: every commercial robot deployment treats the VLA as a "fallible policy" with a non-learned safety envelope around it. The envelope's parameters (max force, max speed, forbidden volumes) are tuned to the deployment context. This is the inverse of the chat-LLM situation, where the model itself is the last line of defense; in robotics, the model is rarely allowed to be the last line of defense.

24.6.4 The Language-Understanding Cliff

VLAs inherit the language capabilities of their underlying LLM trunk, but their grounding to physical action is much weaker than their grounding to text. A VLA understands "place the red block on the plate" perfectly. It performs worse on "place the block on the plate after you finish stacking the others". It frequently fails on negation ("do not pick up the red block"), temporal ordering, and counterfactual reasoning ("would this fit if I rotated it 90 degrees?"). The cliff is steep: an instruction one or two compositional steps beyond the training distribution drops success by 30-50 percentage points.

Fun Fact: "Pick up the block that is not red"

One stress-test that has become a community in-joke: tell a VLA "pick up the block that is not red" in a scene with one red and one blue block. Most 2026 VLAs pick the red block roughly 40 percent of the time, because the policy attends to the salient noun phrase ("red block") and weakly to the negation. This is the same negation failure that plagued early instruction-tuned LLMs (Asai et al., 2023), now visible in the action stream rather than the text. Fixing it requires training data with negation in the instructions, which is a small fraction of existing teleop corpora.
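
The stress test is easy to turn into a small harness. The sketch below measures how often a policy picks the red block on the negated instruction. The `noun_phrase_policy` function is a hypothetical stand-in for a VLA, not a real model: it simply ignores the negation with a configurable probability, set here to the roughly 40 percent figure quoted above.

```python
import random


def noun_phrase_policy(instruction, blocks, rng, p_ignore_negation=0.4):
    """Hypothetical stand-in for a VLA: with probability p_ignore_negation
    it drops the negation and grabs the color named in the instruction."""
    named = next(color for color in blocks if color in instruction)
    others = [color for color in blocks if color != named]
    if rng.random() < p_ignore_negation:
        return named
    return rng.choice(others)


def negation_error_rate(policy, n_trials=1000, seed=0):
    """Fraction of trials in which the policy picks the block it was told to avoid."""
    rng = random.Random(seed)
    blocks = ["red", "blue"]
    instruction = "pick up the block that is not red"
    wrong = sum(policy(instruction, blocks, rng) == "red" for _ in range(n_trials))
    return wrong / n_trials


print(f"wrong-pick rate: {negation_error_rate(noun_phrase_policy):.2f}")
```

Swapping in a real policy only requires a callable with the same three arguments; the harness itself makes no assumption about how the pick is produced.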

24.6.5 The Evaluation Problem

VLA evaluation is harder than LLM evaluation in three ways. First, real-robot evaluation is slow: a single benchmark run takes hours per task per model. Second, it is non-reproducible across labs: small differences in robot calibration, camera placement, and lighting cause 10-20 percentage-point swings in reported success. Third, simulation evaluation is skewed: the LIBERO and ManiSkill benchmarks favor tasks that simulate cleanly, which underrepresents the contact-rich and deformable-object tasks where the field most needs to advance. The community's response in 2025-2026 has centered on the SimplerEnv suite (Li et al., 2024), which provides standardized simulated environments built so that simulated success rates track real-robot performance. Adoption is partial; most published results still mix protocols in ways that frustrate comparison.
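
Slow real-robot runs also mean small trial counts, and small trial counts carry wide error bars even before any lab-to-lab differences enter. The sketch below computes a Wilson score interval for a reported success rate. The choice of the Wilson interval and the example of 15 successes in 20 trials are assumptions made for illustration, not a protocol from any of the benchmarks named above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(successes=15, trials=20)
print(f"15/20 successes: 95% interval = [{low:.2f}, {high:.2f}]")
```

With 20 trials, a reported 75% success rate is compatible with a true rate anywhere from roughly 53% to 89%, so a 10-20 pp difference between two papers may not be distinguishable from sampling noise at that trial count.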

Research Frontier

What's emerging on the limitations frontier

Three research threads are credibly attacking these limitations in 2026. (a) Tactile-augmented VLAs, built on sensors such as DIGIT and GelSight, add force and texture sensing as additional input modalities, which is the most plausible attack on the dexterity ceiling. (b) World-model integration (covered in Chapter 40) lets the policy mentally simulate the next few seconds before acting, which addresses the temporal-reasoning weaknesses of pure feedforward VLAs. (c) Real-robot RL from VLA priors uses the VLA as the policy initialization and runs on-robot RL with safety constraints to close the sim-to-real gap on a specific task family. None of these threads has produced a published model that beats pi-0.5 on a public benchmark, but the directions are promising.

24.6.6 When VLAs Are the Wrong Tool

It is worth ending on the cases where you should not use a VLA at all. If your task is fully scripted (industrial pick-and-place from a fixture, well-defined assembly), a classical motion planner plus a learned grasp model will outperform a VLA at lower cost and higher predictability. If your task is open-ended exploration (drone navigation, search-and-rescue), an RL policy with a different sensor mix will probably work better than a VLA. If your task requires sub-millimeter precision (electronics assembly, surgical robotics), VLAs are nowhere close to the precision floor that classical impedance control or vision-feedback servoing can deliver. The VLA sweet spot is "semi-structured manipulation with natural-language instruction in a moderately variable environment". That is a large and important slice of robotics work, but it is not all of robotics.
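
The rules of thumb above can be written down as a small decision helper. This is only a sketch of the paragraph's logic: the `TaskProfile` fields, the returned strings, and the order in which the checks are applied (precision first) are illustrative assumptions rather than an established selection procedure.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    fully_scripted: bool = False
    open_ended_exploration: bool = False
    needs_sub_mm_precision: bool = False


def recommend_approach(task: TaskProfile) -> str:
    """Map a coarse task profile to the approach suggested in the text."""
    if task.needs_sub_mm_precision:
        return "classical impedance control or visual servoing"
    if task.fully_scripted:
        return "classical motion planner plus learned grasp model"
    if task.open_ended_exploration:
        return "RL policy with a task-specific sensor mix"
    # Semi-structured manipulation with natural-language instruction.
    return "VLA"


print(recommend_approach(TaskProfile(fully_scripted=True)))
print(recommend_approach(TaskProfile()))
```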

Key Takeaway

Key Insight

Three structural limitations cut across every 2026 VLA: a 30 to 50 pp sim-to-real gap that scaling does not close, a dexterity ceiling that sets in once a task needs more than a 7-DOF arm-and-gripper action space (multi-finger manipulation in particular), and an unsolved safety story that forces production deployments to wrap the policy in a non-learned envelope. Knowing where the limits live is the difference between shipping a working system and pitching investors on a robot that does not exist yet.

Self-Check

Q1: Why does training a VLA on more simulation data fail to close the sim-to-real gap, while training on more cross-embodiment real data does close the cross-embodiment gap? Tie your answer to the difference between statistical and structural error.

Answer

Cross-embodiment real-world data improves statistical coverage: each new robot, object, or lighting condition is a draw from the true input distribution, so more data shrinks the variance of the policy's estimates and the scaling law applies. The sim-to-real gap is structural: simulators systematically misrepresent contact dynamics, friction, deformable physics, and sensor noise. Adding more simulation trajectories produces more samples from the wrong distribution; the bias does not shrink with sample size. Closing the structural gap requires either better simulators, real-world fine-tuning, or hybrid sim-plus-real co-training, not more synthetic samples.
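
The statistical-versus-structural distinction can be seen in a toy estimation problem. In the sketch below, "real" samples are drawn from the true distribution and "sim" samples from a distribution whose mean is shifted by a fixed bias. The Gaussian model, the bias of 0.3, and the function name are assumptions chosen purely for illustration; the sketch makes no claim about the size of any real simulator's error.

```python
import random
import statistics


def estimation_error(n, bias, rng, true_mean=0.0, noise_std=1.0):
    """Absolute error of the sample mean when samples come from a
    distribution whose mean is shifted by `bias` from the truth."""
    samples = [rng.gauss(true_mean + bias, noise_std) for _ in range(n)]
    return abs(statistics.fmean(samples) - true_mean)


rng = random.Random(0)
for n in (100, 10_000, 200_000):
    real_err = estimation_error(n, bias=0.0, rng=rng)   # statistical error only
    sim_err = estimation_error(n, bias=0.3, rng=rng)    # statistical + structural
    print(f"n={n:>7}: unbiased error={real_err:.4f}  biased error={sim_err:.4f}")
```

As n grows, the unbiased error keeps shrinking toward zero while the biased error settles near the bias itself, which is the toy version of "more samples from the wrong distribution".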

Q2: Sketch the three-layer safety stack from Code Fragment 24.6.1b and identify which layer would have caught (a) a model that suddenly accelerates a 10 kg payload, (b) a model that drives the gripper into a table at full force, (c) a model that produces a smooth but unnecessary movement at high speed.

Answer

The stack is (1) collision avoidance via OBB or SDF check, (2) force and speed clamping at hardware level, (3) anomaly detection on the action stream. (a) Sudden acceleration of a 10 kg payload would be caught by the anomaly detector, since it identifies action magnitudes outside the training distribution. The force clamp would not help (acceleration is not force on a static target until contact). (b) Driving the gripper into a table is the canonical job of the collision avoidance layer; the SDF check halts before contact. The force clamp is a fallback if collision detection misses it. (c) A smooth high-speed unnecessary movement is hardest: it does not collide and is not anomalous in shape. The speed limit clause of the force-and-speed layer is the only catch, by capping peak velocity below safe thresholds even when the policy requests faster motion.

Q3: You are reviewing a startup pitch claiming "our VLA can do in-hand reorientation at 90 percent success". What three follow-up questions should you ask to validate the claim against Table 24.6.1a?

Answer

First, what robot embodiment and how many DOF in the hand? Table 24.6.1a puts in-hand reorientation at around 12% for the best 2026 VLA, against a human baseline of about 85%, on 16-24 DOF multi-finger hands; a 90% claim would exceed the human baseline, so it either uses a much simpler hand or a non-standard task definition. Second, what is the object set and reorientation goal precision? "Reorientation" can mean coarse goal-yaw to within 30 degrees or fine pose to within 5 degrees; the difficulty differential is large. Third, is the evaluation on real hardware or in simulation? Sim numbers are typically inflated by 30 pp or more relative to real for contact-rich tasks; insisting on real-robot trials, or at least on a simulation protocol validated against real hardware such as SimplerEnv, usually exposes the gap.

What's Next

Continue to Section 24.7: SayCan: Grounding LLM Plans. The earlier sections of Chapter 24 covered the policy layer: what a VLA looks like, how it is trained, what it cannot do. Section 24.7 moves up one layer to the planner: how do you ask an LLM to decompose a high-level instruction into the subgoals a VLA can execute? The SayCan, Code-as-Policies, and VoxPoser families covered there are what sits on top of OpenVLA or pi-0 in a working production stack.

Further Reading

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Black, K., et al. (2024). pi-0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Technical Report.

Li, X., et al. (2024). Evaluating Real-World Robot Manipulation Policies in Simulation (SimplerEnv). CoRL 2024, arXiv:2405.05941.

Tobin, J., et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017, arXiv:1703.06907.

ISO/TS 15066:2016. Robots and robotic devices: Collaborative robots. International Organization for Standardization.

Khazatsky, A., et al. (2024). DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. RSS 2024, arXiv:2403.12945.

Lambeta, M., et al. (2020). DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor. IEEE RA-L 2020, arXiv:2005.14679.