"We mixed 22 robot datasets in one tokenizer, and somewhere a control theorist quietly closed their laptop."
Scale, Data-Mixed-And-Proud AI Agent
RT-2-X (Open X-Embodiment Collaboration, 2024) is the result you get when you take the RT-2 architecture and train it on the union of 21 institutions' robot data. The headline finding is that cross-embodiment scaling closely resembles the text-LLM scaling laws of Chapter 6: the out-of-distribution failure rate falls as a power law in the amount of training data, regardless of the source robot. This section lays out the cross-embodiment data recipe, the empirical scaling curves, and the practical guidance the curves imply for anyone training a VLA in 2026.
Prerequisites
This section assumes the VLA architecture from Section 24.1 and the scaling-law intuitions from Section 6.3. Familiarity with cross-domain transfer from Section 13.4 helps with the cross-embodiment discussion.
24.4.1 What "Cross-Embodiment" Means in Practice
The Open X-Embodiment dataset that powered RT-2-X pooled robot demonstrations from 34 research labs across 22 different robot platforms, totaling more than a million trajectories. It was the closest thing robotics had ever produced to an ImageNet moment, and it shipped well over a decade after ImageNet itself, which is a humbling reminder that robots are way harder than cats. The dataset's logo cheerfully includes a Franka, a UR5, and a wheeled mobile manipulator standing in a row like a very stiff class photo.
Cross-embodiment training means a single model parameterized by a single action vocabulary learns to control many different robot bodies. The technical obstacle is that a 7-DOF Franka, a 6-DOF UR5, a wheeled mobile manipulator with a 7-DOF arm, and a bimanual 14-DOF ALOHA have action spaces that are not directly comparable. The Open X-Embodiment solution is a unified action space: every action is expressed as a 7-D vector made up of a 6-D end-effector delta (translation and rotation) in a canonical world frame plus one binary gripper command. Each contributing dataset ships a small adapter that maps its native action representation into this canonical 7-D space. The adapter is one of three pieces of robotics-specific code in the entire training stack; the other two are the action tokenizer and the image preprocessing.
| Robot family | Native action space | Canonical 7-D mapping |
|---|---|---|
| Franka FR3 (7-DOF arm) | Joint position commands | Forward-kinematics to end-effector delta |
| UR5e (6-DOF arm) | End-effector pose deltas | Direct |
| WidowX (6-DOF arm) | End-effector pose deltas | Direct |
| ALOHA (bimanual 14-DOF) | Joint position per arm | Two 7-D deltas, one per arm |
| Mobile manipulator | Base velocity + arm joints | Arm gets 7-D delta; base gets separate channel |
| Humanoid | Whole-body 26-30 DOF | Decomposed into multiple 7-D streams |
It is easy to miss this point because it is buried in the data-preprocessing code. The reason a single transformer can drive 22 different robots is not architectural; it is that all 22 robots have been forced to share an action vocabulary upstream of the model. The model only ever sees the 7-D canonical deltas. From its perspective, "training on 22 robots" is the same problem as "training on 22 different camera angles of the same robot". This is the same engineering move that BPE made for text: hide the underlying messiness (Unicode encodings, language differences) inside the tokenizer, so the model can pretend everything is just a token sequence.
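The adapter idea can be sketched in a few lines. This is a minimal illustration rather than the Open X-Embodiment code: the function names, the assumed layout of the 7-D vector (three translation deltas, three rotation deltas, one gripper value), and the toy forward-kinematics function are all assumptions made for the example.

```python
import numpy as np

def canonical_from_ee_delta(delta_xyz, delta_rpy, gripper_open):
    """Pack a pose delta and a gripper flag into one 7-D canonical action.

    Assumed layout (hypothetical): [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    gripper = 1.0 if gripper_open else 0.0
    return np.concatenate([
        np.asarray(delta_xyz, dtype=float),
        np.asarray(delta_rpy, dtype=float),
        [gripper],
    ])

def canonical_from_joint_targets(q_now, q_target, fk, gripper_open):
    """Adapter for robots that are commanded in joint positions.

    `fk` is a caller-supplied forward-kinematics function returning a 6-D pose
    (x, y, z, roll, pitch, yaw). Subtracting Euler angles is only a reasonable
    approximation for small rotations.
    """
    pose_delta = np.asarray(fk(q_target), dtype=float) - np.asarray(fk(q_now), dtype=float)
    return canonical_from_ee_delta(pose_delta[:3], pose_delta[3:], gripper_open)

def toy_fk(q):
    """Stand-in kinematics for the demo: treats the first six joints as the pose."""
    return np.asarray(q, dtype=float)[:6]

# A robot that already reports end-effector deltas maps directly.
a_direct = canonical_from_ee_delta([0.01, 0.0, -0.02], [0.0, 0.0, 0.1], gripper_open=True)

# A 7-joint robot commanded in joint space goes through forward kinematics first.
q0 = np.zeros(7)
q1 = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 0.3])
a_joint = canonical_from_joint_targets(q0, q1, toy_fk, gripper_open=True)
```

Both calls produce a vector of the same shape, which is the whole point: whatever the native command format, the model downstream only ever receives the canonical form.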
24.4.2 The Open X-Embodiment Data Pyramid
The Open X-Embodiment v1 mixture (frozen in 2024) contains:
- 1.4 million trajectories total across 60 contributing datasets,
- 22 distinct robot embodiments (16 single-arm, 4 bimanual, 2 mobile manipulators),
- 970,000 trajectories after the OpenVLA "magic soup" filtering for usable language annotations and clean camera viewpoints,
- ~1.2 billion (image, language, action) tuples after timestep expansion, comparable in raw count to the GPT-2 training corpus.
The pyramid is sharply unbalanced: BridgeData-V2 alone is 60k trajectories, RT-1 is 130k, but most contributing datasets ship 1k to 5k trajectories. The "long tail" of small datasets matters more than the count suggests because each small dataset is the only source of a particular robot embodiment, camera arrangement, or task type. Removing the smallest 30 datasets (totaling under 5 percent of trajectories) drops out-of-distribution success rate by 15 percentage points, a finding that has shaped the design of v2 mixtures now under preparation.
# Inspecting the OXE mixture composition.
# NOTE: the import path and the `.weight` / `.n_trajectories` fields are shown
# for illustration; adjust them to match the mixture registry in your codebase.
from openpi.training.config import OXE_MAGIC_SOUP

total_trajs = 0
# Iterate from the most heavily weighted dataset to the least.
for name, spec in sorted(OXE_MAGIC_SOUP.items(), key=lambda kv: -kv[1].weight):
    print(f"{name:30s} weight={spec.weight:.4f} n_trajs={spec.n_trajectories:6d}")
    total_trajs += spec.n_trajectories
print(f"Total: {total_trajs:,} trajectories")

# Top 3 lines (approximate):
# bridge_orig                    weight=0.2400 n_trajs= 60064
# fractal20220817_data           weight=0.2300 n_trajs=130000
# kuka                           weight=0.1100 n_trajs=200000
# ...
# Total: 968,123 trajectories
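Mixture weights are not the same thing as raw dataset share, which is how a small dataset can end up sampled far more often than its size suggests. A rough sketch of that comparison, using the three approximate rows printed above and taking the 968,123 total as given; `oversampling_factor` is a helper name made up for this example.

```python
# (weight, n_trajectories) for the three rows shown above; values are approximate.
rows = {
    "bridge_orig": (0.24, 60_064),
    "fractal20220817_data": (0.23, 130_000),
    "kuka": (0.11, 200_000),
}
TOTAL_TRAJS = 968_123  # total reported by the listing above

def oversampling_factor(weight, n_trajs, total=TOTAL_TRAJS):
    """Ratio of sampling weight to raw share; above 1 means the dataset is upsampled."""
    return weight / (n_trajs / total)

factors = {name: oversampling_factor(w, n) for name, (w, n) in rows.items()}
for name, factor in factors.items():
    print(f"{name:24s} sampled at {factor:.2f}x its raw share")
```

With these illustrative numbers, `bridge_orig` is drawn at several times its raw share while `kuka` is drawn at well under its raw share, so the mixture deliberately flattens the pyramid.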
24.4.3 The Scaling Curves That Came Out of RT-2-X
The RT-2-X paper reports a power-law relationship between training-data size and out-of-distribution success rate that closely mirrors text-LLM scaling. Let N be the number of training trajectories and let SR_OOD(N) be the success rate on a held-out task that was not in the training mixture. The empirical fit is approximately
SR_OOD(N) ~ 1 - C * N^(-alpha)
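with alpha roughly 0.21 and C roughly 6.3, fit on the public benchmark suite. The interpretation: each doubling of cross-embodiment data multiplies the out-of-distribution failure rate, 1 - SR_OOD, by 2^(-0.21), or about 0.86, so roughly 13 percent of the remaining failures disappear per doubling until you saturate near the high-90s. In absolute terms the gain per doubling shrinks as success rises: 13 percent of a 40 percent failure rate is roughly 5 percentage points. The exponent is in the same range as the text-LLM exponent of around 0.34 (Hoffmann et al., 2022, arXiv:2203.15556), although that exponent describes loss rather than task success. The practical difference is that robot data costs roughly $5 per trajectory in human teleoperation time versus essentially free for web-scraped text.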
| Training trajectories | In-distribution success | Out-of-distribution success | Cross-embodiment transfer |
|---|---|---|---|
| 10k (single robot) | 78% | 22% | n/a |
| 100k (3 robots) | 83% | 41% | +9 pp over single-robot |
| 500k (12 robots) | 86% | 55% | +24 pp |
| 970k (22 robots, full OXE) | 88% | 63% | +34 pp |
| 2M (estimated frontier, 2026) | 90% | 72% | +45 pp |
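To get a feel for the fit, it helps to plug numbers in. The sketch below uses the constants quoted in the text (alpha = 0.21, C = 6.3); the function names are ours, and because the fit is described as approximate, its values will not match every table row.

```python
import math

ALPHA = 0.21  # exponent quoted in the text
C = 6.3       # prefactor quoted in the text

def sr_ood(n_trajs):
    """Predicted out-of-distribution success rate, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - C * n_trajs ** (-ALPHA)))

def trajs_needed(target_sr):
    """Invert the fit: trajectories needed to reach a target success rate."""
    return (C / (1.0 - target_sr)) ** (1.0 / ALPHA)

# Doubling N multiplies the failure rate (1 - SR) by 2**(-ALPHA).
failure_ratio_per_doubling = 2.0 ** (-ALPHA)

for n in (100_000, 500_000, 970_000, 2_000_000):
    print(f"N={n:>9,}  predicted SR_OOD={sr_ood(n):.1%}")
print(f"failure rate shrinks by {1 - failure_ratio_per_doubling:.1%} per doubling")
```

Under this functional form a doubling shrinks the remaining failure rate by a fixed fraction, so the gain measured in percentage points gets smaller as the success rate climbs.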
If you fix the embodiment (train and test on the same robot, just with different objects or layouts), success rate plateaus near 85 percent at 30k-100k trajectories. The power-law improvement described above applies only to the out-of-distribution robot, object, or layout case. The interpretation is that within-embodiment perception and control saturate at modest scale, but generalization across new bodies and new tasks requires the diversity that only cross-embodiment data provides. This is the empirical justification for the Open X-Embodiment effort: the data has to span robots, not just trajectories.
24.4.4 The RT-2-X Architecture Relative to RT-2
Architecturally, RT-2-X is essentially RT-2 with no meaningful changes; the contribution is the data. RT-2 (Brohan et al., 2023, arXiv:2307.15818) is built on a 55B-parameter PaLI-X backbone (smaller variants use a 5B-parameter PaLI-X or a 12B-parameter PaLM-E), with the same vocabulary-extension trick OpenVLA uses. RT-2-X swaps the robot training mixture from the RT-1 dataset alone to the full OXE mixture and otherwise keeps the model code unchanged. The result is a model that retains RT-2's web-scale visual reasoning, gains the cross-embodiment generalization of OXE, and demonstrates emergent capabilities not present in either ingredient alone (the most-cited example: RT-2-X can manipulate objects whose names were never in any robot dataset, but appear in the web-scale pretraining corpus of the vision-language backbone).
# Conceptual sketch of the RT-2-X training pipeline.
# The data side does the heavy lifting; the model side reuses RT-2 unchanged.
# `model`, `oxe_mixture`, and `action_tokenizer` are placeholder interfaces;
# the loss call assumes PyTorch tensors and a sequence of integer token ids.
import torch.nn.functional as F

def build_rt2x_training_step(model, oxe_mixture, action_tokenizer):
    def step(batch):
        # 1. Mix from 22 robots according to the magic-soup weights.
        obs, action, lang = oxe_mixture.sample(batch)
        # 2. Map each robot's native action into the canonical 7-D space.
        canonical_action = oxe_mixture.to_canonical(action, batch.embodiment)
        # 3. Tokenize image, text, and action into one sequence.
        action_tokens = action_tokenizer(canonical_action)
        sequence = model.build_io_sequence(obs, lang, action_tokens)
        # 4. Standard next-token cross-entropy loss: position t predicts token t+1.
        logits = model(sequence)  # shape (batch, seq_len, vocab)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # (batch * (seq_len - 1), vocab)
            sequence[:, 1:].reshape(-1),                  # (batch * (seq_len - 1),)
        )
    return step
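The `action_tokenizer` in the sketch above is left abstract. One common choice, used by RT-1 and RT-2, is to discretize each action dimension into 256 uniform bins. A minimal sketch of that idea follows; the clipping range of [-1, 1] and the function names are assumptions for illustration, not the RT-2-X implementation.

```python
import numpy as np

N_BINS = 256            # bins per action dimension, as in RT-1 and RT-2
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def tokenize_action(action):
    """Map each continuous action dimension to an integer bin in [0, N_BINS - 1]."""
    clipped = np.clip(np.asarray(action, dtype=float), LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)  # now in [0, 1]
    # The upper edge (scaled == 1.0) would land in bin N_BINS, so cap it.
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def detokenize_action(tokens):
    """Map bin indices back to the centre of each bin."""
    centres = (np.asarray(tokens, dtype=float) + 0.5) / N_BINS
    return centres * (HIGH - LOW) + LOW

action = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0])
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
```

The round trip is lossy by at most half a bin width per dimension, which is the price paid for turning control into next-token prediction.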
24.4.5 What the Curves Imply for Practitioners
If you are a robotics team in 2026 planning a VLA project, the scaling curves give three actionable rules. First, do not train a VLA on a single robot if you can avoid it; even a small admixture of OXE data (10 to 20 percent) measurably lifts out-of-distribution generalization. Second, the marginal value of collecting your 1,001st demonstration on your specific robot is much higher than the marginal value of the 100,001st OXE trajectory; small high-quality data scoped to your task complements rather than competes with the OXE base. Third, language quality matters disproportionately: trajectories with carefully worded instructions are worth roughly 3x what trajectories with auto-generated or noisy instructions are worth, in terms of downstream success rate. The implication is that hiring a contractor to relabel your demonstrations is often the highest-ROI data work you can do.
The empirically dominant recipe for shipping a VLA in 2026 is: (a) start from OpenVLA or pi-0-fast as the pretrained checkpoint; (b) collect ~1,000 high-quality teleop demonstrations on your specific robot doing your specific task family; (c) LoRA-finetune the pretrained model on a 90/10 mixture of your data and OXE data, keeping the OXE share in the mix to limit catastrophic forgetting of the pretrained skills. This recipe consistently delivers 75-85 percent success rates on production manipulation tasks, with under $20k in data-collection cost and under $5k in compute. It is the closest thing the field has to a "just works" pattern.
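The 90/10 mixing in step (c) can be handled at the data-loader level. A minimal sketch, assuming both datasets are simple in-memory sequences; `mixed_batches` and its parameters are hypothetical names, and a real pipeline would stream trajectories from disk rather than hold them in lists.

```python
import random

def mixed_batches(own_data, oxe_data, batch_size, own_fraction=0.9, seed=0):
    """Yield batches that draw each example from own_data with probability own_fraction."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            # Pick the source first, then a random example from that source.
            source = own_data if rng.random() < own_fraction else oxe_data
            batch.append(rng.choice(source))
        yield batch

# Toy stand-ins for trajectories, tagged by where they came from.
own = [("own", i) for i in range(1_000)]
oxe = [("oxe", i) for i in range(10_000)]

batches = mixed_batches(own, oxe, batch_size=256)
first = next(batches)
own_share = sum(1 for tag, _ in first if tag == "own") / len(first)
```

Sampling the source per example, rather than concatenating the two datasets, keeps the ratio fixed no matter how much larger the OXE side is.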
24.4.6 The Data Frontier and What 2026 Looks Like
The OXE v1 mixture froze in 2024. Two efforts are pushing the data frontier in 2026. The DROID dataset (Khazatsky et al., 2024) adds 76,000 trajectories of contact-rich manipulation across 13 institutions, and is approximately the same scale as BridgeData-V2 but with substantially higher dexterity content. The AutoRT effort at Google DeepMind (Ahn et al., 2024) uses LLMs to autonomously direct robots in real-world settings, and reported roughly 77,000 trajectories collected over seven months across a fleet of mobile manipulators in office environments. The combination of these two streams, plus the unreleased Physical Intelligence teleop corpus, is plausibly pushing the effective training-data scale past 5 million trajectories by mid-2026. Under the fit from Section 24.4.3 that extrapolates to an out-of-distribution success rate of about 75 percent; reaching 80 percent would take on the order of 10 million trajectories.
Is robot data subject to the same "synthetic data wall" debates as text data? Early evidence suggests yes but with a different shape. Simulators (Isaac Sim, MuJoCo, ManiSkill) can generate trajectories cheaply, but the sim-to-real gap (covered in detail in Section 24.13) means raw synthetic trajectories produce policies that fail on real hardware. The current consensus is that synthetic data is useful as a pretraining ingredient (it teaches the model basic physics) but cannot replace real demonstrations for the action distribution. This is roughly analogous to the text setting, where synthetic data is useful for instruction tuning but cannot replace the diversity of web text.
24.4.7 When RT-2-X Is the Wrong Answer
The RT-2-X family is not open-weights. Google has released the training methodology and partial reproductions exist (OpenVLA is the closest open-weights cousin), but the production RT-2-X checkpoints remain proprietary. For most practical purposes you should treat RT-2-X as the research artifact that established the scaling laws and use OpenVLA or pi-0 as the working implementation. The exceptions are (a) you have access to a frontier-lab partnership, (b) you are explicitly studying the scaling laws and need to reproduce them, or (c) you are at one of the contributing institutions and have inherited a working RT-2-X-class stack. In all three cases the architectural details in this section transfer directly.
Key Takeaway
RT-2-X is RT-2 trained on Open X-Embodiment. The architectural delta is zero; the data delta is the entire story. Out-of-distribution failure rate falls as a power law in the number of cross-embodiment training trajectories with exponent roughly 0.21, meaning each doubling removes about 13 percent of the remaining failures. The practical implication is that "more diverse robots in training" beats "bigger model" by a wide margin for any team within shouting distance of the open-weights frontier.
Using the fit SR_OOD(N) ~ 1 - 6.3 * N^(-0.21), estimate your expected out-of-distribution success rate. How much data would you need to collect to reach 80 percent?
Continue to Section 24.5: Comparing VLA Models.
Sections 24.2-24.4 covered three concrete VLA models (OpenVLA, pi-0, RT-2-X). Section 24.5 lines them up side by side with a capability matrix that lets you pick the right model for a given application. Then Section 24.6 turns to the limitations that all three share.