Section 24.4: RT-2-X & the Data-Scaling Story

"We mixed 22 robot datasets in one tokenizer, and somewhere a control theorist quietly closed their laptop."
Scale, Data-Mixed-And-Proud AI Agent

Big Picture

RT-2-X (Open X-Embodiment Collaboration, 2024) is the result you get when you take the RT-2 architecture and train it on the union of 21 institutions' robot data. The headline finding is that cross-embodiment scaling looks much like the text-LLM scaling laws of Chapter 6: out-of-distribution success rate improves as a power law in the amount of training data, largely regardless of which robot the extra data comes from. This section lays out the cross-embodiment data recipe, the empirical scaling curves, and the practical guidance the curves imply for anyone training a VLA in 2026.

Prerequisites

This section assumes the VLA architecture from Section 24.1 and the scaling-law intuitions from Section 6.3. Familiarity with cross-domain transfer from Section 13.4 helps with the cross-embodiment discussion.

24.4.1 What "Cross-Embodiment" Means in Practice

Fun Fact

The Open X-Embodiment dataset that powered RT-2-X pooled robot demonstrations from 34 research labs across 22 different robot platforms, totaling more than a million trajectories. It was the closest thing robotics had ever produced to an ImageNet moment, and it shipped nearly a decade and a half after ImageNet itself (2009 versus 2023), which is a humbling reminder that robots are way harder than cats. The dataset's logo cheerfully includes a Franka, a UR5, and a wheeled mobile manipulator standing in a row like a very stiff class photo.

Cross-embodiment training means a single model with a single action vocabulary learns to control many different robot bodies. The technical obstacle is that a 7-DOF Franka, a 6-DOF UR5, a wheeled mobile manipulator with a 7-DOF arm, and a bimanual 14-DOF ALOHA have action spaces that are not directly comparable. The Open X-Embodiment solution is a unified action space: every action is expressed as a 7-D vector consisting of a 6-D end-effector delta (translation and rotation) in a canonical frame plus a binary gripper command. Each contributing dataset ships a small adapter that maps its native action representation into this canonical 7-D space. The adapter is one of three pieces of robotics-specific code in the entire training stack; the other two are the action tokenizer and the image preprocessing.

Robot family	Native action space	Canonical 7-D mapping
Franka FR3 (7-DOF arm)	Joint position commands	Forward-kinematics to end-effector delta
UR5e (6-DOF arm)	End-effector pose deltas	Direct
WidowX (6-DOF arm)	End-effector pose deltas	Direct
ALOHA (bimanual 14-DOF)	Joint position per arm	Two 7-D deltas, one per arm
Mobile manipulator	Base velocity + arm joints	Arm gets 7-D delta; base gets separate channel
Humanoid	Whole-body 26-30 DOF	Decomposed into multiple 7-D streams

Table 24.4.1: How each robot family maps into the canonical 7-D end-effector action space used by RT-2-X. The mapping loses information (for example, a Franka's null-space motion is discarded when joint commands are projected to end-effector deltas), but the loss is small enough that single-platform finetuning recovers most of the deficit.
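To make the adapter idea concrete, here is a minimal sketch for the simplest case in the table: a robot that already logs end-effector poses. The function name, the [x, y, z, roll, pitch, yaw] pose layout, and the 0.5 gripper threshold are assumptions made for illustration, not the Open X-Embodiment code. Subtracting Euler angles is only a reasonable approximation for the small rotations seen between consecutive control steps.

```python
import numpy as np

def to_canonical_action(pose_t, pose_next, gripper_opening, open_threshold=0.5):
    """Map two consecutive end-effector poses to a 7-D canonical action.

    pose_t, pose_next: [x, y, z, roll, pitch, yaw] in metres and radians.
    gripper_opening:   scalar in [0, 1], 1 meaning fully open (assumed convention).
    Returns [dx, dy, dz, droll, dpitch, dyaw, gripper] with gripper in {0.0, 1.0}.
    """
    pose_t = np.asarray(pose_t, dtype=np.float64)
    pose_next = np.asarray(pose_next, dtype=np.float64)
    delta = pose_next - pose_t
    # Wrap angle differences into [-pi, pi) so crossing +/-pi yields a small delta.
    delta[3:] = (delta[3:] + np.pi) % (2.0 * np.pi) - np.pi
    gripper = 1.0 if gripper_opening >= open_threshold else 0.0
    return np.concatenate([delta, [gripper]])

# Example: a 2 cm move in x while the yaw crosses the +/-pi boundary.
action = to_canonical_action(
    pose_t=[0.40, 0.00, 0.30, 0.0, 0.0, 3.10],
    pose_next=[0.42, 0.00, 0.29, 0.0, 0.0, -3.10],
    gripper_opening=0.8,
)
print(action.shape)
```

A joint-space robot such as the Franka would need a forward-kinematics step before this function, which is where the null-space information is lost.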

Key Insight: The unified action space is the trick

It is easy to miss this point because it is buried in the data-preprocessing code. The reason a single transformer can drive 22 different robots is not architectural; it is that all 22 robots have been forced to share an action vocabulary upstream of the model. The model only ever sees the 7-D canonical deltas. From its perspective, "training on 22 robots" is the same problem as "training on 22 different camera angles of the same robot". This is the same engineering move that BPE played for text: hide the underlying messiness (Unicode encodings, language differences) inside the tokenizer, so the model can pretend everything is just a token sequence.
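The shared action vocabulary can be sketched as uniform binning: each canonical dimension is clipped to a range and mapped to one of 256 integer bins, which is the discretization the RT-2 paper describes. The bounds and function names below are illustrative assumptions; real pipelines typically derive the bounds from dataset statistics.

```python
import numpy as np

N_BINS = 256  # bins per action dimension

def tokenize_action(action, low, high, n_bins=N_BINS):
    """Uniformly discretize a continuous action into bin ids in [0, n_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # each entry now lies in [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=N_BINS):
    """Map bin ids back to the centre of each bin."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / n_bins

# Hypothetical bounds: +/-5 cm translation, +/-0.25 rad rotation, gripper in [0, 1].
low = np.array([-0.05] * 3 + [-0.25] * 3 + [0.0])
high = np.array([0.05] * 3 + [0.25] * 3 + [1.0])

action = np.array([0.02, 0.0, -0.01, 0.0, 0.0, 0.083, 1.0])
tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
print(tokens)
```

Once every robot's actions pass through the same binning, the model sees seven integers per step and has no direct way to tell which robot produced them.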

24.4.2 The Open X-Embodiment Data Pyramid

The Open X-Embodiment v1 mixture (frozen in 2024) contains:

1.4 million trajectories total across 60 contributing datasets,
22 distinct robot embodiments (16 single-arm, 4 bimanual, 2 mobile manipulators),
970,000 trajectories after the OpenVLA "magic soup" filtering for usable language annotations and clean camera viewpoints,
~1.2 billion (image, language, action) tuples after timestep expansion, comparable in raw count to the GPT-2 training corpus.

The pyramid is sharply unbalanced: BridgeData-V2 alone is 60k trajectories, RT-1 is 130k, but most contributing datasets ship 1k to 5k trajectories. The "long tail" of small datasets matters more than the count suggests because each small dataset is the only source of a particular robot embodiment, camera arrangement, or task type. Removing the smallest 30 datasets (totaling under 5 percent of trajectories) drops out-of-distribution success rate by 15 percentage points, a finding that has shaped the design of v2 mixtures now under preparation.

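# Inspecting the OXE mixture composition.
# NOTE: the import path and the `weight` / `n_trajectories` attributes are shown
# for illustration; check your training codebase for where the mixture is defined.
from openpi.training.config import OXE_MAGIC_SOUP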

total_trajs = 0
for name, spec in sorted(OXE_MAGIC_SOUP.items(), key=lambda kv: -kv[1].weight):
    print(f"{name:30s}  weight={spec.weight:.4f}  n_trajs={spec.n_trajectories:6d}")
    total_trajs += spec.n_trajectories
print(f"Total: {total_trajs:,} trajectories")
# Top 3 lines (approximate):
#   bridge_orig                weight=0.2400  n_trajs= 60064
#   fractal20220817_data       weight=0.2300  n_trajs=130000
#   kuka                       weight=0.1100  n_trajs=200000
#   ...
#   Total: 968,123 trajectories

Code Fragment 24.4.1: The OXE "magic soup" weights are not proportional to trajectory count. Heuristic upweighting of clean, well-annotated datasets (BridgeData, RT-1) and downweighting of noisier contributors gives a final training distribution that is materially different from a uniform sample.
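One way to read these weights is as an oversampling factor: a dataset's sampling weight divided by its share of the trajectories. The sketch below uses only the three approximate rows and the total from the listing above; the dictionary layout and the helper name are assumptions for illustration.

```python
# Approximate (weight, n_trajectories) values copied from the listing above.
mixture = {
    "bridge_orig": (0.24, 60_064),
    "fractal20220817_data": (0.23, 130_000),
    "kuka": (0.11, 200_000),
}
total_trajs = 968_123  # total reported at the end of the listing

def oversampling_factor(weight, n_trajs, total):
    """Ratio of sampling probability to the dataset's share of all trajectories."""
    return weight / (n_trajs / total)

factors = {
    name: oversampling_factor(weight, n_trajs, total_trajs)
    for name, (weight, n_trajs) in mixture.items()
}
for name, factor in factors.items():
    print(f"{name:24s} sampled {factor:.2f}x as often as proportional sampling")
```

With these numbers, BridgeData is drawn close to four times as often as its size alone would suggest, while the kuka data is drawn about half as often.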

24.4.3 The Scaling Curves That Came Out of RT-2-X

The RT-2-X paper reports a power-law relationship between training-data size and out-of-distribution success rate that closely mirrors text-LLM scaling. Let N be the number of training trajectories and let SR_OOD(N) be the success rate on a held-out task that was not in the training mixture. The empirical fit is approximately

SR_OOD(N) ~ 1 - C * N^(-alpha)

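with alpha roughly 0.21 and C roughly 6.3, fit on the public benchmark suite. The interpretation: each doubling of cross-embodiment data multiplies the remaining failure rate, 1 - SR_OOD, by 2^(-0.21), or about 0.86, a relative reduction of about 13 percent. In absolute terms that is roughly 5 percentage points at the full-OXE scale (from about 65 to about 70 percent), and the gains shrink as the curve approaches saturation. The exponent is of the same order as the Chinchilla exponents for language-model loss (about 0.34 for parameters and 0.28 for data; Hoffmann et al., 2022, arXiv:2203.15556), although loss and success rate are different quantities, so the comparison is only qualitative. The practical difference is cost: robot data costs roughly $5 per trajectory in human teleoperation time, versus essentially free for web-scraped text.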
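To see what the fit implies numerically, the sketch below evaluates it at the full-OXE scale and measures the effect of one doubling. The constants are the ones quoted for the fit; the function name is illustrative. Under this functional form a doubling multiplies the remaining failure rate by 2^(-alpha), so the gain in absolute percentage points depends on where on the curve you start.

```python
ALPHA = 0.21
C = 6.3

def sr_ood(n_trajs, c=C, alpha=ALPHA):
    """Out-of-distribution success rate predicted by SR = 1 - C * N^(-alpha)."""
    return 1.0 - c * n_trajs ** (-alpha)

# Factor applied to the failure rate (1 - SR) by each doubling of N.
gap_ratio = 2.0 ** (-ALPHA)
# Absolute gain, in percentage points, from doubling the full-OXE scale.
gain_pp = 100.0 * (sr_ood(2 * 970_000) - sr_ood(970_000))

print(f"SR_OOD at 970k trajectories: {sr_ood(970_000):.3f}")
print(f"failure rate shrinks by {100 * (1 - gap_ratio):.1f}% per doubling")
print(f"absolute gain from 970k to 1.94M: {gain_pp:.1f} pp")
```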

Training trajectories	In-distribution success	Out-of-distribution success	Cross-embodiment transfer
10k (single robot)	78%	22%	n/a
100k (3 robots)	83%	41%	+9 pp over single-robot
500k (12 robots)	86%	55%	+24 pp
970k (22 robots, full OXE)	88%	63%	+34 pp
2M (estimated frontier, 2026)	90%	72%	+45 pp

Table 24.4.2: Empirical scaling of RT-2-X-class models on the OXE evaluation suite. Numbers compiled from RT-2-X paper plus follow-up work; the 2M-trajectory row is extrapolated. Out-of-distribution success is what scales the most aggressively; in-distribution success is already near saturation at moderate data scales.
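As a sanity check on the table, one can refit the power law by ordinary least squares in log-log space, since log(1 - SR) = log C - alpha * log N. The sketch below does this on the five out-of-distribution entries, including the extrapolated row. Because the table compiles numbers from several sources, the recovered constants should be expected to land near the quoted fit rather than on it.

```python
import numpy as np

# Out-of-distribution column of the table above.
n_trajs = np.array([10_000, 100_000, 500_000, 970_000, 2_000_000], dtype=np.float64)
sr_ood_table = np.array([0.22, 0.41, 0.55, 0.63, 0.72])

# Linear regression of log(failure rate) on log(N):
# the slope is -alpha and the intercept is log(C).
slope, intercept = np.polyfit(np.log(n_trajs), np.log(1.0 - sr_ood_table), deg=1)
alpha_hat = -slope
c_hat = float(np.exp(intercept))

print(f"alpha ~ {alpha_hat:.2f}, C ~ {c_hat:.1f}")
```

On these five rows the exponent comes out slightly below the quoted 0.21, at roughly 0.18, which is the kind of spread to expect from a fit over so few points.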

Key Insight: Cross-embodiment is the only thing that scales OOD success

If you fix the embodiment (train and test on the same robot, just with different objects or layouts), success rate plateaus near 85 percent at 30k-100k trajectories. The power-law improvement above, about a 13 percent cut in failure rate per doubling, applies only to the out-of-distribution robot, object, or layout case. The interpretation is that within-embodiment perception and control saturate at modest scale, but generalization across new bodies and new tasks requires the diversity that only cross-embodiment data provides. This is the empirical justification for the Open X-Embodiment effort: the data has to span robots, not just trajectories.

24.4.4 The RT-2-X Architecture Relative to RT-2

Architecturally, RT-2-X is essentially RT-2 with no meaningful changes; the contribution is the data. RT-2 (Brohan et al., 2023, arXiv:2307.15818) is built on a 55B-parameter PaLI-X backbone (smaller variants use a 5B-parameter PaLI-X or a 12B-parameter PaLM-E), and it represents discretized actions as tokens in the language model's vocabulary, the same trick OpenVLA uses. RT-2-X swaps the training mixture from the RT-1 dataset alone to the full OXE mixture and otherwise keeps the model code unchanged. The result is a model that retains RT-2's web-scale visual reasoning, gains the cross-embodiment generalization of OXE, and demonstrates emergent capabilities not present in either ingredient alone (the most-cited example: RT-2-X can manipulate objects whose names were never in any robot dataset but do appear in the web-scale pretraining corpus of the backbone).

# Conceptual sketch of the RT-2-X training pipeline.
# The data side does the heavy lifting; the model side reuses RT-2 unchanged.

import torch.nn.functional as F  # assumes a PyTorch model

def build_rt2x_training_step(model, oxe_mixture, action_tokenizer):
    # model, oxe_mixture, and action_tokenizer are placeholder interfaces.
    def step(batch_size):
        # 1. Sample a mixed batch from 22 robots according to the magic-soup weights.
        obs, action, lang, embodiment = oxe_mixture.sample(batch_size)

        # 2. Map each robot's native action into the canonical 7-D space.
        canonical_action = oxe_mixture.to_canonical(action, embodiment)

        # 3. Tokenize image, text, and action into one sequence of token ids.
        action_tokens = action_tokenizer(canonical_action)
        sequence = model.build_io_sequence(obs, lang, action_tokens)

        # 4. Standard next-token cross-entropy: position t predicts token t + 1.
        logits = model(sequence)  # shape (batch, seq_len, vocab)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            sequence[:, 1:].reshape(-1),
        )
    return step

Code Fragment 24.4.2: RT-2-X's training loop is unremarkable. The data plumbing (step 1, sampling from 22 robots; step 2, canonicalizing action spaces) is the entire contribution. The model code is RT-2 byte-for-byte.

24.4.5 What the Curves Imply for Practitioners

If you are a robotics team in 2026 planning a VLA project, the scaling curves give three actionable rules. First, do not train a VLA on a single robot if you can avoid it; even a small admixture of OXE data (10 to 20 percent) measurably lifts out-of-distribution generalization. Second, the marginal value of collecting your 1,001st demonstration on your specific robot is much higher than the marginal value of the 100,001st OXE trajectory; small high-quality data scoped to your task complements rather than competes with the OXE base. Third, language quality matters disproportionately: trajectories with carefully worded instructions are worth roughly 3x what trajectories with auto-generated or noisy instructions are worth, in terms of downstream success rate. The implication is that hiring a contractor to relabel your demonstrations is often the highest-ROI data work you can do.

Real-World Scenario: The OXE-plus-1k recipe

The empirically dominant recipe for shipping a VLA in 2026 is: (a) start from OpenVLA or pi-0-fast as the pretrained checkpoint; (b) collect ~1,000 high-quality teleop demonstrations on your specific robot doing your specific task family; (c) LoRA-finetune the pretrained model on a 90/10 mixture of your data and OXE data, where the 10 percent OXE share acts as replay to limit catastrophic forgetting of the pretrained skills. This recipe consistently delivers 75-85 percent success rates on production manipulation tasks, with under $20k in data-collection cost and under $5k in compute. It is the closest thing the field has to a "just works" pattern.
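Step (c) amounts to a two-source sampler. A minimal sketch follows, assuming in-memory lists of examples and a hypothetical `mixed_batches` helper; neither is part of OpenVLA or any specific training stack, and a real pipeline would stream from disk.

```python
import random

def mixed_batches(own_data, oxe_data, batch_size, own_fraction=0.9, seed=0):
    """Yield batches whose examples come from own_data with probability own_fraction."""
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(own_data) if rng.random() < own_fraction else rng.choice(oxe_data)
            for _ in range(batch_size)
        ]

own = [("own", i) for i in range(1_000)]    # stand-in for ~1,000 task demonstrations
oxe = [("oxe", i) for i in range(10_000)]   # stand-in for a slice of OXE

batch = next(mixed_batches(own, oxe, batch_size=2_000))
own_share = sum(source == "own" for source, _ in batch) / len(batch)
print(f"share of own-robot examples in the batch: {own_share:.2f}")
```

Sampling per example rather than per batch keeps every gradient step exposed to both sources, which is the point of mixing in the OXE data.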

24.4.6 The Data Frontier and What 2026 Looks Like

The OXE v1 mixture froze in 2024. Two efforts are pushing the data frontier in 2026. The DROID dataset (Khazatsky et al., 2024) adds 76,000 trajectories of contact-rich manipulation collected across 13 institutions; it is approximately the same scale as BridgeData-V2 but with substantially higher dexterity content. The AutoRT effort at Google DeepMind (Ahn et al., 2024) uses LLMs to autonomously direct robots in real-world settings and reported roughly 77,000 trajectories collected over about seven months by a fleet of mobile manipulators in office environments. The combination of these two streams, plus the unreleased Physical Intelligence teleop corpus, is plausibly pushing the effective training-data scale past 5 million trajectories by mid-2026. Extrapolating the fit from Section 24.4.3 to that scale predicts an out-of-distribution success rate of about 75 percent; reaching 80 percent on the same curve would take on the order of 14 million trajectories.

Research Frontier: Synthetic data: the open question

Is robot data subject to the same "synthetic data wall" debates as text data? Early evidence suggests yes but with a different shape. Simulators (Isaac Sim, MuJoCo, ManiSkill) can generate trajectories cheaply, but the sim-to-real gap (covered in detail in Section 24.13) means raw synthetic trajectories produce policies that fail on real hardware. The current consensus is that synthetic data is useful as a pretraining ingredient (it teaches the model basic physics) but cannot replace real demonstrations for the action distribution. This is roughly analogous to the text setting, where synthetic data is useful for instruction tuning but cannot replace the diversity of web text.

24.4.7 When RT-2-X Is the Wrong Answer

The RT-2-X family is not open-weights. Google has released the training methodology and partial reproductions exist (OpenVLA is the closest open-weights cousin), but the production RT-2-X checkpoints remain proprietary. For most practical purposes you should treat RT-2-X as the research artifact that established the scaling laws and use OpenVLA or pi-0 as the working implementation. The exceptions are (a) you have access to a frontier-lab partnership, (b) you are explicitly studying the scaling laws and need to reproduce them, or (c) you are at one of the contributing institutions and have inherited a working RT-2-X-class stack. In all three cases the architectural details in this section transfer directly.

Key Takeaway

RT-2-X is RT-2 trained on Open X-Embodiment. The architectural delta is zero; the data delta is the entire story. Out-of-distribution success rate scales as a power law in the number of cross-embodiment training trajectories with exponent roughly 0.21, meaning each doubling cuts the remaining failure rate by about 13 percent (roughly 5 percentage points at the full-OXE scale). The practical implication is that "more diverse robots in training" beats "bigger model" by a wide margin for any team within shouting distance of the open-weights frontier.

Self-Check

Q1: Suppose your team has 5,000 demonstrations of a Franka arm. Using the scaling curve SR_OOD ~ 1 - 6.3 * N^{-0.21}, estimate your expected out-of-distribution success rate. How much would you need to collect to reach 80 percent?

Answer

Plug in $N = 5{,}000$: $N^{-0.21} = e^{-0.21 \ln 5000} \approx e^{-1.79} \approx 0.167$, so $SR_{OOD} \approx 1 - 6.3 \times 0.167 \approx -0.05$. A negative success rate is an artifact of extrapolation: the curve was fit on cross-embodiment mixtures of hundreds of thousands of trajectories and only turns positive above $N \approx 6{,}400$, so at 5k single-robot trajectories the right reading is "OOD performance will be poor" (the table in Section 24.4.3 lists 22% at 10k single-robot trajectories). To reach 80%, solve $0.2 = 6.3 \times N^{-0.21}$, which gives $N = (6.3 / 0.2)^{1/0.21} = 31.5^{4.76} \approx 1.4 \times 10^{7}$, or roughly 14 million trajectories. The honest takeaway is that single-robot data does not scale to OOD; you need OXE diversity to enter the regime where this curve applies.
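The inversion in this answer is easy to check numerically. The helper name below is illustrative; the constants are those of the fitted curve in the question.

```python
def trajectories_needed(target_sr, c=6.3, alpha=0.21):
    """Solve target_sr = 1 - c * N^(-alpha) for N."""
    return (c / (1.0 - target_sr)) ** (1.0 / alpha)

n_80 = trajectories_needed(0.80)
n_zero = trajectories_needed(0.0)  # N at which the fitted curve crosses zero

print(f"N for 80% OOD success: {n_80:,.0f}")
print(f"fitted curve is positive only above N = {n_zero:,.0f}")
```

Because the exponent 1/alpha is close to 5, small changes in the target success rate move the required N by large factors, so estimates like this are order-of-magnitude guides at best.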

Q2: Why is the "canonical 7-D action space" trick essential for cross-embodiment training? What goes wrong if you let each robot keep its native action space and add a per-robot output head?

Answer

A canonical action space lets the transformer treat all 22 robots as draws from a single distribution; cross-robot patterns (a grasp motion looks similar across embodiments) become visible to the loss and gradients are shared across the corpus. Per-robot output heads partition the parameters: each head sees only the trajectories from its own robot, so the cross-embodiment generalization that drives the scaling law disappears. The trick is the same one BPE played for multilingual text: hide the messy heterogeneity (Unicode encodings, action conventions) inside the tokenizer/canonicalizer so the model sees a uniform token stream. Without it, "training on 22 robots" reduces to "training 22 separate small policies that happen to share a trunk".

Q3: The DROID dataset is 76k trajectories; the OXE base is ~970k. By the scaling law, what fraction of out-of-distribution success-rate improvement does DROID add? (Hint: think in log scale.)

Answer

Adding 76k to 970k brings the total to roughly 1.05M, a 1.08x increase. Under $SR_{OOD} = 1 - C N^{-\alpha}$ the gap to perfect success, $1 - SR_{OOD}$, scales as $N^{-0.21}$, so going from $N$ to $N'$ multiplies the gap by $(N'/N)^{-0.21}$. Here $(1.05 / 0.97)^{-0.21} \approx 1.08^{-0.21} \approx 0.984$, so the gap shrinks by about 1.6%. At the current OOD success of about 63% the gap is 37 percentage points, so the improvement is roughly 0.6 percentage points. DROID's value is not in raw count but in dexterity content; data quality and task diversity matter more than the log-scale numbers suggest for the kinds of contact-rich manipulation DROID emphasizes.

What's Next

Continue to Section 24.5: Comparing VLA Models.

Sections 24.2-24.4 covered three concrete VLA models (OpenVLA, pi-0, RT-2-X). Section 24.5 lines them up side by side with a capability matrix that lets you pick the right model for a given application. Then Section 24.6 turns to the limitations that all three share.

Further Reading

Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. "ICRA 2024, arXiv:2310.08864".

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. "arXiv:2307.15818".

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. "arXiv:2406.09246".

Khazatsky, A., et al. (2024). DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. "RSS 2024, arXiv:2403.12945".

Ahn, M., et al. (2024). AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. "arXiv:2401.12963".

Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. "NeurIPS 2022, arXiv:2203.15556".

Walke, H., et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. "CoRL 2023, arXiv:2308.12952".