Sim-to-Real Gap

Section 24.13

"Simulation said yes. Reality said maybe. The gap between yes and maybe is where the PhD thesis lives."

EvalEval, Sim2Real-Pessimist AI Agent
Big Picture

A VLA is a Vision-Language-Action model (an LLM-style policy that maps images and text instructions to robot motor commands, covered in Sections 24.1-24.4); "sim-to-real" is the standing problem of getting a policy trained in a physics simulator to behave correctly on a physical robot. Every working robotics deployment crosses the sim-to-real gap one way or another. The pattern matters because (a) simulation is cheap and real-robot time is expensive, (b) modern VLAs and LLM planners can be developed entirely in sim until the last mile, and (c) the gap-closing techniques (domain randomization, system identification, real-world fine-tuning, hardware-in-the-loop) compose into a deployment playbook that has stabilized in 2025-2026. This section is that playbook. It is the bridge between the architectural content of Sections 24.1-24.6 and the actual deployment of a robot that customers use.

Prerequisites

This section assumes the VLA architectures from Section 24.1 through Section 24.4. Evaluation methodology for embodied agents is covered in detail later in the book.

24.13.1 Anatomy of the Gap

Fun Fact

Domain randomization (Tobin et al., 2017) was popularized by OpenAI's robot-hand work, which trained a policy to solve a Rubik's cube one-handed in simulation by randomizing literally every simulator parameter the team could think of, including gravity, friction, and the masses of individual cube facelets. The trained policy famously did not need real-world data to transfer, but it did need the team to break and replace the physical hand multiple times because the policy was forceful in ways the hardware engineers had not anticipated.

The sim-to-real gap decomposes into four independent error sources. Knowing which dominates your task is the first step in closing the gap.

| Error source | What it is | Hardest tasks | Best mitigation |
| --- | --- | --- | --- |
| Visual domain shift | Rendered images differ from real camera output | Open environments, varied lighting | Domain randomization + real-image co-training |
| Contact dynamics | Simulated friction/stiction is approximated | Cloth, granular materials, contact-rich assembly | System identification + real-data fine-tune |
| Actuator latency | Real motors have command-to-motion delay | High-speed control, dynamic balance | Latency modeling + reactive policy training |
| Sensor noise | Real sensors have bias, dropouts, calibration drift | Force/torque tasks, depth-based perception | Noise injection during training + filtering |
Table 24.13.1: The four error sources composing the sim-to-real gap. Most deployments are dominated by one or two sources; identifying which is the prerequisite for choosing a mitigation strategy.
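
Two of the mitigations in Table 24.13.1, latency modeling and noise injection, amount to thin wrappers around the simulator's step loop. The sketch below is a minimal illustration using only the standard library; `ToySim`, `LatencyNoiseWrapper`, and all parameter names are hypothetical and do not correspond to any particular simulator's API.

```python
import random
from collections import deque


class ToySim:
    """Hypothetical 1-D simulator: the observation is a position that integrates the action."""

    def __init__(self):
        self.x = 0.0

    def step(self, action):
        self.x += action
        return self.x


class LatencyNoiseWrapper:
    """Delay actions by a fixed number of steps and corrupt observations."""

    def __init__(self, sim, delay_steps=2, noise_sigma=0.0, dropout_p=0.0, seed=0):
        self.sim = sim
        self.rng = random.Random(seed)
        self.noise_sigma = noise_sigma
        self.dropout_p = dropout_p
        # Pre-fill the queue with zero actions so the first real command
        # reaches the simulator only after `delay_steps` steps.
        self.pending = deque([0.0] * delay_steps)
        self.last_obs = 0.0

    def step(self, action):
        self.pending.append(action)
        delayed = self.pending.popleft()
        obs = self.sim.step(delayed)
        if self.rng.random() < self.dropout_p:
            return self.last_obs  # dropout: repeat the previous reading
        noisy = obs + self.rng.gauss(0.0, self.noise_sigma)
        self.last_obs = noisy
        return noisy


# With noise and dropout switched off, only the two-step delay is visible.
env = LatencyNoiseWrapper(ToySim(), delay_steps=2)
readings = [env.step(1.0) for _ in range(4)]
print(readings)  # [0.0, 0.0, 1.0, 2.0]
```

During training, `delay_steps`, `noise_sigma`, and `dropout_p` would themselves be resampled per episode, so that the policy sees a range of latencies and noise levels rather than one fixed setting.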

24.13.2 Domain Randomization, the Workhorse

A giant hand vigorously shakes a snow globe that holds a tiny robot arm and table, with the words lighting, friction, mass, and camera_jitter swirling around inside while a calm researcher outside holds a clipboard labelled Domain Randomization.
Figure 24.13.1a: Domain randomization shakes the simulation until the model gives up on memorizing any one configuration and learns the underlying skill.

The most reliable bridging technique is domain randomization (Tobin et al., 2017, arXiv:1703.06907): randomize every simulator parameter that could differ from reality, train across the union of these variations, and rely on the policy to learn features that are invariant to the randomized properties. The randomized parameters depend on the task. For manipulation, the standard set is lighting (brightness, color temperature, shadow directions), camera position (pose jitter), object texture, object mass and friction, gripper compliance, and controller PID gains.

import numpy as np

class DomainRandomizer:
    """Sample a randomized simulator config for each training episode."""

    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng()

    def sample(self):
        return {
            "lighting": {
                "intensity": self.rng.uniform(0.3, 2.0),
                "color_temp_k": self.rng.uniform(3000, 7000),
                "shadow_direction": self.rng.uniform(-np.pi, np.pi),
            },
            "camera": {
                "position_jitter_m": self.rng.uniform(-0.05, 0.05, size=3),
                "focal_length_jitter": self.rng.uniform(0.95, 1.05),
            },
            "object": {
                "mass_kg": self.rng.uniform(0.05, 2.0),
                "friction": self.rng.uniform(0.3, 1.2),
                "texture_id": self.rng.integers(0, 1000),
            },
            "controller": {
                "kp_jitter": self.rng.uniform(0.7, 1.3),
                "latency_ms": self.rng.uniform(5, 40),
            },
            "sensor_noise": {
                "force_torque_sigma": self.rng.uniform(0.0, 0.5),
                "depth_dropout": self.rng.uniform(0.0, 0.1),
            },
        }
Code Fragment 24.13.1b: A representative domain-randomization configuration. The numerical ranges should span the variation a real robot encounters; tighter ranges fail to bridge the gap, wider ranges produce policies that are too conservative. The ranges in this fragment are illustrative starting points in the spirit of Tobin et al., whose randomization was primarily visual, extended here with dynamics, latency, and sensor-noise terms; tune them per task.
Key Insight: DR alone is not enough in 2026

The original sim-to-real-via-DR papers (2017-2019) demonstrated zero-shot transfer for navigation and simple pick-and-place. For 2026's contact-rich and deformable-object manipulation tasks, domain randomization alone closes only about 60-70 percent of the gap. The remaining 30-40 percent requires a co-training step with real-world data. The working recipe is "train on DR sim, fine-tune on a few hundred real demos". The second half of that recipe mirrors how VLAs such as OpenVLA and pi-0, which are themselves pretrained largely on real-robot data, are adapted to a new robot from a small set of real demonstrations.
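
The co-training step can be sketched as a batch sampler that reserves a fixed fraction of every batch for the small real-demonstration set. This is a minimal illustration, not the sampler of any cited system; the function name `mixed_batch` and the 25 percent real fraction are assumptions chosen for the example.

```python
import random


def mixed_batch(sim_data, real_data, batch_size, real_fraction, rng):
    """Return a batch with round(batch_size * real_fraction) real samples.

    Real samples are drawn with replacement because the real set is small
    (a few hundred demos) compared with the simulated set.
    """
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    batch = [rng.choice(real_data) for _ in range(n_real)]
    batch += [rng.choice(sim_data) for _ in range(n_sim)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
sim_data = [("sim", i) for i in range(10_000)]   # stand-in for sim episodes
real_data = [("real", i) for i in range(300)]    # stand-in for teleop demos
batch = mixed_batch(sim_data, real_data, batch_size=32, real_fraction=0.25, rng=rng)
n_real = sum(1 for source, _ in batch if source == "real")
print(n_real, len(batch))  # 8 32
```

Fixing the fraction per batch, rather than concatenating the two datasets, keeps the few hundred real demonstrations from being drowned out by the much larger simulated set.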

24.13.3 System Identification

The complement to randomization is identification: measure your real robot's properties (joint friction, controller gains, camera-to-base calibration) and set the simulator to match. If domain randomization is "train across many possible worlds", system identification is "shrink the variation by knowing the real world more precisely". Modern toolchains automate this: you run a calibration procedure on the real robot, the toolchain fits simulator parameters to the measurements, and the trained policy starts much closer to the real robot's actual dynamics.

# Sketch of automated system identification for a robot arm.
import numpy as np
from scipy.optimize import minimize

def identify_joint_friction(real_trajectories, sim):
    """Fit simulator joint friction coefficients to real trajectory data.

    real_trajectories: list of (t, q, qdot, tau) on the real robot
    sim: simulator object with friction setter and step method
    """
    def loss(friction):
        sim.set_friction(friction)
        err = 0.0
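        for t, q, qdot, tau in real_trajectories:
            sim.reset(q=q[0], qdot=qdot[0])
            sim_q = []
            # Assumes q and tau are sampled at the same T timestamps.
            # Applying tau[k] takes the state from sample k to sample k + 1,
            # so the last torque has no measured successor and is skipped.
            for tau_step in tau[:-1]:
                sim.step(tau_step)
                sim_q.append(sim.get_q())
            err += np.mean((np.array(sim_q) - q[1:]) ** 2)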
        return err / len(real_trajectories)

    result = minimize(loss, x0=np.ones(7) * 0.5, bounds=[(0.0, 2.0)] * 7)
    return result.x
Code Fragment 24.13.2: System identification for a 7-DOF arm. The script feeds real-robot torque commands into the simulator, measures the simulator's joint-trajectory error against the real data, and optimizes friction coefficients to minimize the error. The result is a simulator that matches your specific physical robot, not a generic robot of the same model. This step alone closes ~15 pp of the sim-to-real gap on dynamics-sensitive tasks.
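
To make the fitting loop concrete without a real simulator, the following toy example identifies a single viscous-friction coefficient for a 1-DOF joint. Everything here is an assumption chosen to keep the sketch dependency-free: the `simulate` function, the unit inertia, the time step, the constant torque command, and the golden-section search that stands in for `scipy.optimize.minimize`.

```python
import math


def simulate(friction, torques, dt=0.01, inertia=1.0):
    """Integrate qddot = (tau - friction * qdot) / inertia with explicit Euler."""
    q, qdot, qs = 0.0, 0.0, []
    for tau in torques:
        qddot = (tau - friction * qdot) / inertia
        qdot += qddot * dt
        q += qdot * dt
        qs.append(q)
    return qs


def trajectory_loss(friction, torques, q_real):
    """Mean squared joint-position error between simulation and measurement."""
    q_sim = simulate(friction, torques)
    return sum((a - b) ** 2 for a, b in zip(q_sim, q_real)) / len(q_real)


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal scalar function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return (lo + hi) / 2


# Stand-in for real-robot data: a constant torque step and the trajectory
# produced by a "true" friction value of 0.8.
torques = [1.0] * 500
q_real = simulate(0.8, torques)

fitted = golden_section_min(lambda f: trajectory_loss(f, torques, q_real), 0.0, 2.0)
print(round(fitted, 3))
```

Because the "real" data comes from the same model class, the fit recovers the true value almost exactly. On hardware, the residual error that remains after fitting is a useful signal of unmodeled effects such as stiction or backlash.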

24.13.4 Real-World Fine-Tuning

The third pillar is fine-tuning on real-world demonstrations. The recipe is straightforward: take the sim-trained policy, collect 100-1,000 real teleoperated demonstrations on the deployment robot, and fine-tune the policy on that data with a small learning rate. The result is a policy that retains the broad capability from sim training but adapts to the specific quirks of the deployment hardware. For VLAs, the LoRA recipe from Section 24.2 applies directly. For LLM planners, the fine-tuning step is usually unnecessary because the planner already operates at an abstraction layer above the dynamics gap.

| Recipe | Sim-only success (%) | Real success (%) | Gap (pp) |
| --- | --- | --- | --- |
| Pure sim training (no DR) | 92 | 34 | 58 |
| Sim + domain randomization | 89 | 63 | 26 |
| Sim + DR + system ID | 89 | 72 | 17 |
| Sim + DR + 100 real demos fine-tune | 87 | 78 | 9 |
| Sim + DR + 1000 real demos fine-tune | 86 | 84 | 2 |
| Real-only training (10k demos) | n/a | 81 | n/a |
Table 24.13.2a: Sim-to-real transfer recipes ordered by real-world success. Adding domain randomization closes the gap from 58 to 26 points; system identification trims another 9; just 100 real teleop demonstrations cut the remaining gap nearly in half. The bottom row shows that 10,000 real-only demonstrations roughly match the best sim-augmented recipe, illustrating the data-cost argument for the sim-plus-real stack.
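
The gap column and the step-by-step claims in the caption follow from simple subtraction. The short script below recomputes them from the success rates in Table 24.13.2a; the variable names are illustrative.

```python
# (recipe, sim success %, real success %) copied from Table 24.13.2a.
rows = [
    ("Pure sim training (no DR)", 92, 34),
    ("Sim + domain randomization", 89, 63),
    ("Sim + DR + system ID", 89, 72),
    ("Sim + DR + 100 real demos fine-tune", 87, 78),
    ("Sim + DR + 1000 real demos fine-tune", 86, 84),
]

# Gap in percentage points = sim success minus real success.
gaps = [sim - real for _, sim, real in rows]
# Points of gap closed by each recipe relative to the row above it.
closed = [prev - cur for prev, cur in zip(gaps, gaps[1:])]

for (name, _, _), gap in zip(rows, gaps):
    print(f"{name:40s} gap = {gap:2d} pp")
print("closed per step:", closed)  # [32, 9, 8, 7]
```

The third entry is the "nearly in half" claim: 100 real demonstrations remove 8 of the 17 points that remain after system identification.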

24.13.5 Hardware-in-the-Loop Validation

The final piece is the validation harness. A production deployment cannot push a new policy version directly to customer-facing robots; it needs a hardware-in-the-loop (HIL) stage where the policy runs on real hardware in a controlled lab environment against a regression test suite of tasks. The 2026 best practice is a CI/CD pipeline where: (a) each policy commit triggers a sim regression test, (b) sim-passing commits proceed to a HIL test on a single lab robot, (c) HIL-passing commits proceed to a canary deployment on 1-5 percent of fleet robots, (d) canary-passing commits roll out to the full fleet. This is the same canary-deployment pattern used in software engineering, transplanted to robotics.
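
The four-stage promotion flow described above can be sketched as an ordered list of gates, where a commit stops at the first gate it fails. The `promote` function and the lambda gates are hypothetical placeholders; in a real pipeline each gate would call out to the sim farm, the lab robot, or the fleet monitoring system.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], bool]]


def promote(commit: str, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (fully_rolled_out, names_of_stages_passed).
    """
    passed = []
    for name, gate in stages:
        if not gate(commit):
            return False, passed
        passed.append(name)
    return True, passed


# Hypothetical gates; the HIL gate rejects one specific commit for illustration.
stages = [
    ("sim_regression", lambda commit: True),
    ("hil_lab_robot", lambda commit: commit != "bad-hil"),
    ("canary_1_to_5_percent", lambda commit: True),
    ("full_fleet", lambda commit: True),
]

print(promote("abc123", stages))   # passes every stage
print(promote("bad-hil", stages))  # stops after sim_regression
```

Ordering the gates from cheapest to most expensive means most bad commits are rejected in simulation, before they consume any real-robot time.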

# Sketch of a HIL regression test harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class HilTask:
    name: str
    instruction: str
    scene_setup_fn: Callable[[], None]    # physically configures the lab
    success_check_fn: Callable[[], bool]  # observes the lab after the run
    timeout_s: float = 120.0

def run_hil_regression(policy, tasks: list[HilTask], required_pass_rate=0.85):
    if not tasks:
        raise ValueError("HIL regression needs at least one task")
    results = []
    for task in tasks:
        task.scene_setup_fn()
        policy.run(task.instruction, timeout=task.timeout_s)
        success = bool(task.success_check_fn())
        results.append((task.name, success))
    pass_rate = sum(s for _, s in results) / len(results)
    # Raise instead of assert: assertions are stripped under `python -O`,
    # which would silently disable the release gate.
    if pass_rate < required_pass_rate:
        raise RuntimeError(
            f"HIL regression failed: {pass_rate:.2%} < {required_pass_rate:.2%}"
        )
    return results
Code Fragment 24.13.3: A HIL regression test harness. Each task has a setup function (puts the lab in a known state), a success check (observes the lab after the run), and a timeout. The required-pass-rate threshold guards against silent regressions that would otherwise ship to production. Many production teams in 2026 maintain HIL suites of 50-200 tasks running nightly.
Warning: Test setup automation is the bottleneck

The scene-setup function in Code Fragment 24.13.3 is the practical bottleneck: how do you reset the lab to a known state between tasks? Three patterns exist. (a) Robot-resets-itself: a separate reset policy puts objects back where they belong; this works for roughly 50 percent of tasks. (b) Human-resets: a lab technician resets the scene between batches; this is expensive and caps the test cadence. (c) Disposable scenes: each task uses fresh objects; this is expensive in materials but reliable. The choice among these is one of the under-appreciated line items in a robotics team's budget.

24.13.6 Deployment Patterns in 2026

The deployment playbook that has stabilized in 2026 has five stages: (1) develop the policy in pure simulation with aggressive domain randomization; (2) calibrate simulator parameters to the specific deployment hardware via system identification; (3) fine-tune on 100-1000 real demonstrations from the deployment robot; (4) validate on a HIL regression suite; (5) canary deploy and monitor a small fraction of the fleet before full rollout. The same five stages apply whether you are deploying a VLA, a SayCan planner, or a hybrid stack.
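
For stage 5, one common way to choose the canary cohort is to hash each robot's identifier together with the policy version, which gives a stable and reproducible assignment. The sketch below assumes string robot IDs and a hypothetical `in_canary` helper; it is an illustration, not a description of any particular fleet manager.

```python
import hashlib


def in_canary(robot_id: str, policy_version: str, fraction: float) -> bool:
    """Deterministically assign a robot to the canary cohort.

    Hashing (version, robot_id) gives a stable pseudo-random bucket in
    [0, 1), so the same robot gets the same answer on every call, while
    different policy versions generally select different cohorts.
    """
    digest = hashlib.sha256(f"{policy_version}:{robot_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


fleet = [f"robot-{i:04d}" for i in range(2000)]
canary = [r for r in fleet if in_canary(r, "policy-v42", fraction=0.05)]
print(len(canary), "of", len(fleet), "robots in the canary cohort")
```

Rotating the cohort with the policy version keeps the same few robots from absorbing the risk of every release.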

| Stage | Compute cost | Real-robot time | Engineering team size |
| --- | --- | --- | --- |
| 1. Sim training with DR | 100-1000 GPU-days | 0 | 2-3 ML engineers |
| 2. System ID | 1-10 GPU-days | ~10 hours | 1 robotics engineer |
| 3. Real-data fine-tune | 1-10 GPU-days | 40-200 hours teleop | 1 ML engineer + 1-2 teleoperators |
| 4. HIL validation | ~0 | 1-10 robot-hours / commit | 1 SRE |
| 5. Canary deploy | ~0 | continuous monitoring | 1 SRE + on-call |
Table 24.13.2b: Cost breakdown of the five deployment stages. Stage 3 (real-data fine-tune) is the largest real-robot-time cost; stage 1 (sim training) is the largest compute cost. Teams that skip stages 2 and 3 typically discover they need them after a failed canary deploy, which is more expensive than doing them upfront.
Research Frontier: The "zero-shot real-world" dream

The asymptote that the field is reaching toward is "train entirely in simulation, deploy directly to any real robot with no real-world fine-tuning". As of Q1 2026 this is achievable for navigation in structured indoor environments, but not for manipulation. Two threads are pushing toward closing it. (a) Higher-fidelity simulators (NVIDIA Isaac Sim's GPU-accelerated cloth and fluid simulation; Genesis, an open-source physics engine released in 2024 by a multi-institution academic collaboration) reduce the dynamics gap. (b) Vision foundation models pretrained on real-world images and fine-tuned to interpret rendered images close the visual gap. The optimistic forecast is that "zero-shot manipulation" becomes practical in 2027-2028 for simple tasks; pessimistic forecasts see the contact-rich gap persisting much longer.

Key Takeaway

The sim-to-real gap decomposes into visual, contact-dynamics, actuator-latency, and sensor-noise errors. The 2026 deployment playbook closes the gap in five stages: domain-randomized sim training, system identification, real-data fine-tuning (100-1000 demos), HIL regression testing, and canary deployment. The combined recipe gets you from 34 percent real-world success (pure sim training) to 78-84 percent (full pipeline) using 100-1000 real demonstrations instead of the 10,000 used for pure-real training, that is, one-tenth of the real-robot data or less.

Lab
Run an Open-X-Embodiment Policy on a Simulated Pick-and-Place
Duration: ~60 minutes | Difficulty: Intermediate

Objective

Load an OpenVLA-7B checkpoint (trained on the Open X-Embodiment dataset, Padalkar et al. 2023) and run it on a simulated pick-and-place task in the LIBERO benchmark. Measure success rate over 50 trial rollouts under two task variants (in-distribution and a held-out object category). The point is to feel the sim2real gap from the inside: how a frontier VLA performs in simulation, and where the manipulation-policy failure modes from this section actually show up.

Setup

You need a CUDA-capable GPU (about 16 GB or more for a 7B model in fp16, since 7 billion parameters at 2 bytes each already occupy roughly 14 GB; smaller cards are workable only with int8 or 4-bit quantization), the OpenVLA-7B checkpoint (openvla/openvla-7b on Hugging Face, paper at arXiv:2406.09246), and the LIBERO benchmark (libero-project.github.io). For learners without a GPU of that size, the smaller RT-1-X policy and the simpler ManiSkill2 environments are an alternative.

pip install transformers accelerate torch torchvision robosuite libero gymnasium

Steps

  1. Install LIBERO and pick two task suites: the libero_spatial suite for in-distribution pick-and-place (training data has seen the object categories) and libero_object for the out-of-distribution variant (new object set).
  2. Load OpenVLA-7B in fp16 (or int8 if VRAM is tight). The model takes an RGB image and a natural-language instruction and produces one 7-DoF action per step (an end-effector pose delta plus a gripper command).
  3. Run 25 rollouts in each suite. Each rollout is up to 400 environment steps; success is gripper-contact with the target object plus successful placement in the target zone within episode horizon.
  4. Tally success rate and the failure breakdown. The interesting categorization is grasp failure (touched but did not lift), trajectory failure (dropped en route), and placement failure (released in the wrong location). Each category points at a different stage of the VLA pipeline.
  5. Apply domain randomization at inference by perturbing the camera pose by plus or minus 5 cm and rerunning. The drop in success rate is the empirical version of the sim2real gap analysis from subsection 24.13.1.
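
The tally in step 4 can be sketched as follows. The rollout records are made-up placeholders showing the data shape only, not measured results, and the `summarize` and `to_csv` helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rollout records: (suite, success, failure_category).
# failure_category is "" for successful rollouts.
rollouts = [
    ("libero_spatial", True, ""),
    ("libero_spatial", False, "grasp"),
    ("libero_spatial", True, ""),
    ("libero_object", False, "placement"),
    ("libero_object", False, "trajectory"),
    ("libero_object", True, ""),
]


def summarize(records):
    """Return {suite: (success_rate, Counter of failure categories)}."""
    by_suite = {}
    for suite, success, category in records:
        wins, total, failures = by_suite.get(suite, (0, 0, Counter()))
        if not success:
            failures[category] += 1
        by_suite[suite] = (wins + int(success), total + 1, failures)
    return {s: (w / t, f) for s, (w, t, f) in by_suite.items()}


def to_csv(records):
    """Serialize per-rollout results, one row per rollout."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["suite", "success", "failure_category"])
    writer.writerows(records)
    return buf.getvalue()


summary = summarize(rollouts)
for suite, (rate, failures) in summary.items():
    print(f"{suite}: success rate {rate:.2f}, failures {dict(failures)}")
```

Keeping the failure category in the same record as the success flag makes it easy to compare the breakdown before and after the camera perturbation in step 5.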

Expected Output

A CSV of per-rollout success flags and failure categories. As a rough guide (exact numbers depend on the checkpoint and on whether it was fine-tuned on LIBERO), expect success rates around 75 to 85 percent in-distribution on LIBERO-spatial and 50 to 65 percent out-of-distribution on LIBERO-object; the camera-perturbation rerun typically drops both numbers by 10 to 15 percentage points, which is the closest single-experiment proxy for the sim2real gap.
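
With only 25 rollouts per suite, sampling noise is comparable in size to the effect being measured, so it helps to report an interval alongside each success rate. The sketch below uses the standard Wilson score interval; the 20-of-25 figure is an illustrative input, not a measured result.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """95 percent Wilson score interval for a binomial success rate."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(20, 25)  # 80 percent observed success
print(f"{low:.2f} to {high:.2f}")  # 0.61 to 0.91
```

An interval roughly 30 points wide means that a drop of ten or so percentage points between two 25-rollout runs is suggestive rather than conclusive; more rollouts per condition tighten the estimate.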

Extension

Fine-tune OpenVLA-7B with 10 demonstrations on a single LIBERO-object task using LoRA, and re-evaluate; the standard finding is that a few-shot fine-tune recovers most of the out-of-distribution gap at a small fraction of the cost of training a new policy.

Self-Check
Q1: For each of the four sim-to-real error sources in Table 24.13.1, identify which of the five deployment stages most directly mitigates it.
Answer:
Visual error (lighting, texture, camera) is mitigated by stage one, domain-randomized sim training; you spread the visual distribution over a wider envelope than reality, so the real-world distribution falls inside it. Contact-dynamics error (friction, compliance) is mitigated by stage two, system identification; you fit a small parameter set (mass, friction, damping) from real measurements and inject it into the simulator. Actuator-latency error is mitigated by stage three, real-data fine-tuning, because latency interacts with the policy in ways that no calibration captures cleanly. Sensor-noise error is mitigated by stages one and four together: randomization handles the noise distribution, and HIL regression testing catches the long-tail noise events the randomization missed.
Q2: Estimate the wall-clock time for a single iteration of the full deployment pipeline (stages 1-5) for a new robot model, assuming a 4-person team and one lab robot.
Answer:
Stage one (domain-randomized sim training) takes one to three weeks of cluster time but is overlapped with stages two and three; call it three weeks of wall-clock. Stage two (system identification) is two days of real-robot measurements plus three days of parameter fitting. Stage three (real-data fine-tuning on 100-1000 demos) is one to two weeks for data collection on a single robot and two days of training. Stage four (HIL regression) is one week to author the test suite and one day per pipeline cycle thereafter. Stage five (canary deployment) is two weeks of progressively expanded rollout with daily review. End-to-end, expect six to eight weeks for the first iteration; the second iteration drops to two to three weeks because the test suite and fine-tuning infrastructure are reusable.
Q3: The "zero-shot real-world" dream is closer for navigation than for manipulation. Why? Tie your answer to which error sources dominate each task category.
Answer:
Navigation is dominated by visual error and high-level geometry, both of which simulators model well and randomization covers cheaply; contact dynamics enter only at the wheel-ground interface and are coarsely correct in standard physics engines. Manipulation is dominated by contact dynamics: friction, deformation, slip, and tactile noise interact in ways that simulators still get wrong by 10 to 30 percent of the realistic distribution. The mismatch shows up directly in real-world success rates: navigation policies trained zero-shot in sim hit 70-85 percent real-world success, while manipulation policies hit 30-40 percent without real-data fine-tuning. Closing the contact-dynamics gap is an open research problem (differentiable physics, sim2real distillation), which is why Stage 3 fine-tuning remains mandatory for manipulation in 2026.
What's Next

Continue to Section 25.1: Platforms.

This section closes the LLM-robotics arc that Chapter 24 began. The next chapter moves to world models, the next abstraction layer that lets a policy predict the consequences of its actions before executing them. World models, sim-to-real, and VLAs all share a common substrate (next-frame prediction); seeing the connection is what makes 2026 robotics feel coherent rather than a grab bag of techniques.

Further Reading

Foundational VLA Papers

Brohan, A., et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv preprint. arXiv:2212.06817
Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. arXiv:2307.15818
Kim, M. J., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL 2024. arXiv:2406.09246
Black, K., et al. (2024). "Pi-0: A Vision-Language-Action Flow Model for General Robot Control." Physical Intelligence. arXiv:2410.24164
Covariant (2024). "RFM-1: Robotics Foundation Model." Covariant Technical Report. covariant.ai/insights/introducing-rfm-1

Simulators and Benchmarks

Makoviychuk, V., et al. (2021). "Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning." NeurIPS Datasets 2021. arXiv:2108.10470
Mittal, M., et al. (2023). "Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments." RA-L 2023. arXiv:2301.04195
Li, X., et al. (2024). "Evaluating Real-World Robot Manipulation Policies in Simulation (SimplerEnv)." CoRL 2024. arXiv:2405.05941
Mu, T., et al. (2021). "ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations." NeurIPS Datasets 2021. arXiv:2107.14483
Genesis Authors (2024). "Genesis: A Universal and Generative Physics Engine for Robotics and Beyond." genesis-world.readthedocs.io

Datasets

Khazatsky, A., et al. (2024). "DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset." RSS 2024. arXiv:2403.12945
Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. arXiv:2310.08864

Sim-to-Real Techniques

Tobin, J., et al. (2017). "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS 2017. arXiv:1703.06907
OpenAI, et al. (2019). "Solving Rubik's Cube with a Robot Hand." arXiv preprint. arXiv:1910.07113
Peng, X. B., et al. (2018). "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization." ICRA 2018. arXiv:1710.06537