"Simulation said yes. Reality said maybe. The gap between yes and maybe is where the PhD thesis lives."
Eval, Sim2Real-Pessimist AI Agent
A VLA is a Vision-Language-Action model (an LLM-style policy that maps images and text instructions to robot motor commands, covered in Sections 24.1-24.4); "sim-to-real" is the standing problem of getting a policy trained in a physics simulator to behave correctly on a physical robot. Every working robotics deployment crosses the sim-to-real gap one way or another. The pattern matters because (a) simulation is cheap and real-robot time is expensive, (b) modern VLAs and LLM planners can be developed entirely in sim until the last mile, and (c) the gap-closing techniques (domain randomization, system identification, real-world fine-tuning, hardware-in-the-loop) compose into a deployment playbook that has stabilized in 2025-2026. This section is that playbook. It is the bridge between the architectural content of Sections 24.1-24.6 and the actual deployment of a robot that customers use.
Prerequisites
This section assumes the VLA architectures from Section 24.1 through Section 24.4. Evaluation methodology for embodied agents is covered in detail later in the book.
24.13.1 Anatomy of the Gap
Domain randomization (Tobin et al., 2017) was popularized by OpenAI's robot-hand work, which trained a policy to solve a Rubik's cube one-handed in simulation by randomizing literally every simulator parameter the team could think of, including gravity, friction, and the masses of individual cube facelets. The trained policy famously did not need real-world data to transfer, but it did need the team to break and replace the physical hand multiple times because the policy was forceful in ways the hardware engineers had not anticipated.
The sim-to-real gap decomposes into four independent error sources. Knowing which dominates your task is the first step in closing the gap.
| Error source | What it is | Hardest tasks | Best mitigation |
|---|---|---|---|
| Visual domain shift | Rendered images differ from real camera output | Open environments, varied lighting | Domain randomization + real-image co-training |
| Contact dynamics | Simulated friction/stiction is approximated | Cloth, granular materials, contact-rich assembly | System identification + real-data fine-tune |
| Actuator latency | Real motors have command-to-motion delay | High-speed control, dynamic balance | Latency modeling + reactive policy training |
| Sensor noise | Real sensors have bias, dropouts, calibration drift | Force/torque tasks, depth-based perception | Noise injection during training + filtering |
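The last two rows of the table, latency modeling and noise injection, can be made concrete with a small wrapper around the simulator's command and sensor paths. The following is a minimal sketch under stated assumptions: `NoisyDelayedChannel` and its parameters are hypothetical names for illustration, latency is modeled as a whole number of control ticks, and sensor noise is zero-mean Gaussian with independent dropouts.

```python
import random
from collections import deque


class NoisyDelayedChannel:
    """Hypothetical wrapper: delays commands and corrupts sensor readings."""

    def __init__(self, delay_steps, noise_sigma, dropout_prob, seed=0):
        self.rng = random.Random(seed)
        self.noise_sigma = noise_sigma
        self.dropout_prob = dropout_prob
        # Pre-fill with zero commands so the first `delay_steps` outputs are
        # "no motion", mimicking command-to-motion latency.
        self.queue = deque([0.0] * delay_steps)

    def delay_command(self, command):
        """Return the command that was issued `delay_steps` ticks ago."""
        self.queue.append(command)
        return self.queue.popleft()

    def corrupt_reading(self, value):
        """Add Gaussian noise; return None to model a dropped sample."""
        if self.rng.random() < self.dropout_prob:
            return None
        return value + self.rng.gauss(0.0, self.noise_sigma)


channel = NoisyDelayedChannel(delay_steps=2, noise_sigma=0.05, dropout_prob=0.1)
delayed = [channel.delay_command(c) for c in [1.0, 2.0, 3.0, 4.0]]
print(delayed)  # [0.0, 0.0, 1.0, 2.0]
```

During training, the delay and noise levels would themselves be resampled per episode so the policy does not overfit to one latency or one noise floor.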
24.13.2 Domain Randomization, the Workhorse
The most reliable bridging technique is domain randomization (Tobin et al., 2017, arXiv:1703.06907): randomize every simulator parameter that could differ from reality, train across the union of these variations, and rely on the policy to learn features that are invariant to the randomized properties. The randomized parameters depend on the task. For manipulation, the standard set is lighting (brightness, color temperature, shadow directions), camera position (pose jitter), object texture, object mass and friction, gripper compliance, and controller PID gains.
import numpy as np


class DomainRandomizer:
    """Sample a randomized simulator config for each training episode."""

    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng()

    def sample(self):
        return {
            "lighting": {
                "intensity": self.rng.uniform(0.3, 2.0),
                "color_temp_k": self.rng.uniform(3000, 7000),
                "shadow_direction": self.rng.uniform(-np.pi, np.pi),
            },
            "camera": {
                "position_jitter_m": self.rng.uniform(-0.05, 0.05, size=3),
                "focal_length_jitter": self.rng.uniform(0.95, 1.05),
            },
            "object": {
                "mass_kg": self.rng.uniform(0.05, 2.0),
                "friction": self.rng.uniform(0.3, 1.2),
                "texture_id": self.rng.integers(0, 1000),
            },
            "controller": {
                "kp_jitter": self.rng.uniform(0.7, 1.3),
                "latency_ms": self.rng.uniform(5, 40),
            },
            "sensor_noise": {
                "force_torque_sigma": self.rng.uniform(0.0, 0.5),
                "depth_dropout": self.rng.uniform(0.0, 0.1),
            },
        }
The original sim-to-real-via-DR papers (2017-2019) demonstrated zero-shot transfer for navigation and simple pick-and-place. For 2026's contact-rich and deformable-object manipulation tasks, domain randomization alone closes only part of the gap, roughly 60-70 percent as a rule of thumb, and the remainder requires a co-training or fine-tuning step with real-world data. The working recipe is "train on DR sim, fine-tune on a few hundred real demos". The second half of that recipe is also how VLAs such as OpenVLA and pi-0, which are pretrained largely on real-robot data rather than on randomized simulation, are adapted to a new robot.
24.13.3 System Identification
The complement to randomization is identification: measure your real robot's properties (joint friction, controller gains, camera-to-base calibration) and set the simulator to match. If domain randomization is "train across many possible worlds", system identification is "shrink the variation by knowing the real world more precisely". Modern toolchains automate this: you run a calibration procedure on the real robot, the toolchain fits simulator parameters to the measurements, and the trained policy starts much closer to the real robot's actual dynamics.
# Sketch of automated system identification for a robot arm.
import numpy as np
from scipy.optimize import minimize


def identify_joint_friction(real_trajectories, sim):
    """Fit simulator joint friction coefficients to real trajectory data.

    real_trajectories: list of (t, q, qdot, tau) arrays logged on the real
        robot, each with T samples along the first axis
    sim: simulator object exposing set_friction, reset, step, and get_q
    """
    def loss(friction):
        sim.set_friction(friction)
        err = 0.0
        for _t, q, qdot, tau in real_trajectories:
            sim.reset(q=q[0], qdot=qdot[0])
            sim_q = []
            # Applying tau[k] moves the sim from sample k to sample k+1, so
            # the final torque has no later measurement to compare against.
            for tau_step in tau[:-1]:
                sim.step(tau_step)
                sim_q.append(sim.get_q())
            err += np.mean((np.array(sim_q) - q[1:]) ** 2)
        return err / len(real_trajectories)

    # One friction coefficient per joint of a 7-DoF arm, bounded to [0, 2].
    result = minimize(loss, x0=np.ones(7) * 0.5, bounds=[(0.0, 2.0)] * 7)
    return result.x
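Before trusting a fitting routine on hardware logs, it helps to check that it recovers a known parameter from synthetic data. The sketch below does this for a deliberately simplified case and is not the toolchain's method. It assumes a hypothetical one-joint model with purely viscous friction, `inertia * qddot = tau - b * qdot`, and swaps the iterative optimizer for a closed-form least-squares estimate of `b`. The function names, inertia, and time step are all illustrative.

```python
import numpy as np


def simulate_joint(b, torques, inertia=0.1, dt=0.001):
    """Euler-integrate a 1-DoF joint: inertia * qddot = tau - b * qdot."""
    qdot = 0.0
    velocities = [qdot]
    for tau in torques:
        qddot = (tau - b * qdot) / inertia
        qdot += qddot * dt
        velocities.append(qdot)
    return np.array(velocities)


def fit_viscous_friction(velocities, torques, inertia=0.1, dt=0.001):
    """Least-squares estimate of b from tau - inertia * qddot = b * qdot."""
    qddot = np.diff(velocities) / dt   # finite-difference acceleration
    qdot = velocities[:-1]             # velocity at the start of each step
    residual_torque = np.asarray(torques) - inertia * qddot
    return float(qdot @ residual_torque / (qdot @ qdot))


true_b = 0.35
torques = np.sin(np.linspace(0.0, 10.0, 2000))  # excitation signal
measured = simulate_joint(true_b, torques)
estimate = fit_viscous_friction(measured, torques)
print(round(estimate, 4))
```

On real logs the recovered value would be biased by sensor noise and unmodeled effects such as stiction, which is one reason the fitted simulator is still paired with randomization around the identified values.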
24.13.4 Real-World Fine-Tuning
The third pillar is fine-tuning on real-world demonstrations. The recipe is straightforward: take the sim-trained policy, collect 100-1,000 real teleop demonstrations on the deployment robot, fine-tune the policy on the real data with a small learning rate. The result is a policy that retains the broad capability from sim training but adapts to the specific quirks of the deployment hardware. For VLAs, the LoRA recipe from Section 24.2 applies directly. For LLM planners the fine-tuning step is usually unnecessary because the planner is already operating at the abstraction layer above the dynamics gap.
| Recipe | Sim-only success (%) | Real success (%) | Gap (pp) |
|---|---|---|---|
| Pure sim training (no DR) | 92 | 34 | 58 |
| Sim + domain randomization | 89 | 63 | 26 |
| Sim + DR + system ID | 89 | 72 | 17 |
| Sim + DR + 100 real demos fine-tune | 87 | 78 | 9 |
| Sim + DR + 1000 real demos fine-tune | 86 | 84 | 2 |
| Real-only training (10k demos) | n/a | 81 | n/a |
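The Gap column is simply sim success minus real success. A derived figure that is not in the table, the share of the no-mitigation gap that each recipe closes, makes the rows easier to compare. The sketch below recomputes both from the table's numbers; the "closed" metric is an illustrative definition, not one the source reports.

```python
# (recipe, sim success %, real success %) copied from the table above
rows = [
    ("Pure sim training (no DR)", 92, 34),
    ("Sim + domain randomization", 89, 63),
    ("Sim + DR + system ID", 89, 72),
    ("Sim + DR + 100 real demos fine-tune", 87, 78),
    ("Sim + DR + 1000 real demos fine-tune", 86, 84),
]

baseline_gap = rows[0][1] - rows[0][2]  # 58 pp with no mitigation


def gap_report(rows):
    """Return (recipe, gap in pp, percent of the baseline gap closed)."""
    report = []
    for name, sim, real in rows:
        gap = sim - real
        closed = 100.0 * (baseline_gap - gap) / baseline_gap
        report.append((name, gap, round(closed, 1)))
    return report


for name, gap, closed in gap_report(rows):
    print(f"{name:38s} gap={gap:2d} pp  closed={closed:5.1f}%")
```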
24.13.5 Hardware-in-the-Loop Validation
The final piece is the validation harness. A production deployment cannot push a new policy version directly to customer-facing robots; it needs a hardware-in-the-loop (HIL) stage where the policy runs on real hardware in a controlled lab environment against a regression test suite of tasks. The 2026 best practice is a CI/CD pipeline where: (a) each policy commit triggers a sim regression test, (b) sim-passing commits proceed to a HIL test on a single lab robot, (c) HIL-passing commits proceed to a canary deployment on 1-5 percent of fleet robots, (d) canary-passing commits roll out to the full fleet. This is the same canary-deployment pattern used in software engineering, transplanted to robotics.
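The four-step promotion flow can be sketched as a chain of gates where a commit stops at the first failure. This is a minimal illustration: the stage names mirror steps (a) to (c), and the gate callables are hypothetical stand-ins for whatever actually runs the sim suite, the lab robot, and the canary monitoring.

```python
from typing import Callable

# Gates a commit must pass, in order, before a full-fleet rollout.
GATED_STAGES = ("sim_regression", "hil_lab_robot", "canary_fleet")


def promote(commit: str, gates: dict[str, Callable[[str], bool]]):
    """Run the gates in order; return (rolled_out, log of (stage, passed))."""
    log = []
    for stage in GATED_STAGES:
        passed = gates[stage](commit)
        log.append((stage, passed))
        if not passed:
            return False, log  # later, more expensive stages never run
    return True, log


# Hypothetical gates: commit "abc123" fails on the lab robot.
gates = {
    "sim_regression": lambda commit: True,
    "hil_lab_robot": lambda commit: commit != "abc123",
    "canary_fleet": lambda commit: True,
}
print(promote("abc123", gates))
print(promote("def456", gates))
```

Ordering the gates from cheapest to most expensive means a commit that fails in simulation never consumes robot-hours.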
# Sketch of a HIL regression test harness.
from dataclasses import dataclass
from typing import Callable


@dataclass
class HilTask:
    name: str
    instruction: str
    scene_setup_fn: Callable[[], None]    # physically configures the lab
    success_check_fn: Callable[[], bool]  # observes the lab after the run
    timeout_s: float = 120.0


def run_hil_regression(policy, tasks: list[HilTask], required_pass_rate=0.85):
    if not tasks:
        raise ValueError("HIL regression needs at least one task")
    results = []
    for task in tasks:
        task.scene_setup_fn()
        policy.run(task.instruction, timeout=task.timeout_s)
        success = task.success_check_fn()
        results.append((task.name, success))
    pass_rate = sum(s for _, s in results) / len(results)
    # Raise explicitly: a bare assert is skipped when Python runs with -O.
    if pass_rate < required_pass_rate:
        raise RuntimeError(
            f"HIL regression failed: {pass_rate:.2%} < {required_pass_rate:.2%}"
        )
    return results
The scene-setup function in Code Fragment 24.13.3a is the practical bottleneck: how do you reset the lab to a known state between tasks? Three patterns exist. (a) Robot-resets-itself: a separate reset policy puts objects back where they belong; works for ~50 percent of tasks. (b) Human-resets: a lab technician resets between batches; expensive, and the limiting factor on test cadence. (c) Disposable scenes: each task uses fresh objects; expensive in materials but reliable. The choice between these is one of the under-appreciated costs in a robotics team's budget.
24.13.6 Deployment Patterns in 2026
The deployment playbook that has stabilized in 2026 has five stages: (1) develop the policy in pure simulation with aggressive domain randomization; (2) calibrate simulator parameters to the specific deployment hardware via system identification; (3) fine-tune on 100-1000 real demonstrations from the deployment robot; (4) validate on a HIL regression suite; (5) canary deploy and monitor a small fraction of the fleet before full rollout. The same five stages apply whether you are deploying a VLA, a SayCan planner, or a hybrid stack.
| Stage | Compute cost | Real-robot time | Engineering team size |
|---|---|---|---|
| 1. Sim training with DR | 100-1000 GPU-days | 0 | 2-3 ML engineers |
| 2. System ID | 1-10 GPU-days | ~10 hours | 1 robotics engineer |
| 3. Real-data fine-tune | 1-10 GPU-days | 40-200 hours teleop | 1 ML + 1-2 teleoperators |
| 4. HIL validation | ~0 | 1-10 robot-hours / commit | 1 SRE |
| 5. Canary deploy | ~0 | continuous monitoring | 1 SRE + on-call |
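Summing the one-off stages of the table gives a rough budget envelope for a single deployment. The sketch below uses only the table's own ranges and leaves out stages 4 and 5, whose costs recur per commit or continuously rather than once.

```python
# (stage, GPU-days low, GPU-days high, robot-hours low, robot-hours high)
# One-off costs copied from the table above; "~10 hours" is taken as 10.
one_off = [
    ("Sim training with DR", 100, 1000, 0, 0),
    ("System ID", 1, 10, 10, 10),
    ("Real-data fine-tune", 1, 10, 40, 200),
]

gpu_low = sum(row[1] for row in one_off)
gpu_high = sum(row[2] for row in one_off)
robot_low = sum(row[3] for row in one_off)
robot_high = sum(row[4] for row in one_off)

print(f"GPU-days:    {gpu_low}-{gpu_high}")      # 102-1020
print(f"Robot-hours: {robot_low}-{robot_high}")  # 50-210
```

Sim training dominates the compute in this tally, and teleoperation for the fine-tune dominates the one-off robot time.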
The goal the field is converging toward is "train entirely in simulation, deploy directly to any real robot with no real-world fine-tuning". As of Q1 2026 this is achievable for navigation in structured indoor environments, but not for manipulation. Two threads are pushing toward closing it. (a) Higher-fidelity simulators (NVIDIA Isaac Sim's GPU-accelerated cloth and fluid simulation, and Genesis, released in late 2024 by a large multi-institution academic collaboration) reduce the dynamics gap. (b) Vision foundation models pretrained on real-world images and fine-tuned to interpret rendered images close the visual gap. The optimistic forecast is that "zero-shot manipulation" becomes practical in 2027-2028 for simple tasks; pessimistic forecasts see the contact-rich gap persisting much longer.
Key Takeaway
The sim-to-real gap decomposes into visual, contact-dynamics, actuator-latency, and sensor-noise errors. The 2026 deployment playbook closes the gap in five stages: domain-randomized sim training, system identification, real-data fine-tuning (100-1000 demos), HIL regression testing, and canary deployment. In the recipe comparison above, the combined pipeline raises real-world success from 34 percent (pure sim training) to 78-84 percent, using 100-1000 real demonstrations where pure-real training needed 10,000 to reach 81 percent.
Objective
Load an OpenVLA-7B checkpoint (trained on the Open X-Embodiment dataset, Padalkar et al. 2023) and run it on a simulated pick-and-place task in the LIBERO benchmark. Measure success rate over 50 trial rollouts under two task variants (in-distribution and a held-out object category). The point is to feel the sim2real gap from the inside: how a frontier VLA performs in simulation, and where the manipulation-policy failure modes from this section actually show up.
Setup
You need a CUDA-capable GPU, the OpenVLA-7B checkpoint (openvla/openvla-7b on Hugging Face, paper at arXiv:2406.09246), and the LIBERO benchmark (libero-project.github.io). Budget roughly 16 GB of VRAM or more for a 7B model in 16-bit precision, since the weights alone occupy about 14 GB; 8-bit or 4-bit quantization lowers the requirement at some cost in accuracy or speed. For learners without that much VRAM, the smaller RT-1-X policy and the simpler ManiSkill2 environments are an alternative.
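A back-of-envelope check on GPU memory helps when choosing a precision. The sketch below counts the weights only, assuming a nominal 7 billion parameters; activations, the image encoder's buffers, and framework overhead come on top, so treat the results as lower bounds.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```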
pip install transformers accelerate torch torchvision robosuite libero gymnasium
Steps
- Install LIBERO and pick two task suites: the libero_spatial suite for in-distribution pick-and-place (training data has seen the object categories) and libero_object for the out-of-distribution variant (new object set).
- Load OpenVLA-7B in fp16 (or int8 if VRAM is tight). The model takes an RGB image and a natural-language instruction and produces a 7-DoF action.
- Run 25 rollouts in each suite. Each rollout is up to 400 environment steps; success is gripper contact with the target object plus successful placement in the target zone within the episode horizon.
- Tally success rate and the failure breakdown. The interesting categorization is grasp failure (touched but did not lift), trajectory failure (dropped en route), and placement failure (released in the wrong location). Each category points at a different stage of the VLA pipeline.
- Apply domain randomization at inference by perturbing the camera pose by plus or minus 5 cm and rerunning. The drop in success rate is the empirical version of the sim2real gap analysis from subsection 24.13.1.
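The tallying step can be sketched as follows, using only the standard library. The outcome labels follow the three failure categories in the steps, and the rollout records are made-up placeholders for illustration, not measured results. In the real lab they would be appended inside the rollout loop.

```python
import csv
import io
from collections import Counter

# Placeholder records; real ones come from the 25 rollouts per suite.
rollouts = [
    {"suite": "libero_spatial", "trial": 0, "outcome": "success"},
    {"suite": "libero_spatial", "trial": 1, "outcome": "grasp_failure"},
    {"suite": "libero_object", "trial": 0, "outcome": "placement_failure"},
    {"suite": "libero_object", "trial": 1, "outcome": "success"},
    {"suite": "libero_object", "trial": 2, "outcome": "trajectory_failure"},
]


def summarize(rollouts):
    """Per-suite success rate and failure-category counts."""
    summary = {}
    for suite in sorted({r["suite"] for r in rollouts}):
        outcomes = Counter(r["outcome"] for r in rollouts if r["suite"] == suite)
        total = sum(outcomes.values())
        failures = {k: v for k, v in outcomes.items() if k != "success"}
        summary[suite] = {
            "success_rate": outcomes["success"] / total,
            "failures": failures,
        }
    return summary


def to_csv(rollouts):
    """Serialize per-rollout records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["suite", "trial", "outcome"])
    writer.writeheader()
    writer.writerows(rollouts)
    return buffer.getvalue()


print(summarize(rollouts))
print(to_csv(rollouts))
```

Running the same summary on the baseline rollouts and on the camera-perturbed rerun gives the two success rates whose difference the last step asks for.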
Expected Output
A CSV of per-rollout success flags and failure categories. Published OpenVLA results on LIBERO-spatial report success rates around 75 to 85 percent in-distribution and 50 to 65 percent on LIBERO-object out-of-distribution; the camera-perturbation rerun typically drops both numbers 10 to 15 percentage points, which is the closest single-experiment proxy for the sim2real gap.
Extension
Fine-tune OpenVLA-7B with 10 demonstrations on a single LIBERO-object task using LoRA, and re-evaluate; the standard finding is that a few-shot fine-tune recovers most of the out-of-distribution gap at a small fraction of the cost of training a new policy.
Continue to Section 25.1: Platforms.
Chapter 39 closes the LLM-robotics arc that Chapter 24 began. Chapter 24 moves to world models, the next abstraction layer that lets a policy predict the consequences of its actions before executing them. World models, sim-to-real, and VLA all share a common substrate (next-frame prediction); seeing the connection is what makes 2026 robotics feel coherent rather than a grab bag of techniques.