"Flow matching for robots: the diffusion paper from 2022 finally found its physical body in 2025."
Synth, Flow-Matched AI Agent
Physical Intelligence's pi-0 (Black et al., 2024) and its 2025 successor pi-0.5 replaced the discrete-token action head of OpenVLA and RT-2 with a flow-matching action expert that emits continuous trajectories. The trunk is still a vision-language model (a PaliGemma derivative), but instead of treating actions as discrete symbols (256 bins, or 8 bits, per dimension), pi-0 learns a vector field that transports Gaussian noise into a valid action chunk. This gives smoother trajectories, higher control rates, and, in pi-0.5, the first credible cross-task generalist, shown in house-cleaning demos.
Prerequisites
This section assumes the VLA equation from Section 24.1, the flow-matching objective from Section 19.7, and the action-tokenization vocabulary from the same chapter's opening section.
24.3.1 The Architecture: VLM Trunk plus Flow-Matching Expert
Physical Intelligence (pi) was founded in early 2024 by a group that includes Sergey Levine, Chelsea Finn, and Karol Hausman, researchers with backgrounds at Google's robotics teams, UC Berkeley, and Stanford. The pi-0 release in late 2024 reportedly went from internal demo to public announcement in roughly 90 days, an unusual speed for a robotics company. The flow-matching action head was a deliberate alternative to diffusion: a 50-step DDIM sampler per action chunk is too slow for real-time control.
pi-0 is structurally a two-part model. The first part is a vision-language trunk (a roughly 3B-parameter PaliGemma variant) that ingests one or more camera frames plus an instruction and emits a sequence of latent tokens, the same way OpenVLA does. The second part is the novel piece: an "action expert" that conditions on those latent tokens and runs a small flow-matching network to produce a chunk of 50 continuous actions, which covers about one second of motion at a 50 Hz control rate. The flow-matching expert has its own transformer layers (about 300M parameters, much smaller than the VLM trunk), and its output is a 50 x D array of actions, where D is the robot's action dimension (typically 7 for end-effector control or 14-26 for whole-body humanoid control).
| Block | Role | Parameters | Output |
|---|---|---|---|
| VLM trunk (PaliGemma-Mix-3B) | Pixels + language to latent context | ~3B | Latent tokens (sequence) |
| Action expert (flow-matching head) | Latent context to action chunk | ~300M | 50 timesteps x D continuous actions |
| Robot-specific normalization (LoRA) | Per-embodiment fine-tuning | ~30M per robot | Calibrated continuous actions |
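How much motion one chunk covers depends on the rate at which the robot consumes actions, which the table does not fix. The sketch below is plain bookkeeping, not part of any pi-0 code: `ChunkSpec` and the two control rates are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSpec:
    """Shape and duration of one action chunk (illustrative helper)."""
    horizon: int       # actions per chunk (50 in the text)
    action_dim: int    # D, the robot's action dimension
    control_hz: float  # rate at which the robot consumes actions (assumed)

    @property
    def shape(self):
        return (self.horizon, self.action_dim)

    @property
    def duration_s(self):
        return self.horizon / self.control_hz


arm = ChunkSpec(horizon=50, action_dim=7, control_hz=50.0)
bimanual = ChunkSpec(horizon=50, action_dim=14, control_hz=20.0)

print(arm.shape, arm.duration_s)            # (50, 7) 1.0
print(bimanual.shape, bimanual.duration_s)  # (50, 14) 2.5
```

The same 50-action chunk spans one second on a 50 Hz robot and two and a half seconds on a 20 Hz robot, so the horizon is best read as a count of actions rather than a fixed time window.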
24.3.2 Flow Matching in One Page
Flow matching (Lipman et al., 2023, arXiv:2210.02747) is a close cousin of diffusion. The setup is identical: there is a distribution p_1 over valid action chunks (the training data) and a distribution p_0 over Gaussian noise. The model learns a time-dependent vector field v_theta(a, tau, context) that transports samples from p_0 to p_1 as tau goes from 0 to 1. The training objective is the conditional flow-matching loss:
$$L_{\text{CFM}}(\theta) = \mathbb{E}\left[ \left\| v_\theta(a_\tau, \tau, \text{context}) - (a_1 - a_0) \right\|^2 \right]$$
where a_0 is a noise sample, a_1 is a training action chunk, and a_tau = (1 - tau) * a_0 + tau * a_1 is the linear interpolant. The target the network must match is just the difference a_1 - a_0, which is the derivative of a_tau with respect to tau and therefore the constant velocity of the straight-line path. At inference time you sample a_0 ~ N(0, I), then integrate the learned vector field from tau = 0 to tau = 1 in 5 to 10 Euler steps to land on an approximate sample from p_1. The whole procedure fits on one page of code:
```python
import torch
import torch.nn as nn


class FlowMatchingActionExpert(nn.Module):
    """Pi-0-style action expert. Inputs: vision-language context. Output: action chunk."""

    def __init__(self, trunk, action_dim=7, horizon=50):
        super().__init__()
        # Small transformer (~300M) mapping (B, L, hidden) -> (B, L, hidden)
        # and exposing a `hidden_size` attribute.
        self.trunk = trunk
        self.action_dim = action_dim
        self.horizon = horizon
        hidden = self.trunk.hidden_size
        self.time_embed = nn.Linear(1, hidden)
        # Actions must be projected to the trunk width before they can be
        # concatenated with the context tokens, and projected back afterwards.
        self.action_in = nn.Linear(action_dim, hidden)
        self.action_out = nn.Linear(hidden, action_dim)

    def velocity(self, a_tau, tau, vl_context):
        # Concatenate frozen vision-language context, time embedding, and
        # embedded noised actions along the sequence axis.
        t_emb = self.time_embed(tau.unsqueeze(-1))  # (B, hidden)
        x = torch.cat([vl_context, t_emb.unsqueeze(1), self.action_in(a_tau)], dim=1)
        # The last `horizon` positions correspond to the action tokens.
        return self.action_out(self.trunk(x)[:, -self.horizon:])

    def loss(self, a_1, vl_context):
        B = a_1.shape[0]
        tau = torch.rand(B, device=a_1.device)  # tau ~ U(0, 1), one per sample
        a_0 = torch.randn_like(a_1)             # noise sample
        a_tau = (1 - tau[:, None, None]) * a_0 + tau[:, None, None] * a_1
        target = a_1 - a_0                      # straight-line velocity
        pred = self.velocity(a_tau, tau, vl_context)
        return ((pred - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, vl_context, n_steps=8):
        # Sample noise, then integrate the velocity field with forward Euler.
        a = torch.randn(vl_context.shape[0], self.horizon, self.action_dim,
                        device=vl_context.device)
        dt = 1.0 / n_steps
        for step in range(n_steps):
            tau = torch.full((a.shape[0],), step * dt, device=a.device)
            a = a + dt * self.velocity(a, tau, vl_context)
        return a  # shape (B, horizon, action_dim)
```
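One way to build intuition for the Euler sampler is to replace the learned network with a velocity field that is known in closed form. The sketch below assumes a deliberately trivial target distribution, a point mass at a single action vector, so that the exact field is (target - a) / (1 - tau). It is a toy in pure Python, not pi-0's learned field, and the names `optimal_velocity` and `euler_sample` are made up for this example.

```python
import random


def optimal_velocity(a, tau, target):
    """Exact flow-matching velocity when p_1 is a point mass at `target`.

    On the path a_tau = (1 - tau) * a_0 + tau * target, the conditional
    velocity target - a_0 equals (target - a_tau) / (1 - tau).
    """
    return [(t - x) / (1.0 - tau) for x, t in zip(a, target)]


def euler_sample(velocity, a0, n_steps):
    """Integrate da/dtau = velocity(a, tau) from tau = 0 to tau = 1."""
    a = list(a0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        tau = step * dt  # always < 1, so the field above never divides by zero
        v = velocity(a, tau)
        a = [x + dt * vx for x, vx in zip(a, v)]
    return a


random.seed(0)
target = [0.3, -0.1, 0.7]                      # hypothetical 3-D action
a0 = [random.gauss(0.0, 1.0) for _ in target]  # a_0 ~ N(0, I)

sample = euler_sample(lambda a, tau: optimal_velocity(a, tau, target), a0, n_steps=8)
error = max(abs(s - t) for s, t in zip(sample, target))
print(error)
```

Because the path is a straight line and the velocity along it is constant, Euler integration is exact here up to floating-point error, whatever the step count. A learned field over a real action distribution is only approximately straight, which is why a handful of steps is needed rather than one.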
Standard diffusion (DDPM, score matching) trains on noised samples at random noise levels and typically needs 25 to 250 denoising steps at inference. Flow matching with the linear interpolant trains on the same kind of noised samples but learns the velocity rather than the score, and because the conditional paths are straight lines, Euler integration with 5-10 steps suffices. In a robotics control loop, 5 versus 50 forward passes per chunk is the difference between 20 Hz and 2 Hz (at roughly 10 ms per pass); flow matching is what makes diffusion-style policies practical for real-time control. The Diffusion Policy work of Chi et al. (2023, arXiv:2303.04137) trains with 100 denoising steps and needs an accelerated DDIM sampler to bring inference down to the 10-16 step range; pi-0 stays within the 5-10 step range without one.
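The 20 Hz versus 2 Hz figures follow from simple arithmetic once a per-pass latency is fixed. The sketch below assumes 10 ms per action-expert forward pass, which is the value implied by those two numbers rather than a measured one, and it ignores the one-off cost of encoding the images and prompt.

```python
def chunk_rate_hz(n_steps, pass_latency_s):
    """Chunks per second if each integration step costs one forward pass."""
    return 1.0 / (n_steps * pass_latency_s)


PASS_LATENCY_S = 0.010  # assumed: 10 ms per action-expert forward pass

rates = {n: chunk_rate_hz(n, PASS_LATENCY_S) for n in (5, 10, 50)}
for n_steps, hz in rates.items():
    print(f"{n_steps:>2} steps -> {hz:.1f} chunks/s")
```

Each chunk holds 50 actions, so even at 2 chunks per second the robot is not starved of commands. What the slower rate costs is reactivity: the policy sees fresh observations only twice a second.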
24.3.3 The Training Mixture and the Data Question
pi-0 trains on a custom mixture that Physical Intelligence assembled by combining a large internal teleoperation dataset (estimated ~10,000 hours across several humanoid and bimanual platforms), portions of Open X-Embodiment, and a few hundred hours of "task-specific" data for the dexterous tasks the model is meant to handle. The total training corpus is several times the size of OpenVLA's, and roughly half of it is unreleased. This is one of two reasons pi-0 outperforms OpenVLA on harder benchmarks; the other is the architectural improvement of the flow head. Disentangling the two contributions is hard from public information, but the pi-0 technical report's ablations suggest the architectural piece accounts for ~30 percent of the benchmark gap and the data piece accounts for ~70 percent. In 2026 robotics, data still buys more than architecture does.
Unlike OpenVLA, pi-0 is not open-weights as of early 2026. Physical Intelligence has released a technical report, a partial open-source training stack (the OpenPI repo), and an evaluation harness, but the trained checkpoint is gated behind a research-access program. This is the largest practical obstacle to replicating pi-0 results outside Physical Intelligence's labs and is why many academic groups still benchmark against OpenVLA. The OpenPI repo lets you train your own version, but matching the production pi-0 quality requires Physical Intelligence's unreleased teleop data.
24.3.4 pi-0.5: Cross-Task Generalization
pi-0.5, released in late 2025, is pi-0 plus a substantially larger training mixture and a few architectural tweaks aimed at cross-task generalization rather than cross-robot generalization. The headline result is that pi-0.5 can perform a long-horizon household task it has never seen during training, given only a natural-language instruction and a few seconds of front-camera footage of the scene. The published demos include tidying a kitchen counter, folding laundry, and unloading a dishwasher, all with the same model checkpoint and without per-task finetuning. Internally, pi-0.5 keeps the same flow-matching head but adds a "task-conditioning" pathway: a high-level planner LLM (a 7B model that runs separately) produces a structured plan as a JSON object, which is encoded into a sequence of plan tokens that join the latent context. The flow expert then conditions on both the visual context and the plan tokens.
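To make the task-conditioning pathway concrete, here is a toy sketch of turning a JSON plan into a token sequence and appending it to the latent context. Everything in it is hypothetical: the plan schema, the byte-level tokenizer, and the choice to append plan tokens after the visual tokens are illustrative assumptions, not the published pi-0.5 design.

```python
import json


def encode_plan(plan):
    """Toy tokenizer: one token id per UTF-8 byte of the canonical JSON string."""
    text = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return list(text.encode("utf-8"))


def build_context(visual_tokens, plan_tokens):
    """Append plan tokens after the visual tokens (ordering is an assumption)."""
    return list(visual_tokens) + list(plan_tokens)


# Hypothetical plan as a high-level planner might emit it.
plan = {"task": "tidy counter", "steps": ["pick up cup", "place cup in sink"]}

plan_tokens = encode_plan(plan)
context = build_context(visual_tokens=[1001, 1002, 1003], plan_tokens=plan_tokens)

# The plan survives the round trip, so the encoding loses no information.
decoded = json.loads(bytes(plan_tokens).decode("utf-8"))
print(len(context), decoded == plan)
```

In a real model the context would hold embeddings rather than integer ids, and the plan would go through the trunk's own tokenizer. The point of the sketch is only that the plan becomes extra conditioning positions alongside the visual context.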
| Capability | pi-0 (Q4 2024) | pi-0.5 (Q4 2025) |
|---|---|---|
| Action representation | Continuous flow-matching | Continuous flow-matching |
| Trunk | PaliGemma-Mix-3B | PaliGemma-Mix-3B + plan tokens |
| Training data (estimated) | ~10k hours teleop + OXE | ~25k hours teleop + OXE + synthetic |
| Best long-horizon task | ~5 min coffee-making | ~30 min full kitchen tidy |
| Zero-shot to unseen task | Marginal | Practical (~60% success) |
| Humanoid support | Limited (1 robot) | 4 humanoid platforms |
pi-0.5 splits the policy into a high-level planner (separate LLM, runs at ~1 Hz) and a low-level VLA (the flow-matching expert, runs at ~20 Hz). The opposite approach is the pure end-to-end model (RT-2-X, the next section), where one transformer ingests an instruction and emits actions without an intermediate plan representation. As of 2026 the field is split: Physical Intelligence and the SayCan school (Chapter 39) bet on hierarchy; Google DeepMind and the AutoRT line (Section 24.4) bet on end-to-end. Both work; the trade-offs are about interpretability, debugging, and latency rather than raw capability.
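The two-rate structure can be sketched as a single loop in which the planner is refreshed once for every 20 low-level ticks. The functions `toy_planner` and `toy_low_level` are hypothetical stand-ins for the planner LLM and the flow-matching policy, and the loop emits one action per tick for simplicity, whereas a real system would emit action chunks.

```python
def run_hierarchy(planner, low_level, n_ticks, low_hz=20, plan_hz=1):
    """Call `planner` every low_hz // plan_hz ticks and `low_level` on every tick."""
    ticks_per_plan = low_hz // plan_hz
    plan, log = None, []
    for tick in range(n_ticks):
        if tick % ticks_per_plan == 0:
            plan = planner(tick)           # slow path, ~1 Hz
        log.append(low_level(tick, plan))  # fast path, ~20 Hz
    return log


def toy_planner(tick):
    # Stand-in for the planner LLM: returns a label for the current subtask.
    return f"subtask-{tick // 20}"


def toy_low_level(tick, plan):
    # Stand-in for the low-level policy: records which plan it acted under.
    return (tick, plan)


log = run_hierarchy(toy_planner, toy_low_level, n_ticks=40)
print(log[0], log[19], log[20])
```

The low-level policy keeps acting under a stale plan for up to one planner period. That is the latency cost of the hierarchical design, paid in exchange for an inspectable intermediate plan.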
24.3.5 Running pi-0 via OpenPI
Physical Intelligence's OpenPI library is the open-source companion to pi-0. It includes the model architecture, training stack, and evaluation harness, plus a permissive license for the code (the trained weights are gated, but a smaller "pi-0-fast" demo checkpoint is openly available). The inference API is similar in shape to OpenVLA's but exposes the chunking horizon as a first-class concept:
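```python
from openpi.policies import policy_config
from openpi.shared import download
from openpi.training import config as cfg

# Load the public pi-0-fast checkpoint (smaller demo, weights released).
# The config name and bucket path are the ones quoted in this section;
# check them against the OpenPI README before running.
train_config = cfg.get_config("pi0_fast_aloha")
checkpoint_dir = download.maybe_download("gs://openpi-assets/pi0_fast_aloha")
policy = policy_config.create_trained_policy(train_config, checkpoint_dir)

# The camera frames, joint state, and `robot` handle come from your own robot
# stack. The exact observation keys depend on the chosen config's input transforms.
obs = {
    "image": {
        "cam_high": third_person_frame,       # np.uint8 (H, W, 3)
        "cam_wrist_left": left_wrist_frame,
        "cam_wrist_right": right_wrist_frame,
    },
    "state": joint_positions,                 # np.float32 (D,)
    "prompt": "fold the towel and place it in the basket",
}

action_chunk = policy.infer(obs)["actions"]   # shape (50, D), float32

# Execute the first 10 actions, then re-query (receding horizon).
for a in action_chunk[:10]:
    robot.execute(a, control_hz=50)
```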
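The receding-horizon pattern in the last lines generalizes to a loop: query a chunk, execute a prefix, observe again, repeat. The sketch below replaces the real policy and robot with hypothetical stubs (`toy_infer` and `ToyRobot`) so it runs without OpenPI or hardware, and shows only the control flow.

```python
def receding_horizon(infer, get_obs, execute, n_queries, execute_steps=10):
    """Re-plan every `execute_steps` actions instead of running whole chunks."""
    for _ in range(n_queries):
        chunk = infer(get_obs())
        for action in chunk[:execute_steps]:
            execute(action)


class ToyRobot:
    """One-dimensional stand-in for a robot: actions are position deltas."""

    def __init__(self):
        self.position = 0.0
        self.executed = 0

    def observe(self):
        return {"state": self.position}

    def execute(self, action):
        self.position += action
        self.executed += 1


def toy_infer(obs, horizon=50, goal=1.0):
    # Stand-in policy: a chunk of equal steps that would reach `goal`
    # if all `horizon` actions were executed.
    step = (goal - obs["state"]) / horizon
    return [step] * horizon


robot = ToyRobot()
receding_horizon(toy_infer, robot.observe, robot.execute, n_queries=5)
print(robot.executed, round(robot.position, 5))  # 50 0.67232
```

Executing 10 of 50 actions per query closes 20 percent of the remaining gap each time, so the toy robot approaches the goal geometrically. The trade-off in a real system is the same: a shorter executed prefix gives fresher observations at the cost of more inference calls.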
24.3.6 What Changed When the Action Head Became Continuous
Switching from discrete action tokens to a continuous flow-matching head produces three practical effects. First, trajectories are smoother: discrete bins introduce a small but visible chatter at the bin boundaries, especially during slow precise motions, which flow matching does not have. Second, the dexterity ceiling rises: 14- and 26-DOF whole-body humanoid control is unwieldy with discrete tokens (you would emit 26 tokens per action, multiplied by 50 timesteps per chunk, which is 1,300 autoregressive decoding passes per chunk), whereas the flow-matching head denoises all 50 x D values in parallel, so the number of forward passes stays at the handful of integration steps regardless of D. Third, the model is harder to interpret: you cannot inspect the action vocabulary distribution to see "the policy is 60 percent sure it wants to grasp here versus there"; the velocity field does not expose marginals as naturally.
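Both the pass count and the bin-boundary chatter can be checked with a few lines of arithmetic. The sketch below assumes 256 uniform bins over a normalized range of [-1, 1], in the style of RT-2-like tokenizers. The range and the slow ramp are illustrative choices, not values from any specific robot.

```python
def autoregressive_passes(action_dim, horizon):
    """One decoding pass per token when every action dimension is its own token."""
    return action_dim * horizon


def bin_width(low, high, n_bins):
    """Smallest command change a single token step can express."""
    return (high - low) / n_bins


def quantize(x, low, high, n_bins):
    """Map x to the centre of its uniform bin, clamping to the valid index range."""
    idx = int((x - low) / (high - low) * n_bins)
    idx = min(n_bins - 1, max(0, idx))
    return low + (idx + 0.5) * bin_width(low, high, n_bins)


print(autoregressive_passes(26, 50))  # 1300
print(bin_width(-1.0, 1.0, 256))      # 0.0078125

# A slow, smooth ramp: 100 commands spread over about 0.02 normalized units.
ramp = [i * 0.0002 for i in range(100)]
levels = sorted({quantize(x, -1.0, 1.0, 256) for x in ramp})
print(len(levels))                    # 3
```

One hundred distinct commands collapse onto three quantized levels, so the executed motion is a staircase with two abrupt jumps rather than a ramp. A continuous head has no such floor on resolution.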
Pick pi-0 (or its OpenPI demo) if your robot is high-DOF (a humanoid or a 14-DOF bimanual setup), if you need smooth contact-rich motion (folding cloth, pouring liquid), or if you have the connections to get research access to the production checkpoint. Pick OpenVLA if your robot is a standard 7-DOF arm, if you want open weights and easy debugging, or if you need to LoRA-finetune on a few hundred demos. As of 2026, hobbyists and most academic labs default to OpenVLA; commercial dexterous-manipulation startups default to pi-0 or its successors.
Key Takeaway
pi-0 keeps the VLM trunk of OpenVLA but replaces the discrete action head with a flow-matching expert that integrates Gaussian noise into a 50-step continuous action chunk in ~8 Euler steps. This unlocks smooth high-DOF control, dexterous manipulation, and (in pi-0.5) cross-task generalization on long-horizon household tasks. The trade-off is reduced interpretability and a closed-weights production checkpoint; the OpenPI demo gives you a runnable approximation.
Continue to Section 24.4: RT-2-X & the Data-Scaling Story.
You have now seen two architectures (discrete tokens, flow matching). Section 24.4 steps back and asks the scaling-law question: how much does data matter compared to architecture? The RT-2-X result and the Open X-Embodiment data-scaling curves give a surprisingly clean answer.