Physical Intelligence pi-0 / pi-0.5

Section 24.3

"Flow matching for robots: the diffusion paper from 2022 finally found its physical body in 2025."

SynthSynth, Flow-Matched AI Agent
Big Picture

Physical Intelligence's pi-0 (Black et al., 2024) and its 2025 successor pi-0.5 replaced the discrete-token action head of OpenVLA and RT-2 with a flow-matching action expert that emits continuous trajectories. The trunk is still a vision-language model (a PaliGemma derivative), but instead of treating each action dimension as one of 256 discrete bins (an 8-bit symbol), pi-0 learns a vector field that flows Gaussian noise into a valid action chunk. This gives smoother trajectories, higher control rates, and, in pi-0.5, the first credible cross-task generalist, demonstrated on household-cleaning tasks.

Prerequisites

This section assumes the VLA equation from Section 24.1, the flow-matching objective from Section 19.7, and the action-tokenization vocabulary from this chapter's opening section.

24.3.1 The Architecture: VLM Trunk plus Flow-Matching Expert

Fun Fact

Physical Intelligence (pi) was founded in early 2024 by a group that includes Sergey Levine, Chelsea Finn, and Karol Hausman, researchers with backgrounds at UC Berkeley, Stanford, and Google's robotics teams. The pi-0 release in late 2024 reportedly went from internal demo to public announcement in roughly 90 days, an unusual speed for a robotics company. The flow-matching action head was a deliberate alternative to diffusion; the team reportedly felt that a 50-step DDIM sampler was too slow for real-time control.

pi-0 is structurally a two-headed model. The first head is a vision-language model (a roughly 3B-parameter PaliGemma variant, listed as PaliGemma-Mix-3B in Table 24.3.1) that ingests one or more camera frames plus an instruction and emits a sequence of latent tokens, the same way OpenVLA does. The second head is the novel piece: an "action expert" that conditions on those latent tokens and runs a small flow-matching network to produce a chunk of 50 continuous actions, which covers roughly one second of motion at a 50 Hz control rate. The flow-matching expert has its own transformer layers (about 300M parameters, much smaller than the VLM trunk), and its output is a 50 by D array of action vectors, where D is the robot's action dimension (typically 7 for end-effector control or 14-26 for bimanual and whole-body humanoid control).

| Block | Role | Parameters | Output |
|---|---|---|---|
| VLM trunk (PaliGemma-Mix-3B) | Pixels + language to latent context | ~3B | Latent tokens (sequence) |
| Action expert (flow-matching head) | Latent context to action chunk | ~300M | 50 timesteps x D continuous actions |
| Robot-specific normalization (LoRA) | Per-embodiment fine-tuning | ~30M per robot | Calibrated continuous actions |

Table 24.3.1: pi-0's two-headed architecture. The VLM trunk is reused across all robots and all tasks; the action expert is the only piece that emits motor commands; per-embodiment LoRA adapters re-normalize for each robot platform.
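
To make the output shape in Table 24.3.1 concrete, the sketch below computes how many scalars one chunk holds and how much motion it covers at a given control rate. The helper name `chunk_summary` and the example control rate are illustrative assumptions, not part of pi-0's code.

```python
def chunk_summary(horizon: int, action_dim: int, control_hz: float) -> dict:
    """Summarize one action chunk of shape (horizon, action_dim)."""
    return {
        "shape": (horizon, action_dim),
        "scalars": horizon * action_dim,          # numbers emitted per chunk
        "seconds_covered": horizon / control_hz,  # motion covered if fully executed
    }

# Horizon 50 as in the text; 7 = single arm, 14 = bimanual, 26 = whole-body.
for dim in (7, 14, 26):
    print(chunk_summary(horizon=50, action_dim=dim, control_hz=50.0))
```

The chunk duration depends only on the horizon and the control rate, while the action dimension changes the width of the output and nothing else.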

24.3.2 Flow Matching in One Page

Flow matching (Lipman et al., 2023, arXiv:2210.02747) is a close cousin of diffusion. The setup is the same as in diffusion: there is a distribution p_1 over valid action chunks (the training data) and a distribution p_0 over Gaussian noise. The model learns a time-dependent vector field v_theta(a, tau, context) that transports samples from p_0 to p_1 as tau goes from 0 to 1. The training objective is the conditional flow-matching loss:

$$L_{\text{CFM}}(\theta) = \mathbb{E}\left[ \left\| v_\theta(a_\tau, \tau, \text{context}) - (a_1 - a_0) \right\|^2 \right]$$

where a_0 is a noise sample, a_1 is a training action chunk, and a_tau = (1 - tau) * a_0 + tau * a_1 is the linear interpolant. The expectation is taken over a_0, a_1, and tau drawn uniformly from [0, 1]. The target the network must match is just the difference a_1 - a_0, the constant velocity of the straight-line path. At inference time you sample a_0 ~ N(0, I), then integrate the learned vector field forward in 5 to 10 Euler steps to land on an approximate sample from p_1. The whole procedure fits on one page of code.
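
As a short worked step (standard calculus, not specific to pi-0), differentiating the interpolant shows why the regression target is a_1 - a_0, and discretizing the resulting ODE gives the Euler update used at inference. The symbol N for the number of integration steps is introduced only for this sketch.

$$\frac{d a_\tau}{d\tau} = \frac{d}{d\tau}\left[ (1 - \tau)\, a_0 + \tau\, a_1 \right] = a_1 - a_0$$

$$a_{\tau + \Delta\tau} = a_\tau + \Delta\tau \; v_\theta(a_\tau, \tau, \text{context}), \qquad \Delta\tau = \frac{1}{N}, \quad \tau \in \{0, \Delta\tau, \dots, 1 - \Delta\tau\}$$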

import torch
import torch.nn as nn


class FlowMatchingActionExpert(nn.Module):
    """Pi-0-style action expert. Inputs: vision-language context. Output: action chunk."""

    def __init__(self, trunk, action_dim=7, horizon=50):
        super().__init__()
        # Small transformer (~300M). Assumed to expose `hidden_size` and to map
        # a (B, T, hidden) sequence to a (B, T, hidden) sequence.
        self.trunk = trunk
        self.action_dim = action_dim
        self.horizon = horizon
        hidden = self.trunk.hidden_size
        self.time_embed = nn.Linear(1, hidden)
        # Project actions into and out of the trunk width; without this the
        # action tokens cannot be concatenated with the context tokens.
        self.action_in = nn.Linear(action_dim, hidden)
        self.action_out = nn.Linear(hidden, action_dim)

    def velocity(self, a_tau, tau, vl_context):
        # Sequence layout: [vision-language context | time token | noised action tokens].
        t_emb = self.time_embed(tau.unsqueeze(-1)).unsqueeze(1)    # (B, 1, hidden)
        x = torch.cat([vl_context, t_emb, self.action_in(a_tau)], dim=1)
        h = self.trunk(x)[:, -self.horizon:]                       # action positions only
        return self.action_out(h)                                  # (B, horizon, action_dim)

    def loss(self, a_1, vl_context):
        # Conditional flow-matching loss with the linear interpolant.
        B = a_1.shape[0]
        tau = torch.rand(B, device=a_1.device)                     # tau ~ U[0, 1]
        a_0 = torch.randn_like(a_1)                                # noise endpoint
        a_tau = (1 - tau[:, None, None]) * a_0 + tau[:, None, None] * a_1
        target = a_1 - a_0                                         # straight-line velocity
        pred = self.velocity(a_tau, tau, vl_context)
        return ((pred - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, vl_context, n_steps=8):
        # Sample noise, then integrate the velocity field from tau = 0 to 1 with Euler.
        a = torch.randn(vl_context.shape[0], self.horizon, self.action_dim,
                        device=vl_context.device)
        dt = 1.0 / n_steps
        for step in range(n_steps):
            tau = torch.full((a.shape[0],), step * dt, device=a.device)
            a = a + dt * self.velocity(a, tau, vl_context)
        return a     # shape (B, horizon, action_dim)
Code Fragment 24.3.1a: A compact pi-0-style flow-matching action expert. Training is one MSE loss; inference is eight Euler steps. The contrast with the autoregressive token loop of OpenVLA is stark: a 7-DOF arm needs 7 transformer forward passes per action (one per DOF), or 350 per chunk of 50 actions, whereas pi-0 takes a fixed 8 forward passes per chunk, roughly 44x fewer.
Key Insight: Why flow matching beats diffusion in robotics

Standard diffusion (DDPM, score-matching) trains on noised samples at random noise levels and requires 25 to 250 denoising steps at inference. Flow matching with the linear interpolant trains on the same noised samples but learns the velocity rather than the score, and the straight-line interpolant means Euler with 5-10 steps suffices. In a robotics control loop, 5 vs 50 forward passes per chunk is the difference between 20 Hz and 2 Hz; flow matching is what makes diffusion-style policies practical for real-time control. The Diffusion Policy work of Chi et al. (2023, arXiv:2303.04137) used up to ~100 denoising steps (fewer with DDIM acceleration); pi-0 uses 8.
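
The effect of the step count can be checked on a one-dimensional toy problem where the flow-matching velocity field is known in closed form, so no network is needed. The sketch assumes p_0 = N(0, 1), p_1 = N(mu, sigma^2), independent sampling of the two endpoints, and the linear interpolant; under those assumptions the marginal velocity is the Gaussian conditional mean computed in `marginal_velocity`. The function names and the values mu = 2.0 and sigma = 0.5 are illustrative. This is not a model of pi-0's learned field.

```python
import random

def marginal_velocity(x: float, tau: float, mu: float, sigma: float) -> float:
    """Closed-form E[a_1 - a_0 | a_tau = x] for a_0 ~ N(0, 1), a_1 ~ N(mu, sigma^2)."""
    var_tau = (1.0 - tau) ** 2 + (tau * sigma) ** 2   # Var[a_tau]
    cov = tau * sigma ** 2 - (1.0 - tau)              # Cov[a_1 - a_0, a_tau]
    return mu + cov / var_tau * (x - tau * mu)

def euler_sample(x0: float, n_steps: int, mu: float, sigma: float) -> float:
    """Integrate dx/dtau = v(x, tau) from tau = 0 to 1 with n_steps Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x += dt * marginal_velocity(x, k * dt, mu, sigma)
    return x

def sample_stats(n_steps: int, mu: float = 2.0, sigma: float = 0.5,
                 n: int = 5000, seed: int = 0):
    """Push n Gaussian noise samples through the flow; return (mean, std)."""
    rng = random.Random(seed)
    xs = [euler_sample(rng.gauss(0.0, 1.0), n_steps, mu, sigma) for _ in range(n)]
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return mean, std

for steps in (1, 8, 64):
    mean, std = sample_stats(steps)
    print(f"{steps:>3} Euler steps: mean={mean:.3f} std={std:.3f} (target 2.000, 0.500)")
```

In this toy, a single Euler step sends every sample to mu, so the mean is right but the spread collapses; adding steps recovers the spread, and a handful of steps still leaves it somewhat below the target. Few-step integration is therefore a speed-versus-fidelity trade-off, and how well 5 to 10 steps work on real action data is an empirical question.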

24.3.3 The Training Mixture and the Data Question

pi-0 trains on a custom mixture that Physical Intelligence assembled by combining a large internal teleoperation dataset (estimated ~10,000 hours across several humanoid and bimanual platforms), portions of Open X-Embodiment, and a few hundred hours of "task-specific" data for the dexterous tasks pi-0.5 is meant to handle. The total training corpus is several times the size of OpenVLA's, and roughly half of it is unreleased. This is one of two reasons pi-0 outperforms OpenVLA on harder benchmarks; the other is the architectural improvement of the flow head. Disentangling the two contributions is hard from public information, but the pi-0 technical report's ablations suggest the architectural piece accounts for ~30 percent of the benchmark gap and the data piece accounts for ~70 percent. In 2026 robotics, data still matters more than architecture.

Warning: The open-weights gap on pi-0

Unlike OpenVLA, pi-0 is not open-weights as of early 2026. Physical Intelligence has released a technical report, a partial open-source training stack (the OpenPI repo), and an evaluation harness, but the trained checkpoint is gated behind a research-access program. This is the largest practical obstacle to replicating pi-0 results outside Physical Intelligence's labs and is why many academic groups still benchmark against OpenVLA. The OpenPI repo lets you train your own version, but matching the production pi-0 quality requires Physical Intelligence's unreleased teleop data.

24.3.4 pi-0.5: Cross-Task Generalization

pi-0.5, released in late 2025, is pi-0 plus a substantially larger training mixture and a few architectural tweaks aimed at cross-task generalization rather than cross-robot generalization. The headline result is that pi-0.5 can perform a long-horizon household task it has never seen during training, given only a natural-language instruction and a few seconds of front-camera footage of the scene. The published demos include tidying a kitchen counter, folding laundry, and unloading a dishwasher, all with the same model checkpoint and without per-task finetuning. Internally, pi-0.5 keeps the same flow-matching head but adds a "task-conditioning" pathway: a high-level planner LLM (a 7B model that runs separately) produces a structured plan as a JSON object, which is encoded into a sequence of plan tokens that join the latent context. The flow expert then conditions on both the visual context and the plan tokens.
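
The plan-token pathway can be sketched as three steps: serialize the plan, turn it into tokens, and append those tokens to the context the flow expert attends to. Everything in the sketch is hypothetical: the plan schema, the byte-level "tokenizer", and the function names `encode_plan` and `build_context` are invented for illustration, and the real encoding is not described in this section.

```python
import json

def encode_plan(plan: dict) -> list[int]:
    """Toy byte-level 'tokenizer': canonical JSON string -> list of byte values."""
    text = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return list(text.encode("utf-8"))

def build_context(visual_tokens: list[int], plan_tokens: list[int]) -> list[int]:
    """Append plan tokens to the visual-language context sequence."""
    return visual_tokens + plan_tokens

# Hypothetical plan schema, invented for illustration only.
plan = {
    "task": "tidy counter",
    "steps": [
        {"skill": "pick", "object": "mug"},
        {"skill": "place", "target": "sink"},
    ],
}
plan_tokens = encode_plan(plan)
context = build_context(visual_tokens=[0] * 16, plan_tokens=plan_tokens)
print(len(plan_tokens), len(context))
```

The design point is that the flow expert never parses JSON; it only sees a longer conditioning sequence, so the planner can be swapped without changing the action head.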

| Capability | pi-0 (Q4 2024) | pi-0.5 (Q4 2025) |
|---|---|---|
| Action representation | Continuous flow-matching | Continuous flow-matching |
| Trunk | PaliGemma-Mix-3B | PaliGemma-Mix-3B + plan tokens |
| Training data (estimated) | ~10k hours teleop + OXE | ~25k hours teleop + OXE + synthetic |
| Best long-horizon task | ~5 min coffee-making | ~30 min full kitchen tidy |
| Zero-shot to unseen task | Marginal | Practical (~60% success) |
| Humanoid support | Limited (1 robot) | 4 humanoid platforms |

Table 24.3.1b: pi-0 versus pi-0.5. The architecture is mostly stable across the version bump; the gains come from data scale and the addition of an explicit planning conditioning path.
Research Frontier: The "hierarchy" debate

pi-0.5 splits the policy into a high-level planner (separate LLM, runs at ~1 Hz) and a low-level VLA (the flow-matching expert, runs at ~20 Hz). The opposite approach is the pure end-to-end model (RT-2-X, the next section), where one transformer ingests an instruction and emits actions without an intermediate plan representation. As of 2026 the field is split: Physical Intelligence and the SayCan school (Chapter 39) bet on hierarchy; Google DeepMind and the AutoRT line (Section 24.4) bet on end-to-end. Both work; the trade-offs are about interpretability, debugging, and latency rather than raw capability.
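
The two-rate structure can be sketched as a single loop that ticks at the controller rate and refreshes the plan every few ticks. This is a scheduling illustration only: `run_hierarchy` is a hypothetical name, the planner and flow expert are replaced by counters, and the 1 Hz and 20 Hz defaults are the rates quoted above.

```python
def run_hierarchy(duration_s: float, planner_hz: int = 1, controller_hz: int = 20):
    """Simulate which plan is active at each tick of the fast low-level loop."""
    ticks_per_plan = controller_hz // planner_hz
    plan_id, log = -1, []
    for tick in range(int(duration_s * controller_hz)):
        if tick % ticks_per_plan == 0:
            plan_id += 1                 # stand-in for one planner LLM call
        log.append((tick, plan_id))      # stand-in for one flow-expert step
    return log

log = run_hierarchy(duration_s=3.0)
print(len(log), "controller ticks,", log[-1][1] + 1, "plans")
```

Each plan stays active for `ticks_per_plan` controller ticks, which is where the interpretability benefit (a plan you can read) and the latency cost (a plan that can go stale) both come from.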

24.3.5 Running pi-0 via OpenPI

Physical Intelligence's OpenPI library is the open-source companion to pi-0. It includes the model architecture, training stack, and evaluation harness, plus a permissive license for the code (the trained weights are gated, but a smaller "pi-0-fast" demo checkpoint is openly available). The inference API is similar in shape to OpenVLA's but exposes the chunking horizon as a first-class concept:

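from openpi.policies import policy_config
from openpi.shared import download
from openpi.training import config as cfg

# Load the public pi-0-fast checkpoint (smaller demo, weights released).
# create_trained_policy takes a config object rather than a name, so resolve
# the config first. The config name and bucket path are the ones used in this
# section; check both against the OpenPI README before running.
train_config = cfg.get_config("pi0_fast_aloha")
checkpoint_dir = download.maybe_download("gs://openpi-assets/pi0_fast_aloha")
policy = policy_config.create_trained_policy(train_config, checkpoint_dir)

# The frames, joint_positions, and robot below come from your own camera and
# robot drivers; they are placeholders here. The exact observation keys depend
# on the selected config.
obs = {
    "image": {
        "cam_high": third_person_frame,      # np.uint8 (H, W, 3)
        "cam_wrist_left": left_wrist_frame,
        "cam_wrist_right": right_wrist_frame,
    },
    "state": joint_positions,                  # np.float32 (D,)
    "prompt": "fold the towel and place it in the basket",
}
action_chunk = policy.infer(obs)["actions"]   # shape (50, D), float32

# Execute the first 10 actions, then re-query (receding horizon).
for a in action_chunk[:10]:
    robot.execute(a, control_hz=50)            # hypothetical robot interface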
Code Fragment 24.3.2: Calling pi-0-fast through OpenPI. The 50-step action chunk at 50 Hz covers one second of motion; re-querying every 200 ms (10 actions) gives the receding-horizon behavior from Section 24.1. JAX rather than PyTorch is a Physical Intelligence stylistic choice; the math is identical.
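
The receding-horizon pattern in the caption can be isolated from the robot and the model. The sketch below uses a stub policy and a no-op executor, both hypothetical, so it runs without OpenPI or hardware; `receding_horizon` is an illustrative helper, not an OpenPI function.

```python
def receding_horizon(policy, get_obs, execute, n_queries: int, execute_steps: int = 10):
    """Query a chunk, run only its first `execute_steps` actions, then re-query."""
    executed = []
    for _ in range(n_queries):
        chunk = policy(get_obs())              # sequence of actions, length = horizon
        for action in chunk[:execute_steps]:
            execute(action)
            executed.append(action)
    return executed

# Stub policy, purely illustrative: each "action" is (query_index, step_index).
state = {"queries": 0}

def stub_policy(obs):
    q = state["queries"]
    state["queries"] += 1
    return [(q, t) for t in range(50)]

trace = receding_horizon(stub_policy, get_obs=lambda: None,
                         execute=lambda a: None, n_queries=3)
print(len(trace), trace[9], trace[10])
```

Only the first 10 of every 50 predicted actions are ever executed; the remaining 40 are discarded, which is the price paid for reacting to fresh observations every 200 ms.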

24.3.6 What Changed When the Action Head Became Continuous

Switching from discrete action tokens to a continuous flow-matching head produces three practical effects. First, trajectories are smoother: discrete bins introduce a small but visible chatter at the bin boundaries, especially during slow precise motions, which flow matching does not have. Second, the dexterity ceiling rises: 14- and 26-DOF whole-body humanoid control is unwieldy with discrete tokens (at 26 DOF you would emit 26 tokens per action, multiplied by 50 timesteps per chunk, which is 1,300 forward passes per control decision), but the number of flow-matching forward passes does not grow with the action dimension. Third, the model is harder to interpret: you cannot inspect the action vocabulary distribution to see "the policy is 60 percent sure it wants to grasp here versus there"; the velocity field does not expose marginals as naturally.
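
The forward-pass arithmetic behind the second effect is easy to tabulate. The sketch counts passes only, under the simplifying assumptions that an autoregressive head spends one forward pass per action token and that a flow-matching head spends one pass per Euler step; it ignores model size, caching, and any parallel decoding, and the function names are illustrative.

```python
def autoregressive_passes(action_dim: int, horizon: int) -> int:
    """One forward pass per action token: action_dim tokens for each timestep."""
    return action_dim * horizon

def flow_passes(n_euler_steps: int) -> int:
    """One action-expert pass per Euler step, independent of action_dim and horizon."""
    return n_euler_steps

for dim in (7, 14, 26):
    ar, fm = autoregressive_passes(dim, 50), flow_passes(8)
    print(f"D={dim:>2}: {ar} vs {fm} passes per chunk, ratio {ar / fm:.1f}x")
```

The ratio grows linearly with the action dimension, which is why the gap matters far more for humanoids than for a single 7-DOF arm.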

Real-World Scenario: When to pick pi-0 over OpenVLA

Pick pi-0 (or its OpenPI demo) if your robot is high-DOF (a humanoid or a 14-DOF bimanual setup), if you need smooth contact-rich motion (folding cloth, pouring liquid), or if you have the connections to get research access to the production checkpoint. Pick OpenVLA if your robot is a standard 7-DOF arm, if you want open weights and easy debugging, or if you need to LoRA-finetune on a few hundred demos. As of 2026, hobbyists and most academic labs default to OpenVLA; commercial dexterous-manipulation startups default to pi-0 or its successors.

Key Takeaway

pi-0 keeps the VLM trunk of OpenVLA but replaces the discrete action head with a flow-matching expert that integrates Gaussian noise into a 50-step continuous action chunk in ~8 Euler steps. This unlocks smooth high-DOF control, dexterous manipulation, and (in pi-0.5) cross-task generalization on long-horizon household tasks. The trade-off is reduced interpretability and a closed-weights production checkpoint; the OpenPI demo gives you a runnable approximation.

Self-Check
Q1: Why does flow matching require only 5-10 Euler steps at inference while DDPM diffusion requires 25-250 denoising steps? Tie your answer to the choice of straight-line interpolant.
Answer:
DDPM is trained against a curved noise schedule, so the sampling trajectory implied by the learned score is curved in action space; coarse Euler integration drifts off that curve and produces artifacts, forcing 25 to 250 small steps. Flow matching with the linear interpolant $a_\tau = (1-\tau) a_0 + \tau a_1$ makes the target velocity along each training path constant: it is simply $a_1 - a_0$. A constant velocity is integrated exactly by a single Euler step. The learned field averages over many noise-data pairs and is therefore only approximately straight, which is why practice uses 5 to 10 steps rather than one. The geometric simplicity of the path, not the architecture, is what shrinks the step count.
Q2: You are training a 26-DOF humanoid VLA. Estimate the per-chunk forward-pass count under (a) OpenVLA-style discrete tokens with horizon 50 and (b) pi-0-style flow matching with 8 Euler steps. Which is cheaper, and by what factor?
Answer:
(a) OpenVLA emits one token per DOF per timestep, so a horizon-50 chunk requires 26 times 50 = 1,300 forward passes through the transformer trunk. (b) pi-0 produces the entire chunk in 8 Euler steps regardless of action dimension, so 8 forward passes. Flow matching is cheaper by a factor of 1,300 / 8 = 162.5, or roughly 160x, in raw forward-pass count. Model size also matters (pi-0's flow expert is much smaller than OpenVLA's Llama-based trunk), so the wall-clock improvement is likely even larger. This scaling difference is why discrete-token VLAs are impractical for high-DOF humanoids and flow matching has become the default for that regime.
Q3: pi-0.5 splits planning (LLM, 1 Hz) from action generation (flow expert, 20 Hz). What latency-versus-reactivity trade-off does this introduce, and how would you mitigate the worst-case scenario where the planner emits an outdated plan?
Answer:
The planner running at 1 Hz means up to 1 second of staleness between scene change and updated plan; the flow expert at 20 Hz can react quickly to local perturbations but cannot revise the high-level strategy until the planner produces a new JSON plan. The worst case is a scene change that invalidates the plan (someone moves the target object) while the planner is still computing. Mitigations: (1) run a fast change-detector that triggers an early planner re-query when scene state diverges from the assumed state; (2) give the flow expert a small set of fallback primitives (stop, retreat, re-grasp) it can execute autonomously while waiting for a fresh plan; (3) keep the planner's last K plans cached so the flow expert can switch to a previously valid plan rather than blocking. These mitigations turn a hard latency floor into a graceful degradation.
What's Next

Continue to Section 24.4: RT-2-X & the Data-Scaling Story.

You have now seen two architectures (discrete tokens, flow matching). Section 24.4 steps back and asks the scaling-law question: how much does data matter compared to architecture? The RT-2-X result and the Open X-Embodiment data-scaling curves give a surprisingly clean answer.

Further Reading
Black, K., et al. (2024). pi-0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Technical Report.
Physical Intelligence. (2025). pi-0.5: Scaling Robotic Foundation Models to Household Tasks. Technical Report.
Lipman, Y., et al. (2023). Flow Matching for Generative Modeling. ICLR 2023, arXiv:2210.02747.
Chi, C., et al. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. IJRR 2024, arXiv:2303.04137.
Beyer, L., et al. (2024). PaliGemma: A Versatile 3B VLM for Transfer. arXiv:2407.07726.
Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024, arXiv:2310.08864.