Section 24.2: OpenVLA-7B Reference Implementation

"Open the weights, read the tokenizer, run the policy. That is the whole pedagogy in one sentence."
Pip, Reference-Implementation-Reader AI Agent

Big Picture

OpenVLA-7B (Kim et al., 2024, arXiv:2406.09246) is the first open-weights generalist VLA and is the easiest concrete model to study end to end. It pairs a Llama-2 7B backbone with a fused SigLIP+DINOv2 vision encoder, trains on the 970k-demonstration Open X-Embodiment mixture, and ships a working PyTorch reference plus a Hugging Face checkpoint. This section opens its hood: the architecture, the data, the action tokenizer, and the inference cost. By the end you will understand what every weight tensor in the public checkpoint does.

Prerequisites

This section assumes the VLA equation and action tokenization from Section 24.1, the VLM backbone fundamentals from Section 22.1, and the LoRA fine-tuning recipe from Section 13.3.

24.2.1 The Architecture from Pixels to Action Tokens

OpenVLA stitches together four components: a fused vision encoder, a vision-to-language projection, an action-aware tokenizer, and a Llama-2 7B trunk, most of them initialized from pretrained checkpoints. Visual input is a single third-person RGB frame at 224 by 224 pixels (no wrist camera in the released checkpoint, although the data recipe supports one). The fused vision encoder concatenates, channel-wise, the patch features from a SigLIP-SO400M tower and a DINOv2-Large tower, on the theory that SigLIP's semantic understanding and DINOv2's spatial precision are complementary. A learned projection maps the fused vision feature (2176-D: 1152 from SigLIP plus 1024 from DINOv2) into the 4096-D hidden size of the Llama-2 7B backbone. The instruction string is BPE-tokenized with the LLaMA tokenizer, in which 256 rarely used vocabulary entries are repurposed as action tokens (Section 24.1 introduces the general scheme; Section 24.2.3 covers the released implementation). The decoder-only transformer trunk autoregresses over the unified sequence, emitting one action token per action dimension.

Component	Source	Parameters	Trained from scratch?	Frozen during finetune?
SigLIP-SO400M vision tower	Google SigLIP (Zhai et al., 2023)	0.4B	No	No (full finetune)
DINOv2-Large vision tower	Meta DINOv2 (Oquab et al., 2024)	0.3B	No	No (full finetune)
Vision -> LM projection	OpenVLA	~6M	Yes	No
Llama-2 7B trunk	Meta Llama-2 (Touvron et al., 2023)	7B	No	No (full finetune)
Action token rows (embedding and LM head)	Llama-2 vocabulary, repurposed by OpenVLA	~2M	No (reuses 256 existing rows)	No

Table 24.2.1: OpenVLA-7B parameter inventory. Total parameters: 7.6B. The only weights initialized from scratch are those of the projection layer; the action tokens occupy 256 existing rows of the embedding and LM-head matrices (Section 24.2.3), and everything else inherits from pretrained checkpoints.
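To make the pixels-to-tokens path concrete, here is a minimal NumPy sketch of the fusion step: per-patch features from two towers are concatenated along the channel axis and projected to the language-model width. The sizes are toy values chosen for readability, and a single linear layer stands in for the learned projection; both are assumptions for illustration, not the released implementation.

```python
import numpy as np

# Toy sizes, chosen only to keep the example small (assumptions, not OpenVLA's real widths).
N_PATCHES = 4      # patch tokens per image
D_SEMANTIC = 6     # width of the "semantic" tower (stands in for SigLIP)
D_SPATIAL = 4      # width of the "spatial" tower (stands in for DINOv2)
D_MODEL = 8        # hidden size of the language-model trunk

rng = np.random.default_rng(0)


def fuse_and_project(semantic_tokens, spatial_tokens, weight, bias):
    """Concatenate per-patch features channel-wise, then project to the LM width."""
    fused = np.concatenate([semantic_tokens, spatial_tokens], axis=-1)
    return fused @ weight + bias


semantic = rng.normal(size=(N_PATCHES, D_SEMANTIC))
spatial = rng.normal(size=(N_PATCHES, D_SPATIAL))
weight = rng.normal(size=(D_SEMANTIC + D_SPATIAL, D_MODEL))
bias = np.zeros(D_MODEL)

visual_embeddings = fuse_and_project(semantic, spatial, weight, bias)
print(visual_embeddings.shape)  # (4, 8): one LM-width embedding per patch
```

Because the concatenation is channel-wise, the number of visual tokens stays equal to the number of patches; only the per-token width grows before projection.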

Key Insight: Why fuse two vision encoders

SigLIP is trained with a sigmoid contrastive loss on image-text pairs and excels at semantic categorization ("this is a red mug"). DINOv2 is trained with self-distillation and produces sharper spatial features ("the mug is at pixel (130, 90)"). Manipulation needs both: the policy must know what object to grasp and where in the image plane it lives. Concatenating the two streams adds ~0.3B parameters but improves OpenVLA's success rate on the BridgeData-V2 benchmark by ~12 percentage points over either encoder alone. This is a recurring pattern in 2026 robotics: semantic-plus-spatial fusion at the vision-encoder layer.

24.2.2 The Training Data: Open X-Embodiment

OpenVLA trains on the Open X-Embodiment mixture (Open X-Embodiment Collaboration, 2024, arXiv:2310.08864), a curated union of 60 robot datasets contributed by 21 institutions covering 22 distinct robot embodiments. After filtering for usable language annotations and trajectory quality, the OpenVLA training mixture contains roughly 970,000 trajectories. Contributors range from large datasets such as the original RT-1 dataset (130k Everyday Robot trajectories) and BridgeData-V2 (60k WidowX trajectories) down to small ones such as Berkeley Cable Routing (1.5k Franka trajectories). Each trajectory is a sequence of (image, instruction, action) tuples recorded at 5-30 Hz, depending on the source dataset. The actions are normalized to a unified 7-D end-effector delta convention (3 translation, 3 rotation, 1 gripper), and this shared mapping is what makes cross-embodiment training tractable.

# Streaming the Open X-Embodiment mixture in RLDS format.
# Paraphrased sketch: the module path and argument names below are illustrative
# and may not match the released repository exactly.
from openvla.data.oxe import OXE_NAMED_MIXES, make_interleaved_dataset

mixture_spec = OXE_NAMED_MIXES["oxe_magic_soup"]   # the OpenVLA training mix
dataset = make_interleaved_dataset(
    mixture_spec,
    data_dir="gs://gresearch/robotics",
    train=True,
    window_size=1,            # single-step action prediction
    future_action_window_size=0,
    action_proprio_normalization_type="bounds_q99",   # scale by 1st/99th percentiles (Section 24.2.3)
    batch_size=32,
    shuffle_buffer_size=100_000,
)

# Inspect one batch to confirm the shapes the model will see.
for batch in dataset.take(1):
    print(batch["observation"]["image_primary"].shape)   # (32, 224, 224, 3)
    print(batch["task"]["language_instruction"].numpy()[0])
    print(batch["action"].shape)                       # (32, 1, 7)

Code Fragment 24.2.1: Streaming the Open X-Embodiment mixture for training. The "oxe_magic_soup" alias is the curated subset OpenVLA actually trained on, with hand-tuned per-dataset sampling weights that downweight noisy contributors and upweight high-quality ones. The action normalization step happens upstream; downstream code treats actions as a single 7-D vector regardless of the source robot.

Warning: The "magic soup" weight choice matters

The per-dataset sampling weights are not uniform. BridgeData-V2 and RT-1 together account for about 60 percent of training tokens despite making up a much smaller fraction of total trajectories, because they have the highest-quality language annotations and the most consistent camera viewpoints. Reweighting the mixture by hand was a multi-week effort and is one of the under-publicized engineering costs of training a generalist VLA. If you finetune OpenVLA on your own data, you are competing with that handcrafted balance; expect to spend significant time on data-quality work, not just on model code.
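The effect of non-uniform weights is easy to see with a little arithmetic. The sketch below assumes each dataset is sampled in proportion to its size times a hand-tuned weight; the dataset names, sizes, and weights are hypothetical and are not the released mixture.

```python
# Hypothetical dataset sizes (in trajectories) and hand-tuned sampling weights.
# The numbers are illustrative, not the released mixture.
datasets = {
    "large_clean": {"trajectories": 100_000, "weight": 1.0},
    "large_noisy": {"trajectories": 300_000, "weight": 0.2},
    "small_clean": {"trajectories": 10_000, "weight": 2.0},
}


def sampling_probabilities(spec):
    """Probability of drawing from each dataset: proportional to size times weight."""
    mass = {name: d["trajectories"] * d["weight"] for name, d in spec.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


probs = sampling_probabilities(datasets)
total_traj = sum(d["trajectories"] for d in datasets.values())
for name, d in datasets.items():
    share_of_data = d["trajectories"] / total_traj
    print(f"{name}: {share_of_data:.1%} of trajectories, {probs[name]:.1%} of samples")
```

In this toy mixture the noisy dataset holds most of the trajectories but supplies only a third of the training samples, which is the kind of gap between trajectory share and token share described above.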

24.2.3 The Action Tokenizer in the Released Checkpoint

OpenVLA's released action tokenizer follows the structure from Section 24.1, with two implementation details worth knowing. First, the bin boundaries are computed from the empirical 1st and 99th percentiles of each action dimension on the training data, not from a hand-chosen [lo, hi] range. This spends the bin resolution where the data actually lives instead of on rare outliers. Second, the tokenizer reuses the 256 least-frequently-used tokens of the LLaMA vocabulary, which sit at the tail of the vocabulary, instead of growing the embedding table. Those 256 entries are obscure Unicode characters and rare BPE pieces that almost never appear in instruction strings, so overwriting them with action tokens caused no measurable text degradation.

# Simplified sketch of OpenVLA's action tokenizer (paraphrased; the released source differs in details).
import numpy as np
from transformers import PreTrainedTokenizerBase

class OpenVLAActionTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizerBase, bins: np.ndarray, n_action_dims: int = 7):
        self.tokenizer = tokenizer
        self.bins = bins                          # bin edges, shape (n_dims, n_bins + 1)
        self.n_action_dims = n_action_dims
        # Use the last 256 tokens of LLaMA's base vocab for actions.
        # vocab_size excludes tokens added later (e.g. padding), unlike len(tokenizer).
        self.action_token_begin_idx = tokenizer.vocab_size - 256

    def __call__(self, action: np.ndarray) -> list[str]:
        # Quantize each dimension against its own interior edges -> bin index in [0, 255].
        discretized = np.array([
            np.digitize(a, self.bins[d, 1:-1]) for d, a in enumerate(action)
        ]).clip(0, 255)
        # Map each bin index to a token id in the repurposed tail of the vocabulary.
        token_ids = self.action_token_begin_idx + discretized
        return self.tokenizer.batch_decode(token_ids.tolist())

    def decode(self, token_ids: np.ndarray) -> np.ndarray:
        # Invert the mapping: token id -> bin index -> center of that dimension's bin.
        bin_idx = token_ids - self.action_token_begin_idx
        centers = 0.5 * (self.bins[:, :-1] + self.bins[:, 1:])
        return centers[np.arange(self.n_action_dims), bin_idx]

Code Fragment 24.2.2: A simplified sketch of the OpenVLA action tokenizer. The clever trick is reusing the tail of the existing LLaMA vocabulary, which avoids any embedding-table surgery. As a result, the released model loads through transformers.AutoModelForVision2Seq (with trust_remote_code=True) without monkey-patching.
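The quantile-based binning described in this section can be checked in isolation, without a tokenizer. The following sketch builds bin edges from the 1st and 99th percentiles of synthetic actions, quantizes one action, and decodes it back to bin centers. The synthetic data and helper names are assumptions made for illustration.

```python
import numpy as np

N_BINS = 256


def quantile_bin_edges(actions, n_bins=N_BINS):
    """Uniform bin edges between each dimension's 1st and 99th percentile."""
    lo = np.percentile(actions, 1, axis=0)
    hi = np.percentile(actions, 99, axis=0)
    # Shape (n_dims, n_bins + 1): one row of edges per action dimension.
    return np.linspace(lo, hi, n_bins + 1, axis=-1)


def encode(action, edges):
    """Bin index in [0, n_bins - 1] for each dimension of a single action."""
    idx = [np.digitize(a, edges[d, 1:-1]) for d, a in enumerate(action)]
    return np.asarray(idx)


def decode(bin_idx, edges):
    """Map bin indices back to the centers of the corresponding bins."""
    centers = 0.5 * (edges[:, :-1] + edges[:, 1:])
    return centers[np.arange(len(bin_idx)), bin_idx]


rng = np.random.default_rng(0)
demo_actions = rng.normal(scale=0.05, size=(10_000, 7))   # synthetic stand-in for training actions
edges = quantile_bin_edges(demo_actions)

# Keep the example inside the binned range; values outside it saturate at the end bins.
action = np.clip(demo_actions[0], edges[:, 0], edges[:, -1])
recovered = decode(encode(action, edges), edges)
half_bin = 0.5 * (edges[:, 1] - edges[:, 0])
print(np.all(np.abs(recovered - action) <= half_bin + 1e-12))
```

Inside the percentile range the round-trip error is bounded by half a bin width per dimension, so narrowing the range to where the data lives directly reduces quantization error.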

24.2.4 Running the Model: The Six-Line Inference Loop

The OpenVLA Hugging Face checkpoint loads through the standard transformers Auto classes (with trust_remote_code=True to fetch the model class from the Hub), which is a strong signal that "VLA equals LLM with motor tokens" is a true description and not a marketing slogan. The full inference loop fits in a screen of code:

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

image = Image.open("frame.png")
prompt = "In: What action should the robot take to place the red block on the plate?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)   # np.ndarray of shape (7,) ready to ship to the robot controller

Output: [ 0.024 -0.018 0.041 0.003 0.012 -0.007 1.000]

Code Fragment 24.2.3: The full OpenVLA inference loop. The unnorm_key argument selects which robot's normalization stats to invert (each contributing dataset has its own scale), and is the only robotics-specific knob in this call sequence. Everything else is the standard Hugging Face vision-to-sequence API.
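What "inverting the normalization stats" means can be sketched in a few lines. The example assumes normalization is a linear map that sends each dimension's [q01, q99] range to [-1, 1]; the statistics dictionary, its key, and its values are hypothetical, and the released code may treat some dimensions (such as the gripper) differently.

```python
import numpy as np

# Hypothetical per-dataset statistics, keyed the way unnorm_key selects them.
# Real statistics ship with the checkpoint; these values are made up for illustration.
NORM_STATS = {
    "my_robot": {
        "q01": np.array([-0.03, -0.03, -0.03, -0.10, -0.10, -0.10, 0.0]),
        "q99": np.array([0.03, 0.03, 0.03, 0.10, 0.10, 0.10, 1.0]),
    }
}


def normalize(action, stats):
    """Map raw actions so that [q01, q99] lands on [-1, 1]."""
    return 2.0 * (action - stats["q01"]) / (stats["q99"] - stats["q01"]) - 1.0


def unnormalize(normalized, stats):
    """Invert normalize(): map [-1, 1] back to the robot's own units."""
    return 0.5 * (normalized + 1.0) * (stats["q99"] - stats["q01"]) + stats["q01"]


stats = NORM_STATS["my_robot"]
raw = np.array([0.01, -0.02, 0.0, 0.05, 0.0, -0.05, 1.0])
round_trip = unnormalize(normalize(raw, stats), stats)
print(np.allclose(round_trip, raw))
```

Choosing the wrong key applies another robot's scale to the decoded bins, which yields plausible-looking but mis-sized motions rather than an error.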

24.2.5 Inference Performance and the Quantization Frontier

OpenVLA-7B in bfloat16 runs at about 6 Hz on a single A100 80GB, which is acceptable for slow-manipulation tasks but too slow for any task involving moving objects or human collaboration. The community has spent the last year optimizing this. Three results matter. First, INT4 quantization with bitsandbytes (the same recipe from Chapter 16 on QLoRA) brings inference to 11 Hz with under 1 percent success-rate degradation on standard benchmarks. Second, speculative decoding with a small draft model (a 350M-param policy distilled from OpenVLA) bumps wall-clock throughput to 18 Hz. Third, ahead-of-time TensorRT compilation of the full pipeline gives another 1.4x, pushing throughput past 25 Hz, which is fast enough for most contact-rich manipulation.

Configuration	Hardware	Throughput (Hz)	BridgeData-V2 success
bfloat16, eager	A100 80GB	6.1	72.5%
bfloat16, FlashAttention-2	A100 80GB	8.4	72.5%
INT4 (bitsandbytes NF4)	A100 80GB	11.0	72.0%
INT4 + speculative (draft 350M)	A100 80GB	18.2	71.8%
INT4 + speculative + TensorRT-LLM	A100 80GB	25.6	71.5%
INT4 (bitsandbytes NF4)	RTX 4090 24GB	7.3	71.5%

Table 24.2.2: OpenVLA-7B inference throughput across optimization stages, each measured on a single GPU. The trajectory from 6 Hz to 25 Hz follows the same playbook as text LLM serving from Part X, applied unchanged to robotics.
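Throughput in Hz is easier to reason about as a per-action time budget. The short script below converts the A100 rows of the table above into milliseconds per action and a cumulative speedup over the eager baseline; it only rearranges numbers already in the table.

```python
# Throughput figures copied from the table above (Hz, single A100 80GB).
throughput_hz = {
    "bfloat16, eager": 6.1,
    "bfloat16, FlashAttention-2": 8.4,
    "INT4 (NF4)": 11.0,
    "INT4 + speculative": 18.2,
    "INT4 + speculative + TensorRT-LLM": 25.6,
}

baseline = throughput_hz["bfloat16, eager"]
latency_ms = {}
speedup = {}
for name, hz in throughput_hz.items():
    latency_ms[name] = 1000.0 / hz     # time consumed by one predicted action
    speedup[name] = hz / baseline      # cumulative gain over the eager baseline
    print(f"{name:36s} {latency_ms[name]:6.1f} ms/action  {speedup[name]:4.1f}x")
```

The fully optimized row works out to roughly 39 ms per action, about a 4.2x improvement over the 164 ms of the eager baseline.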

Real-World Scenario

Running OpenVLA on a 4090 in your basement

The INT4 quantized checkpoint fits in 14 GB of VRAM, which means it runs on a single consumer RTX 4090. With FlashAttention-2 and a JIT-compiled vision encoder you can sustain ~7 Hz, enough for a slow tabletop manipulation demo. This is the entire reason hobbyist robotics in 2026 looks the way it does: a $1,500 GPU plus a $400 ALOHA-clone arm gives you a working generalist robot policy. The bottleneck is no longer compute; it is data for your specific task.

24.2.6 Finetuning OpenVLA on Your Own Robot

The released checkpoint understands the union of training distributions but is not specialized to any one robot you might own. Finetuning is straightforward thanks to LoRA. The recipe (LoRA rank 32 on all attention projections, learning rate 5e-4, AdamW, batch size 16, two epochs of demonstration data) consistently lifts success rate by 15 to 30 percentage points on a new robot after a few thousand demonstrations. The OpenVLA team has open-sourced both the LoRA finetuning script and a separate full-finetuning recipe.

# LoRA finetuning of OpenVLA on a custom dataset, ~30 lines total.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor, Trainer, TrainingArguments

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach rank-32 adapters to the attention projections; the base weights stay frozen.
lora_cfg = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # a small fraction (well under 1%) of the 7.6B total

trainer = Trainer(
    model=model,
    train_dataset=my_robot_dataset,            # placeholder: your demonstrations, formatted with the OXE schema
    args=TrainingArguments(
        output_dir="openvla-lora-out",         # placeholder checkpoint directory
        per_device_train_batch_size=16,
        learning_rate=5e-4,
        num_train_epochs=2,
        bf16=True,
        gradient_checkpointing=True,
    ),
)
trainer.train()

Code Fragment 24.2.4: LoRA finetuning of OpenVLA fits in 24 GB VRAM with gradient checkpointing. With tens of millions of trainable parameters, two epochs over ~1k demonstrations take ~6 hours on a single A100 and yield a robot policy that beats the off-the-shelf checkpoint on your hardware. Because the base weights stay in bfloat16, this is plain LoRA (Section 13.3); the QLoRA recipe from Chapter 16 would additionally quantize the frozen base to 4 bits.
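The size of the adapter can be estimated by hand. The sketch below counts LoRA parameters for rank 32 on the four attention projections of the language trunk only, using the standard Llama-2 7B dimensions (32 layers, hidden size 4096). The number reported by print_trainable_parameters can differ, depending on which modules PEFT actually matches in the loaded model.

```python
def lora_param_count(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Llama-2 7B dimensions: 32 decoder layers, hidden size 4096.
N_LAYERS = 32
HIDDEN = 4096
RANK = 32
ATTENTION_PROJECTIONS = ["q_proj", "k_proj", "v_proj", "o_proj"]  # each is HIDDEN x HIDDEN

per_layer = len(ATTENTION_PROJECTIONS) * lora_param_count(HIDDEN, HIDDEN, RANK)
total = N_LAYERS * per_layer
print(f"{total:,} adapter parameters")   # 33,554,432
```

Doubling the rank doubles this count, so the adapter stays well under one percent of the base model for any rank in the usual range.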

Fun Fact: A robot policy that fits in a Docker image

The OpenVLA INT4 checkpoint plus its tokenizer plus the inference wrapper is 4.2 GB. You can ship it as part of a single Docker image alongside your ROS 2 stack. In 2024, "the robot brain" was a research artifact bound to a lab's CUDA cluster. In 2026, "the robot brain" is a container you docker-pull from Hugging Face, alongside Postgres and nginx, and your robot's CI runs on GitHub Actions. The de-mystification of robotics through the LLM stack is mostly a story about packaging.
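A back-of-envelope check makes the size claim plausible. The sketch counts bytes for the weights alone, using the 7.6B total from Table 24.2.1; it ignores quantization metadata, any modules kept in higher precision, and the tokenizer and wrapper files, all of which add to the figure on disk.

```python
def weight_bytes(n_params, bits_per_param):
    """Storage for the weights alone, ignoring quantization metadata and activations."""
    return n_params * bits_per_param / 8


N_PARAMS = 7.6e9   # total from Table 24.2.1

sizes_gb = {}
for label, bits in [("bfloat16", 16), ("INT4", 4)]:
    sizes_gb[label] = weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{label:9s} {sizes_gb[label]:5.1f} GB")
```

The INT4 weights come to about 3.8 GB, so a 4.2 GB package leaves a few hundred megabytes for everything else.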

24.2.7 Where OpenVLA Falls Short

OpenVLA is the right starting point but is not state of the art. It uses a single third-person camera (no wrist cam), predicts a single action per forward pass (no chunking), and trains entirely on data from arm-and-gripper manipulators (no humanoids, no quadrupeds, no soft hands). Its absolute success rates on the harder Open X-Embodiment benchmarks are still in the 30-50 percent range for unseen objects, and its dexterity ceiling, set by the 7-D end-effector action space, excludes any task that requires multi-finger coordination. The next two sections cover the models that have pushed past those limits: pi-0 in Section 24.3 via flow-matching and continuous action heads, and the RT-2-X data-scaling story in Section 24.4.

Key Takeaway

OpenVLA-7B is a Llama-2 7B trunk with a fused SigLIP+DINOv2 vision encoder, finetuned on 970k Open X-Embodiment demonstrations, with the last 256 LLaMA vocabulary slots repurposed as action bins. It loads through stock Hugging Face APIs, runs at 6 Hz on an A100 (25 Hz with INT4 + speculative + TensorRT), and LoRA-finetunes for your robot in 6 hours on a single A100. It is the reference implementation you should keep open in a tab while reading the rest of this chapter.

Self-Check

Q1: Why does OpenVLA reuse the last 256 LLaMA vocabulary tokens for actions instead of growing the embedding table? Name two practical consequences of that choice.

Answer

Growing the embedding table means adding and initializing new rows (typically Gaussian or zero) in both the embedding and LM-head matrices, which changes their shapes relative to the base Llama-2 checkpoint and adds a resize step to every loading and export path. Repurposing existing tail tokens preserves the original tensor shapes, so no embedding-table surgery is needed. Two practical consequences: (1) downstream tools that assume the stock LLaMA vocabulary size keep working, including bitsandbytes quantization, TensorRT-LLM export, and speculative decoding; (2) the action tokens start from rows that already carry pretrained embedding statistics rather than from a randomly seeded slice, which gives training a non-degenerate starting point.

Q2: You finetune OpenVLA on 500 demonstrations of a 6-DOF Franka arm (no wrist camera). What two pieces of the released pipeline must you modify before training will succeed?

Answer

First, the action interface. The tokenizer assumes the 7-D end-effector delta convention (3 translation, 3 rotation, 1 gripper). If your logged actions use a different layout, for example six values with no gripper channel, either convert them to the 7-D convention or adjust n_action_dims and the bin edges so that each dimension maps to its own quantile-based bins. Second, the normalization statistics. The released checkpoint only knows the per-dataset statistics selected by unnorm_key, and your robot is not among them, so you must compute the 1st and 99th percentiles on your own demonstrations and use them for both training and inference. The missing wrist camera requires no change, because the released checkpoint already takes a single third-person frame (Section 24.2.1). Both modifications are small but mandatory; skipping either tends to fail quietly, with shape mismatches or badly scaled actions rather than a clear error.

Q3: OpenVLA at 25 Hz on TensorRT-LLM still has a 40 ms forward-pass latency. Sketch why the policy nevertheless cannot react to a sudden obstacle appearing 100 ms before contact. (Hint: think about action chunking.)

Answer

OpenVLA as released predicts a single action per forward pass, so this scenario applies once the policy is run with action chunking, as with the chunked policies covered later in this chapter. If the policy emits action chunks of horizon H=8 and replans every 200 ms, the robot follows an open-loop trajectory between replans. An obstacle that appears 100 ms before contact can fall within the open-loop window, so the policy has already committed to a trajectory that does not account for the new obstacle. Forward-pass latency is only one component of reactivity; the dominant component is the chunk-execution window. The mitigation is either smaller chunks (more frequent replans at higher inference cost) or a separate fast reactive safety layer (a 1 kHz collision-avoidance controller) that can preempt the chunk when a new sensor reading invalidates the plan.
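The timing argument reduces to one addition. The sketch below uses the numbers from the question and answer above and assumes the worst case, in which the obstacle appears just after an observation was captured.

```python
def worst_case_reaction_ms(replan_interval_ms, forward_latency_ms):
    """Longest delay between an event and the first action computed after seeing it.

    Worst case: the event occurs just after an observation was captured, so the
    policy waits a full replan interval, then spends one forward pass on the new frame.
    """
    return replan_interval_ms + forward_latency_ms


# Numbers from the question and answer above.
reaction = worst_case_reaction_ms(replan_interval_ms=200, forward_latency_ms=40)
time_to_contact_ms = 100
print(reaction, reaction > time_to_contact_ms)   # 240 True
```

Even with an instantaneous forward pass, the 200 ms replan interval alone exceeds the 100 ms available, which is why the chunk window, not inference latency, dominates.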

What's Next

Continue to Section 24.3: Physical Intelligence pi-0 / pi-0.5.

You now know what a generalist VLA looks like at the source-code level. Section 24.3 jumps to Physical Intelligence's pi-0 and pi-0.5, which replace OpenVLA's discrete-token action head with a flow-matching network that emits continuous trajectories. The architectural delta is small but the trade-off space is interestingly different, and pi-0.5 has become the de facto reference for dexterous manipulation in 2026.

Further Reading

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024, arXiv:2310.08864.

Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP). ICCV 2023, arXiv:2303.15343.

Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024, arXiv:2304.07193.

Walke, H., et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. CoRL 2023, arXiv:2308.12952.

Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.

Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023, arXiv:2305.14314.