Section 24.1: VLA Architecture in One Equation

"I learned to read a wrist camera in three weeks. Learning what to do with a wrist camera took the other 47."

Big Picture

A Vision-Language-Action (VLA) model is, in one sentence, a multimodal LLM whose vocabulary has been extended with motor tokens so the same softmax that picks "Paris" given "The capital of France is" picks a robot joint angle given a wrist-camera image and the instruction "place the red block on the plate". This section nails down the single equation that defines VLAs, walks through action tokenization as the engineering trick that makes it work, and explains why every architectural primitive you absorbed from the text-only transformer transfers without modification to the embodied setting.

Prerequisites

This section assumes the next-token factorization from Section 6.2, the multimodal token-fusion patterns from Section 22.7, and a working intuition for KV cache mechanics from Section 9.4.

24.1.1 The VLA Equation

Fun Fact

OpenVLA's reference implementation by Stanford and Toyota Research (Kim et al., 2024) adds exactly 1,792 motor-bin tokens to a standard Llama-2-7B vocabulary: 7 dimensions x 256 bins for the WidowX arm. The choice of 256 bins per dimension was reportedly because "that's how many colors fit in a byte", a digital-graphics joke that escaped into a robotics paper.

A VLA models a single conditional distribution. Let I denote the multi-view image stack the robot's cameras produce at a timestep (typically a wrist-mounted view plus a third-person view), let \ell denote the natural-language instruction, and let a_{1:H} denote the action chunk the policy must emit over a horizon of H control steps. The VLA factorizes the action distribution autoregressively:

p_\theta(a_{1:H} \mid I, \ell) = \prod_{t=1}^{H} p_\theta(a_t \mid I, \ell, a_{1:t-1})

Code Fragment 24.1.1: The defining VLA equation. The action chunk a_{1:H} is factorized autoregressively given the image stack I and instruction \ell, which is exactly the next-token chain rule that defines a text LLM, with action tokens substituted for word-piece tokens.

Read that equation carefully. The right-hand side is exactly the next-token factorization from Section 6.2, with two differences that are not changes to the math: (a) the conditioning context contains image tokens in addition to text tokens, and (b) the symbols being predicted are drawn from a vocabulary slice reserved for motor commands. Everything that follows in this chapter is engineering glue around that single factorization.
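To make the factorization concrete, the sketch below scores one action chunk as a sum of per-step log-probabilities. The logits are made-up toy numbers over a four-token action vocabulary, and `log_softmax` and `chunk_log_prob` are hypothetical helper names rather than part of any VLA library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def chunk_log_prob(step_logits, action_tokens):
    """log p(a_{1:H} | I, l) = sum_t log p(a_t | I, l, a_{1:t-1}).

    step_logits[t] stands in for the model's logits at step t, already
    conditioned on the images, the instruction, and the earlier actions.
    """
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, action_tokens))

# Toy numbers (assumed, not from a real model): H = 3 steps, 4 action tokens.
step_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
action_tokens = [0, 1, 3]

log_p = chunk_log_prob(step_logits, action_tokens)
print(f"log p(chunk) = {log_p:.4f}, p(chunk) = {math.exp(log_p):.4f}")
```

The negative of `log_p`, averaged over a dataset of demonstrations, is the same cross-entropy loss used to train a text LLM.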

Key Insight: A VLA is a chat model whose tool is its body

Replace "vocabulary of 50k BPE pieces" with "vocabulary of 50k BPE pieces plus 1,792 motor bins". Replace "user message" with "front-camera image, wrist-camera image, instruction string". Replace "assistant message" with "an action chunk that decodes into joint deltas". The transformer trunk, the KV cache, the cross-entropy loss, FlashAttention, RoPE, and speculative decoding all transfer with zero source changes. If you can read the GPT-2 reference implementation, you can read the OpenVLA reference implementation.

24.1.2 Action Tokenization: BPE for the Body

The crux of "extend the vocabulary with motor tokens" is making a discrete vocabulary out of a continuous motor signal. Robots in the manipulation regime have somewhere between 6 and 14 controllable degrees of freedom (DOF). For a single-arm manipulator with a gripper, the action vector is 7-dimensional: three Cartesian deltas (dx, dy, dz), three rotation deltas (droll, dpitch, dyaw), and one gripper signal (open or close). The action tokenizer quantizes each dimension into K bins (RT-2 and OpenVLA both use K=256) and assigns a unique vocabulary index to each bin of each dimension. Seven DOF times 256 bins gives 1,792 reserved indices, a rounding error against a 32k or 128k text vocabulary.

Formally, the per-dimension quantization is uniform over the action range observed in training data:

b_d(a_d) = \left\lfloor \frac{a_d - \mathrm{lo}_d}{\mathrm{hi}_d - \mathrm{lo}_d} \cdot (K - 1) \right\rfloor, \qquad \mathrm{token}(d, b) = V_{\text{base}} + d \cdot K + b,

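where $\mathrm{lo}_d, \mathrm{hi}_d$ are the 1st and 99th percentiles of action dimension $d$ in the training set, $V_{\text{base}}$ is the first reserved vocabulary index (often the start of a "least-used" range of an existing BPE vocabulary), and $K = 256$. Values of $a_d$ outside $[\mathrm{lo}_d, \mathrm{hi}_d]$ are clipped to that range before binning, so $b_d$ always lies in $\{0, \dots, K-1\}$. Decoding recovers the dimension and bin via $d = \lfloor (\text{token} - V_{\text{base}}) / K \rfloor$ and $b = (\text{token} - V_{\text{base}}) \bmod K$, then maps the bin back to a continuous value, $\hat{a}_d = \mathrm{lo}_d + \frac{b}{K-1} (\mathrm{hi}_d - \mathrm{lo}_d)$.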
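The percentile bounds can be illustrated with a short sketch. It assumes a synthetic two-dimensional dataset with one wild outlier; `percentile`, `fit_bounds`, and `quantize` are hypothetical helper names, and the linear-interpolation percentile is one common convention, not necessarily the one any particular VLA codebase uses.

```python
import random

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo_i = int(pos)
    hi_i = min(lo_i + 1, len(s) - 1)
    frac = pos - lo_i
    return s[lo_i] * (1.0 - frac) + s[hi_i] * frac

def fit_bounds(actions, q_lo=1.0, q_hi=99.0):
    """Per-dimension (lo, hi) from a list of action vectors."""
    n_dof = len(actions[0])
    columns = [[a[d] for a in actions] for d in range(n_dof)]
    lo = [percentile(c, q_lo) for c in columns]
    hi = [percentile(c, q_hi) for c in columns]
    return lo, hi

def quantize(a_d, lo_d, hi_d, n_bins=256):
    """Clip to [lo_d, hi_d], then apply the floor quantizer b_d."""
    x = min(max((a_d - lo_d) / (hi_d - lo_d), 0.0), 1.0)
    return int(x * (n_bins - 1))

# Synthetic stand-in for a demonstration dataset (assumed; 2 DOF for brevity).
rng = random.Random(0)
actions = [[rng.gauss(0.0, 0.02), rng.gauss(0.0, 0.2)] for _ in range(5000)]
actions.append([5.0, -5.0])  # one wild outlier

lo, hi = fit_bounds(actions)
outlier_bins = [quantize(a, l, h) for a, l, h in zip(actions[-1], lo, hi)]
print("raw max of dim 0:", max(a[0] for a in actions))
print("99th percentile of dim 0:", hi[0])
print("outlier bins after clipping:", outlier_bins)
```

Using percentiles rather than the raw minimum and maximum keeps a single outlier from stretching the range and wasting most of the 256 bins on values that almost never occur.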

Figure 24.1.1a: Action tokenization the RT-2 / OpenVLA way (7 continuous DOFs → 7 discrete token IDs per timestep). Each of seven continuous DOFs is mapped through a per-dimension uniform quantizer into 256 bins, then translated to a unique vocabulary ID by offsetting from a reserved base index. Seven action tokens are appended to the model's text token stream per timestep, and a standard next-token loss trains the model to predict them.

The forward pass produces seven action tokens per timestep, decoded by inverting the quantization:

import numpy as np

class ActionTokenizer:
    """7-DOF end-effector tokenizer (OpenVLA / RT-2 convention).

    Each dimension is quantized into K bins. Bin index b for dimension d
    maps to vocabulary index   vocab_base + d * K + b.
    """

    def __init__(self, vocab_base: int = 31744, n_bins: int = 256, n_dof: int = 7):
        self.vocab_base = vocab_base
        self.n_bins = n_bins
        self.n_dof = n_dof
        # Per-dimension normalization derived from training-data quantiles.
        self.lo = np.array([-0.05, -0.05, -0.05, -0.4, -0.4, -0.4, 0.0])
        self.hi = np.array([ 0.05,  0.05,  0.05,  0.4,  0.4,  0.4, 1.0])

    def encode(self, action: np.ndarray) -> np.ndarray:
        x = np.clip((action - self.lo) / (self.hi - self.lo), 0.0, 1.0)
        bins = np.floor(x * (self.n_bins - 1)).astype(np.int32)
        return self.vocab_base + np.arange(self.n_dof) * self.n_bins + bins

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        offset = tokens - self.vocab_base
        bins = offset % self.n_bins
        x = bins.astype(np.float32) / (self.n_bins - 1)
        return self.lo + x * (self.hi - self.lo)

tok = ActionTokenizer()
a = np.array([0.02, -0.01, 0.03, 0.0, 0.1, -0.2, 1.0])
ids = tok.encode(a)
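print(ids)          # 7 vocabulary indices, all in [31744, 33535]
print(tok.decode(ids))  # lower bin edges: each within one bin width, (hi - lo) / 255, of a

Output (decoded values rounded to four decimals): [31922 32101 32459 32639 32927 33087 33535] [ 0.0198 -0.0104 0.0296 -0.0016 0.0988 -0.2024 1.0 ] The inputs -0.01 and 0.03 sit exactly on bin edges; floating-point division lands them a hair below the edge, so the floor drops each one bin lower (101 and 203 rather than 102 and 204).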

Code Fragment 24.1.2a: This short tokenizer class is the entire conceptual content of "action tokenization". Encode quantizes a 7-D action into seven vocabulary IDs; decode inverts the bin lookup. Uniform binning plays the role for continuous motor signals that byte-pair encoding plays for text (it has no merge step, only a fixed grid), and it is how RT-2 (Brohan et al., 2023, arXiv:2307.15818) and OpenVLA (Kim et al., 2024, arXiv:2406.09246) extend a pretrained chat model into a robot policy.

Key Insight: Why discretization wins over regression

Treating actions as discrete tokens, rather than regressing continuous values, lets the policy emit multimodal distributions. If two valid grasps lie 10 cm apart, a Gaussian regression head averages them and the gripper crashes between objects. A categorical softmax happily puts mass on both bins, and sampling picks one. This is why nucleus sampling and beam search from Chapter 4 transfer to robotics: choosing among multimodal trajectories is the same problem as choosing among multimodal sentence continuations.
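The averaging failure is easy to reproduce. The sketch below assumes synthetic one-dimensional grasp positions with two modes 10 cm apart and compares the maximum-likelihood fit of each head: the sample mean for a Gaussian, the normalized histogram for a categorical.

```python
import random
import statistics

rng = random.Random(0)
# Synthetic demonstrations (assumed): half the grasps near x = -0.05 m,
# half near x = +0.05 m, i.e. two valid modes 10 cm apart.
demos = ([rng.gauss(-0.05, 0.005) for _ in range(500)]
         + [rng.gauss(+0.05, 0.005) for _ in range(500)])

# Gaussian regression head: the maximum-likelihood mean is the sample mean.
gaussian_mean = statistics.fmean(demos)

# Categorical head: the maximum-likelihood fit is the normalized histogram.
n_bins, lo, hi = 256, -0.1, 0.1
counts = [0] * n_bins
for x in demos:
    u = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    counts[int(u * (n_bins - 1))] += 1
probs = [c / len(demos) for c in counts]

def bin_value(b):
    """Continuous value at the lower edge of bin b."""
    return lo + b / (n_bins - 1) * (hi - lo)

# Probability the categorical head assigns within 2 cm of the Gaussian mean.
mass_near_mean = sum(p for b, p in enumerate(probs)
                     if abs(bin_value(b) - gaussian_mean) < 0.02)
mode_bin = max(range(n_bins), key=probs.__getitem__)
mode_value = bin_value(mode_bin)

print(f"Gaussian mean: {gaussian_mean:+.4f} m (between the two grasps)")
print(f"Categorical mass near that mean: {mass_near_mean:.3f}")
print(f"Most likely categorical bin decodes to {mode_value:+.4f} m")
```

The Gaussian mean lands in the empty gap between the modes, while the categorical fit puts no mass there and peaks at one of the two real grasps.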

24.1.3 The End-to-End Policy

The full forward pass strings four off-the-shelf components together. A vision encoder (SigLIP, DINOv2, or a fused pair) consumes each camera view and emits a sequence of image-patch tokens. The instruction string is tokenized by the LLM's existing BPE tokenizer. The two streams are interleaved with a small set of separator tokens and fed into a decoder-only transformer. The LM head produces logits over the unified text-plus-action vocabulary. At inference time you greedy-decode (or low-temperature sample) seven tokens, run them through the inverse action tokenizer, and ship the resulting 7-D vector to the robot's low-level controller.

Stage	Input	Output	Reused from text LLM?
Vision encoder	Multi-view images	~256 patch tokens	Yes (ViT / SigLIP)
Text tokenizer	Instruction string	~32 BPE tokens	Yes (LLaMA BPE)
Transformer trunk	Interleaved token sequence	Hidden states	Yes (LLaMA 7B/13B)
LM head	Last hidden states	Vocab logits (text + action)	Yes (Linear + softmax)
Action detokenizer	7 sampled token IDs	7-D end-effector delta	No (the only new piece)

Figure 24.1.2b: The five-stage pipeline. Only the final detokenizer is robotics-specific. Every other stage is borrowed wholesale from the text-only LLM stack.

The reference pseudocode for a single control step is short enough to fit on one screen:

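import torch

@torch.no_grad()
def vla_step(model, tokenizer, action_tokenizer,
             wrist_img, third_img, instruction):
    """One control step (batch size 1): images + instruction -> 7-D delta.

    Assumes a Hugging Face-style decoder whose forward pass accepts
    `inputs_embeds`, `past_key_values`, and `use_cache`, and returns
    `.logits` and `.past_key_values`. `model.vision_encoder` and
    `model.embed_tokens` are illustrative names for the patch encoder
    and the token-embedding table.
    """
    # 1. Encode vision and text into one embedding sequence of shape (1, T, D).
    v_embeds = model.vision_encoder([wrist_img, third_img])
    t_ids = torch.tensor([tokenizer.encode(f"In: {instruction}\nOut:")],
                         device=v_embeds.device)
    inputs = torch.cat([v_embeds, model.embed_tokens(t_ids)], dim=1)

    # 2. Greedy-decode exactly n_dof tokens, one per action dimension.
    action_ids = []
    past = None
    for d in range(action_tokenizer.n_dof):
        out = model(inputs_embeds=inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values
        logits = out.logits[:, -1]
        # Restrict the argmax to the slice [lo, hi) reserved for dimension d,
        # so a text token can never be selected.
        lo = action_tokenizer.vocab_base + d * action_tokenizer.n_bins
        hi = lo + action_tokenizer.n_bins
        tok_id = lo + logits[:, lo:hi].argmax(dim=-1)
        action_ids.append(tok_id)
        # With the KV cache, only the new token is fed on the next iteration.
        inputs = model.embed_tokens(tok_id.unsqueeze(-1))

    # 3. Detokenize. Stack to shape (1, n_dof) so decode broadcasts
    #    correctly against the per-dimension lo/hi arrays.
    ids = torch.stack(action_ids, dim=-1).cpu().numpy()
    return action_tokenizer.decode(ids)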

Code Fragment 24.1.3: One control step. The logit mask on each iteration restricts decoding to the slice for the current DOF, so the policy cannot accidentally emit a text token while sampling motor commands. This is the constrained-decoding pattern from Chapter 12 applied to action selection.

24.1.4 Action Chunking and Receding-Horizon Control

Predicting one action and immediately executing it means a round trip through the network at every control step, and the transformer forward pass is the bottleneck. The fix is action chunking: predict an entire horizon a_{1:H} at once, execute the first few actions, and re-plan. This is the receding-horizon pattern from classical model-predictive control, except the model is a 7B-parameter transformer instead of a 50-line MPC solver. Chunking lets a policy that forward-passes at 5-10 Hz nevertheless emit smooth 30 Hz control by interpolating between the predicted waypoints.
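The predict, execute-a-few, re-plan loop can be sketched on a one-dimensional toy problem. Everything here is assumed for illustration: `toy_policy` stands in for one VLA forward pass, and `substeps=3` plays the role of turning 10 Hz waypoints into 30 Hz commands.

```python
def toy_policy(position, goal, horizon):
    """Hypothetical stand-in for one VLA forward pass: H waypoints toward goal."""
    step = (goal - position) / (2 * horizon)  # deliberately cautious plan
    return [position + step * (t + 1) for t in range(horizon)]

def interpolate(start, waypoints, substeps):
    """Linear interpolation so a slow policy can feed a faster controller."""
    out, prev = [], start
    for w in waypoints:
        out.extend(prev + (w - prev) * (s + 1) / substeps
                   for s in range(substeps))
        prev = w
    return out

def run(goal=1.0, horizon=8, execute=2, substeps=3, replans=20):
    position, forward_passes, commands = 0.0, 0, []
    for _ in range(replans):
        chunk = toy_policy(position, goal, horizon)   # predict a_{1:H}
        forward_passes += 1
        fine = interpolate(position, chunk[:execute], substeps)  # run the first few
        commands.extend(fine)
        position = fine[-1]                           # then re-plan from here
    return position, forward_passes, commands

position, forward_passes, commands = run()
print(f"{forward_passes} forward passes -> {len(commands)} controller commands")
print(f"final position: {position:.4f}")
```

Each forward pass yields six controller commands here, which is the sense in which chunking decouples the policy rate from the control rate.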

Real-World Scenario: Why chunk size matters for safety

If H=8, the chunk is executed at a 5 Hz control rate, and the policy replans only once the chunk is exhausted, the robot follows an open-loop trajectory for 8 / 5 = 1.6 seconds before the next replan. A human walking into the workspace during that window can be hit by a robot that is blind to new obstacles. Production VLAs at 2026 robotics startups (Physical Intelligence, Skild, Figure) typically set H so the replan period is at most 200 ms and pair the policy with a separate reactive safety layer that can halt execution mid-chunk. The chunking knob trades smoothness against reactivity.
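The arithmetic behind the open-loop window is worth writing down. This is a minimal sketch with hypothetical helper names; it assumes every action in the executed portion of the chunk takes one control period.

```python
import math

def open_loop_window_s(actions_executed, control_rate_hz):
    """Seconds the robot runs without looking at new observations."""
    return actions_executed / control_rate_hz

def max_actions_for_budget(replan_budget_s, control_rate_hz):
    """Largest number of actions to execute before replanning.

    The small epsilon guards against floating-point error such as
    0.2 * 30 evaluating to 6.000000000000001.
    """
    return max(1, math.floor(replan_budget_s * control_rate_hz + 1e-9))

print(open_loop_window_s(8, 5))          # 8 actions at 5 Hz
print(max_actions_for_budget(0.2, 30))   # 200 ms budget at 30 Hz
print(max_actions_for_budget(0.2, 5))    # 200 ms budget at 5 Hz
```

At a 5 Hz control rate a 200 ms budget leaves room for only one action per replan, which is why tight reactivity budgets tend to go together with higher control rates.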

24.1.5 Where the Equation Lives in the Broader Stack

The factorization p_theta(a_{1:H} | I, l) sits at the middle layer of a three-tier robotics stack that you will see repeatedly through Chapters 39 and 40. At the top is a planning LLM (covered in Section 24.7 on SayCan and Section 24.8 on Code-as-Policies) that decomposes a high-level instruction like "tidy the kitchen" into subgoals. In the middle is the VLA policy that turns each subgoal plus current pixels into actions. At the bottom is a low-level controller (PID, impedance, or operational-space) running at 500-1000 Hz that turns the VLA's coarse 7-D commands into joint torques. Most of this chapter focuses on the middle layer; the top layer is Chapter 39's job; the bottom layer lives in textbooks on classical control and is, refreshingly, mostly unchanged by the LLM revolution.

Layer	Rate	Input	Output	Implementation
Planner LLM	0.1-1 Hz	Goal + context	Subgoal sequence	GPT-4o, Claude, Gemini via API
VLA policy	5-20 Hz	Subgoal + pixels	End-effector deltas	OpenVLA, pi-0, RT-2-X
Low-level controller	500-1000 Hz	Deltas + state	Joint torques	PID, impedance, or operational-space control

Figure 24.1.3a: The three-tier robotics stack. The VLA equation governs only the middle layer. Mixing rate-mismatched tiers requires the action chunking covered in Section 24.1.4.
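The rate mismatch between tiers can be seen by counting calls in a toy scheduler. The rates below are assumed values inside the table's ranges, and the fast tier simply holds the most recent command (a zero-order hold), a simplification of the interpolation described in Section 24.1.4.

```python
PLANNER_HZ, VLA_HZ, CONTROLLER_HZ = 1, 10, 1000  # assumed rates

def simulate(seconds=2):
    """Tick at the controller rate and call the slower tiers on their schedule."""
    calls = {"planner": 0, "vla": 0, "controller": 0}
    subgoal, command = None, None
    for tick in range(seconds * CONTROLLER_HZ):
        if tick % (CONTROLLER_HZ // PLANNER_HZ) == 0:
            subgoal = f"subgoal-{calls['planner']}"   # slow tier: pick a subgoal
            calls["planner"] += 1
        if tick % (CONTROLLER_HZ // VLA_HZ) == 0:
            command = (subgoal, calls["vla"])          # middle tier: new delta
            calls["vla"] += 1
        # Fast tier: track whatever command is currently held.
        held = command
        calls["controller"] += 1
    return calls, held

calls, last_command = simulate()
print(calls)
print("last held command:", last_command)
```

Each VLA command is reused for 100 controller ticks in this setup, so the quality of what happens between VLA updates depends on the low-level controller, not on the transformer.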

Research Frontier

Diffusion policies versus discrete-token policies

The discrete-token formulation in Section 24.1.1 is one of two dominant approaches in 2026 VLAs. The other is the diffusion policy, where p_theta(a_{1:H} | I, l) is parameterized by a denoising network and sampled by iterative refinement instead of autoregressive decoding. Physical Intelligence pi-0 (covered in Section 24.3) uses flow matching, a close cousin of diffusion that produces continuous-valued actions directly without binning. Both formulations can represent multimodal action distributions, but they make different trade-offs on inference latency, training stability, and the smoothness of emitted trajectories. The empirical picture in 2026 is that both work; the choice is driven by team familiarity and inference-stack preferences more than by hard performance gaps.
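To show what "sampled by iterative refinement" means, here is a toy Euler sampler for a flow. It is a sketch under strong assumptions: `toy_velocity` is a hand-written velocity field that stands in for a learned network (it is not pi-0's model), and the target action is supplied directly rather than inferred from images and text.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (assumed for illustration).

    For a straight-line path from noise to `target`, the conditional
    flow-matching velocity at time t is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action(target, n_steps=10, seed=0):
    """Integrate the flow from Gaussian noise (t = 0) to an action (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                       # t stays below 1, so no division by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [0.02, -0.01, 0.03, 0.0, 0.1, -0.2, 1.0]
action = sample_action(target)
print([round(a, 6) for a in action])
```

The output is a continuous 7-D vector with no binning step, and the cost is `n_steps` network evaluations per chunk rather than one evaluation per action token.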

24.1.6 Why the Text Vocabulary Survives Action Finetuning

A reasonable worry: extending the vocabulary with motor bins and finetuning on robot data could cause catastrophic forgetting of the base LLM's text capabilities. In practice the damage is limited, for two reasons. First, the text vocabulary is one to two orders of magnitude larger than the action vocabulary, so the gradient pressure on text tokens from action-data batches is small. Second, the co-training recipe popularized by RT-2 mixes robot demonstrations with a small fraction of text-only and vision-language data; this regularizes the trunk. The practical consequence is that a VLA can still hold a conversation about the scene it sees, which is the substrate that makes "explainable robot actions" a feature rather than a research aspiration. You can ask the policy "why did you grasp the cup that way?" mid-trajectory and get a coherent answer from the same model that is also emitting motor tokens.
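A co-training batch sampler can be as simple as a weighted draw over data sources. The sketch below is illustrative only: the source names, the placeholder examples, and the 80/15/5 mixture weights are assumptions, not the ratios used by any published system.

```python
import random
from collections import Counter

def make_batch(sources, weights, batch_size, rng):
    """Draw each example's source according to the mixture weights."""
    names = list(sources)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(name, rng.choice(sources[name])) for name in picks]

# Placeholder datasets (assumed): strings stand in for real training examples.
sources = {
    "robot": [f"demo-{i}" for i in range(100)],
    "vision_language": [f"vqa-{i}" for i in range(100)],
    "text": [f"doc-{i}" for i in range(100)],
}
weights = {"robot": 0.80, "vision_language": 0.15, "text": 0.05}  # assumed mix

rng = random.Random(0)
batch = make_batch(sources, weights, batch_size=10_000, rng=rng)
shares = {name: count / len(batch)
          for name, count in Counter(name for name, _ in batch).items()}
print(shares)
```

Every batch then carries some gradient signal from non-robot data, which is the regularizing effect the paragraph above describes.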

Warning: The biggest gotcha: action-space mismatch

A VLA trained on one robot's action space (say, the WidowX 7-DOF end-effector) does not transfer to another robot's action space (say, a 6-DOF arm with a different gripper convention) without retokenization and at least light finetuning. The action vocabulary is robot-specific. The cross-embodiment generalization that RT-2-X and pi-0.5 promise (Section 24.3, Section 24.4) is delivered by training on data from many robots with a unified action space, not by zero-shot vocabulary reuse. Mixing this up costs robotics startups months of debugging time when they expand from one hardware platform to a second.
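The mismatch is easy to demonstrate: the same token IDs decode to different physical commands under different range tables. In this sketch the ranges for robot A follow the tokenizer defaults used earlier in this section, while robot B's ranges and the 6-dimensional robot are hypothetical.

```python
def decode(tokens, lo, hi, vocab_base=31744, n_bins=256):
    """Map action token IDs back to continuous values, with sanity checks."""
    if len(tokens) != len(lo):
        raise ValueError(f"expected {len(lo)} action tokens, got {len(tokens)}")
    out = []
    for d, tok in enumerate(tokens):
        b = tok - (vocab_base + d * n_bins)
        if not 0 <= b < n_bins:
            raise ValueError(f"token {tok} is outside the slice for dimension {d}")
        out.append(lo[d] + b / (n_bins - 1) * (hi[d] - lo[d]))
    return out

tokens = [31922, 32101, 32459, 32639, 32927, 33087, 33535]

# Robot A: the ranges used by the tokenizer earlier in this section.
lo_a = [-0.05, -0.05, -0.05, -0.4, -0.4, -0.4, 0.0]
hi_a = [0.05, 0.05, 0.05, 0.4, 0.4, 0.4, 1.0]
# Robot B (hypothetical): tighter translation range, gripper in [-1, 1].
lo_b = [-0.02, -0.02, -0.02, -0.4, -0.4, -0.4, -1.0]
hi_b = [0.02, 0.02, 0.02, 0.4, 0.4, 0.4, 1.0]

action_a = decode(tokens, lo_a, hi_a)
action_b = decode(tokens, lo_b, hi_b)
print(f"dx on robot A: {action_a[0]:.4f} m, on robot B: {action_b[0]:.4f} m")

# A 6-dimensional action space (hypothetical) cannot consume 7 tokens at all.
mismatch_error = None
try:
    decode(tokens, lo_a[:6], hi_a[:6])
except ValueError as err:
    mismatch_error = str(err)
print("6-D robot:", mismatch_error)
```

The dimension-count error is the friendly failure. The silent one is robot B, where decoding succeeds but the arm moves a different distance than the policy intended.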

Key Takeaway

A VLA is fully specified by one equation, p_theta(a_{1:H} | I, l) = prod_t p_theta(a_t | I, l, a_{1:t-1}), which is the same next-token factorization that defines text LLMs. The only new engineering piece is action tokenization, a 30-line module that quantizes continuous motor commands into a vocabulary slice. Everything else (vision encoders, transformers, sampling, KV caches) transfers unchanged from the text-only stack. This is why robotics in 2026 hires LLM engineers rather than reinventing them.

Self-Check

Q1: If a robot has 14 controllable DOF and you use 256 bins per DOF, how many vocabulary indices must you reserve for actions? What fraction of a 128k Llama-3 vocabulary does that consume?

Show Answer

You need 14 x 256 = 3,584 action vocabulary indices, one slice of 256 per DOF. Against a 128k Llama-3 tokenizer, that consumes about 3,584 / 128,000 = 2.8% of the vocabulary. This is small enough that the text capabilities of the base model are not meaningfully eroded, especially when co-trained with text-only batches. Most implementations recycle the last 3,584 tokens of the existing vocabulary rather than growing the embedding table, which avoids re-initializing rows and preserves checkpoint compatibility.
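The same arithmetic as a quick check, with the 128,000 figure taken as the round number used in the question:

```python
def reserved_action_ids(n_dof, n_bins=256):
    """One slice of n_bins vocabulary indices per degree of freedom."""
    return n_dof * n_bins

reserved = reserved_action_ids(14)
fraction = reserved / 128_000
print(f"{reserved} indices, {fraction:.1%} of a 128k vocabulary")

# The single-arm case from Section 24.1.2, for comparison.
print(f"{reserved_action_ids(7)} indices for a 7-DOF arm")
```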

Q2: Explain why a Gaussian regression head over actions fails on a bimodal grasp distribution, but a categorical softmax over 256 bins succeeds. Tie your answer to the entropy of the predicted distribution.

Show Answer

A Gaussian head parameterizes the action distribution by a mean and variance, which is unimodal by construction. If the human demonstrations contain two valid grasps (approach from the left or from the right), the mean ends up between the two modes (a third invalid grasp from the middle), and the variance balloons to cover both. The categorical softmax over 256 bins can place probability mass on two non-adjacent bins independently; its predictive entropy is bounded by log(256) and it can represent arbitrary multimodal shapes. The categorical head's flexibility is the same reason language models prefer softmax over Gaussian heads: real action and token distributions are rarely unimodal.
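The entropy bound can be checked numerically. This sketch uses two hand-picked bins (64 and 192, chosen arbitrarily) to stand for the two grasps.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability bins."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

n_bins = 256

# Two equally likely, non-adjacent grasps.
two_grasps = [0.0] * n_bins
two_grasps[64] = 0.5
two_grasps[192] = 0.5

# The maximum-entropy categorical: uniform over all bins.
uniform = [1.0 / n_bins] * n_bins

h_two = entropy(two_grasps)
h_max = entropy(uniform)
print(f"bimodal entropy: {h_two:.4f} nats (log 2 = {math.log(2):.4f})")
print(f"upper bound:     {h_max:.4f} nats (log 256 = {math.log(256):.4f})")
```

The bimodal distribution sits far below the log(256) ceiling: the categorical head spends exactly one bit of uncertainty on "left or right" and none on the invalid region in between.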

Q3: You are debugging a VLA that emits a text token in the middle of an action chunk, crashing the detokenizer. Sketch a one-line fix using the constrained-decoding mask from Code Fragment 24.1.3.

Show Answer

The fix is to apply the per-DOF logit mask unconditionally during action decoding. Add a line that masks logits outside the half-open slice [lo, hi) to negative infinity before argmax: logits = logits.masked_fill(~slice_mask, float('-inf')) where slice_mask marks only indices in [vocab_base + d * n_bins, vocab_base + (d+1) * n_bins). With that mask the argmax can never escape to a text token, even if the unconstrained probability for some BPE token is higher on a noisy input. This is the standard constrained-decoding pattern applied at the sampling step rather than at training time.
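The effect of the mask can be seen on a toy vocabulary without any deep-learning framework. The sizes below (8 text tokens, 2 DOF, 4 bins) and the logit values are made up for illustration; `mask_to_slice` is a hypothetical helper that mirrors the `masked_fill` line above using plain lists.

```python
import math

def mask_to_slice(logits, d, vocab_base, n_bins):
    """Set every logit outside the half-open slice [lo, hi) to -inf."""
    lo = vocab_base + d * n_bins
    hi = lo + n_bins
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy vocabulary (assumed): 8 text tokens, then 2 DOF x 4 bins = 8 action tokens.
vocab_base, n_bins = 8, 4
logits = [0.1] * 16
logits[3] = 9.0    # a text token wins on a noisy input
logits[9] = 2.0    # best bin for dimension 0
logits[14] = 1.5   # best bin for dimension 1

unconstrained = argmax(logits)
constrained = [argmax(mask_to_slice(logits, d, vocab_base, n_bins))
               for d in range(2)]
print("unconstrained pick:", unconstrained)
print("constrained picks: ", constrained)
```

Without the mask the decoder picks text token 3. With it, each dimension's pick stays inside its own slice.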

What's Next

Continue to Section 24.2: OpenVLA-7B Reference Implementation.

Section 24.1 gave you the equation; Section 24.2 opens the hood on OpenVLA-7B, the open-weights reference implementation that ships with PyTorch source you can run today. You will see the vision encoder choice (fused SigLIP+DINOv2), the training-data recipe (Open X-Embodiment), and the inference performance characteristics (about 6 Hz on an A100) in concrete code.

Further Reading

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Black, K., et al. (2024). pi-0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Technical Report.

Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024, arXiv:2310.08864.

Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. ICML 2023, arXiv:2303.03378.

Chi, C., et al. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. IJRR 2024, arXiv:2303.04137.

Zhao, T. Z., et al. (2024). ALOHA Unleashed: A Simple Recipe for Robot Dexterity. CoRL 2024, arXiv:2410.13126.