Section 58.5: Training-Inference Co-Design

"Training and inference used to be separate departments. In 2026 they share a Slack channel and an architectural budget."
Sched, Co-Design-Native AI Agent

Note: Learning Objectives

Explain why inference-aware scaling shifts the optimum toward smaller models trained on more tokens.
Walk through the training-versus-inference asymmetry that makes MoE economically attractive.
Compare speculative decoding (Medusa, EAGLE), inference-aware scaling, and multi-stage pipelines as co-design strategies.
Match a product workload (volume, latency target, capability tier) to its best co-design pattern.

Big Picture

The first decade of LLM engineering treated training and inference as separate concerns. By 2026 the seam has dissolved: every training decision is also an inference decision, and the silicon, kernels, and algorithms covered in this chapter all encode that fact. Co-design is not an optimization layered on top; it is the operating frame for frontier systems work.

Prerequisites

This section assumes the LLM pretraining objective from Section 6.2 and the inference-stack mechanics from Section 9.1. The distributed-training patterns are introduced in the next chapter of this part.

Key Insight: The training-inference seam was always artificial

The cleanest summary of this chapter, and arguably of Part XII as a whole, is that the apparent dichotomy between "training compute" and "inference compute" was a historical artifact of a time when the field had so little of both that they had to be optimized separately. By 2026, with frontier models served to billions of users daily, total cost of ownership is dominated by amortized inference cost; the inference-aware scaling laws of Sardana et al. (2024) and the co-trained draft heads of Medusa and EAGLE are the technical expression of that economic reality. The engineering that wins in 2026-27 is the engineering that stops drawing a boundary between training and inference and starts treating them as one loop.

Historically, training was a one-time event optimized for FLOPs efficiency, while inference was a continuous workload optimized for latency and throughput. Sardana et al. (2024) showed that the classical Chinchilla "compute-optimal" scaling laws pick models that are too large and under-trained relative to the inference-aware optimum: if you serve a model billions of times, it is worth spending more training compute to save inference compute later. This insight has filtered into every frontier-model decision since 2024.

This section closes Chapter 58 with the systems-level co-design questions this implies: how training and inference share hardware, how MoE routing affects both sides, how the speculative-decoding / draft-model pattern blurs the boundary, and which questions remain open going into 2027.

Side-by-side diagram — **Figure 58.5.1**: Training and inference were once two separate stacks with two separate teams. By 2026, six layers (scaling-law math, attention kernels, quantization schedules, MoE routing, speculative draft heads, multi-stage pipelines) are co-designed across both. The fusion did not happen all at once; the bottom timeline shows when each layer crossed over. Left column shows two separate stacks circa 2023: a Training stack (Chinchilla optimum, FA-2, bf16, all params trained, H100 cluster) and an Inference stack (throughput/latency, PagedAttention, int4 post-hoc, retrofitted routing, whatever-fits hardware). A central arrow labeled '2024-26: layers fuse (Sardana, MoE, QAT, EAGLE)' points to the right column. Right column shows one fused 2026 stack with co-designed layers: inference-aware scaling laws, FA-4 attention, quantization-aware training, MoE trained-all-serve-sparse, speculative-decoding draft heads, and a multi-stage HERMES pipeline. A bottom timeline marks FA-1 in 2022 through FA-4 + HERMES in 2026.

58.5.1 Inference-aware scaling laws

Fun Fact

Speculative decoding works because most tokens are easy and a tiny draft model can guess them while a big model verifies in parallel. The trick was first published in 2022 as a curiosity and is now the default in vLLM, TGI, and SGLang, which together serve roughly half of all open-weight LLM inference on the internet. Few research ideas have traveled that fast.

Sardana et al. (2024) extend Chinchilla's compute-optimal frontier to account for amortized inference cost. The result is intuitive: if a model will be served for a long time at high volume, the optimal balance shifts toward a smaller model trained for longer, because every parameter removed is saved again on each served token. For frontier closed-API models served to hundreds of millions of users, the inference-aware optimum is very different from the training-only optimum. The framework explains the trend toward dense models trained at aggressive token-to-parameter ratios (Llama-3.1 8B on 15T+ tokens, roughly 1,800:1; SmolLM2 360M on FineWeb-Edu; Liquid LFM2.5-350M on 28T tokens, an 80,000:1 ratio).

Sardana inference-aware scaling shifts optimum toward more tokens per parameter — **Figure 58.5.2**: How Sardana et al. (arXiv:2401.00448, 2024) shifted the field's compute-optimal frontier. The Chinchilla training-only optimum at 20:1 tokens-per-parameter ratio (red, dashed) minimizes training cost; the inference-aware curve (blue) adds amortized inference cost over the model's serving lifetime and shifts the optimum 50x to the right, toward smaller models trained on more tokens. The 2026 reference points cluster in that shifted zone: Llama-3.1 8B near 1800:1, SmolLM2 360M and Qwen3-0.6B in the 30,000-60,000:1 band, and Liquid LFM2.5-350M setting the 2026 record at 80,000:1 on 28T training tokens. The economics that make this rational are exactly the "frontier serves to a billion users daily" condition the section's first paragraph describes.
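The shift described above can be reproduced numerically. The sketch below is a simplified illustration in the spirit of Sardana et al., not their exact procedure: it assumes the parametric loss fit published with Chinchilla (Hoffmann et al., 2022), the common approximations of 6 FLOPs per parameter per training token and 2 FLOPs per parameter per served token, and an arbitrarily chosen target loss of 2.0. The function names and the search grid are illustrative choices.

```python
import math

# Chinchilla parametric loss fit (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**ALPHA + B / D**BETA
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n_params, target_loss):
    """Training tokens a model of n_params needs to reach target_loss."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return math.inf  # this model size can never reach the target
    return (B / gap) ** (1.0 / BETA)

def lifetime_flops(n_params, target_loss, inference_tokens):
    """Training FLOPs (6*N*D) plus serving FLOPs (2*N per served token)."""
    d_train = tokens_for_loss(n_params, target_loss)
    return 6 * n_params * d_train + 2 * n_params * inference_tokens

def best_size(target_loss, inference_tokens, grid):
    """Model size on the grid with the lowest lifetime FLOPs."""
    return min(grid, key=lambda n: lifetime_flops(n, target_loss, inference_tokens))

# Log-spaced model sizes from 1e9 to 1e12 parameters.
grid = [10 ** (9 + 3 * i / 300) for i in range(301)]
TARGET_LOSS = 2.0  # assumed quality target, chosen only for illustration

for served in (0.0, 1e12, 1e13):
    n = best_size(TARGET_LOSS, served, grid)
    d = tokens_for_loss(n, TARGET_LOSS)
    print(f"served tokens={served:.0e}  N={n:.2e}  D={d:.2e}  tokens/param={d / n:.0f}")
```

With zero served tokens the search recovers a training-only optimum; as the served-token count grows, the best model size shrinks and the tokens-per-parameter ratio rises, which is the qualitative effect the figure describes. The absolute numbers depend entirely on the assumed loss fit and target.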

58.5.2 MoE: where co-design pays off the most

Mixture-of-experts models break the symmetry between training and inference. During training, every expert's weights, gradients, and optimizer state stay resident and all experts are updated over the course of a batch; during inference, each token touches only its active subset. The economic story is therefore: pay for big training (state for all experts in memory), get cheap per-token inference compute (sparse activation), although the full set of expert weights must still be resident when serving. DeepSeek V4 (671B parameters, ~37B active per token) and Qwen3-235B-A22B (128 experts, top-8 routing) are the 2026 reference points. The Friendli MoE comparison walks through the routing schemes.

MoE inference: top-k router selects a small subset of experts per token — **Figure 58.5.3**: MoE routing at inference time. A lightweight router computes a softmax over N experts and selects the top-k (typically k=2 to 8). Only those experts execute, so per-token compute scales with k rather than N. Training is the inverse: all N experts receive gradient, which is what makes training cost much larger than inference cost for an MoE model.

# MoE top-k routing in PyTorch (the inference-time path).
# Demonstrates why only k expert FLOPs are paid per token even though
# N expert parameter sets live in memory during training.
import torch
import torch.nn.functional as F

batch, d_model, n_experts, k = 4, 32, 8, 2
x = torch.randn(batch, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

# 1) Router scores per token, top-k selection, renormalized weights.
scores = gate(x)                                # (batch, n_experts)
topk_w, topk_idx = scores.topk(k, dim=-1)       # (batch, k)
topk_w = F.softmax(topk_w, dim=-1)              # weights sum to 1 over the k

# 2) Dispatch each token to its selected experts and combine the outputs.
y = torch.zeros_like(x)
for j in range(k):
    for b in range(batch):
        e = topk_idx[b, j].item()
        y[b] += topk_w[b, j] * experts[e](x[b])

print("active experts per token :", topk_idx.tolist())
print("active FLOPs / total FLOPs:", f"{k}/{n_experts} = {k/n_experts:.2f}")

Output (the expert indices depend on the random initialization and will differ between runs):
active experts per token : [[3, 7], [0, 5], [2, 6], [1, 4]]
active FLOPs / total FLOPs: 2/8 = 0.25

Code Fragment 58.5.1a: A minimal top-k MoE router and expert combination in PyTorch. Production implementations (Switch, Mixtral, DeepSeek-MoE) vectorize the dispatch through scatter / gather and add load-balancing losses, but the inference cost identity (per-token FLOPs proportional to k, not to N) is exactly the one above.

Fun Fact: Mental Model

An H100 rents for about $3 to $5 per hour in 2026, which is roughly the hourly rate of a senior staff engineer in many markets. The asymmetric part is that the H100 works 24/7 with no PTO and never asks for an offsite, but it also cannot debug its own OOM at 3 a.m. and produces nothing useful unless someone parameterized the YAML correctly. The teams that win at co-design are the ones that internalize this: every spec sheet number, every batching decision, every quantization choice maps to either a salary or a salary-equivalent burn rate, and "let it run another day to be safe" can cost as much as a sprint.

Real-World Scenario: MoE Compute vs Memory for DeepSeek V4

DeepSeek V4 advertises 671B total parameters and ~37B active per token. With 256 experts and top-8 routing, each token touches 8 / 256 = 3.1% of the expert parameter budget. The memory required at inference is roughly 671B·2 bytes ≈ 1.34 TB for the weights (FP16), so a single H200 node with 8×141 GB = 1.13 TB cannot hold the FP16 weights. Quantizing to 8 bits (~671 GB) fits but leaves limited headroom for the KV cache, and 4-bit quantization (~336 GB) fits comfortably. The compute per token, however, scales with the active 37B: at 2 FLOPs per parameter per token, that is ~74 GFLOPs per generated token, about 18× cheaper than a dense 671B model would cost. This is why MoE wins at high-throughput inference but loses on memory-constrained edge: the bottleneck switches from FLOPs to bytes-of-weights resident.
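The arithmetic in this scenario is easy to re-run for other models or nodes. The sketch below uses only the figures quoted above and counts weights alone; it makes no allowance for KV cache, activations, or runtime overhead, so a "fits" result is a necessary condition rather than a deployment guarantee. Variable and function names are illustrative.

```python
# Figures quoted in the scenario above.
TOTAL_PARAMS = 671e9          # total parameters
ACTIVE_PARAMS = 37e9          # parameters active per token
N_EXPERTS, TOP_K = 256, 8     # experts and top-k routing
NODE_HBM_BYTES = 8 * 141e9    # one node of 8 x H200 at 141 GB each

def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8

# Memory side: does the full weight set fit on one node?
fits_on_node = {}
for bits in (16, 8, 4):
    size = weight_bytes(TOTAL_PARAMS, bits)
    fits_on_node[bits] = size <= NODE_HBM_BYTES
    print(f"{bits:>2}-bit weights: {size / 1e9:7.1f} GB  fits on node: {fits_on_node[bits]}")

# Compute side: FLOPs per generated token at 2 FLOPs per active parameter.
expert_fraction = TOP_K / N_EXPERTS
moe_flops = 2 * ACTIVE_PARAMS
dense_flops = 2 * TOTAL_PARAMS
compute_ratio = dense_flops / moe_flops

print(f"experts touched per token: {expert_fraction:.1%}")
print(f"MoE FLOPs per token      : {moe_flops / 1e9:.0f} GFLOPs")
print(f"dense / MoE compute ratio: {compute_ratio:.1f}x")
```

The two halves of the script make the asymmetry explicit: the memory check depends on total parameters, while the compute estimate depends only on active parameters.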

58.5.3 Speculative decoding and draft-model patterns

Speculative decoding (Leviathan et al., 2023; popularised in 2024-25 production) runs a small "draft" model that proposes several tokens ahead, then verifies them with the target model in a single parallel pass. In practice this yields 2-4x throughput gains at no quality cost, because the verification step only keeps tokens that are consistent with the target model's own output distribution. The pattern reaches its logical conclusion in Medusa and EAGLE, which co-train target and draft heads. The training cost is small; the inference savings are large. This is co-design done right.
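The draft-then-verify loop can be sketched in a few lines. The example below is a toy illustration under strong assumptions: both "models" are hypothetical deterministic functions over integer tokens, decoding is greedy, and verification uses exact-match acceptance rather than the rejection-sampling rule that Leviathan et al. use for sampled decoding. It is meant to show why the output is identical to target-only decoding while the number of target passes drops, not to model any real serving stack.

```python
def target_next(ctx):
    # Hypothetical "large" model: a deterministic toy rule over integer tokens.
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    # Hypothetical "small" model: agrees with the target except when the
    # context length is a multiple of 7.
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 7 == 0 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, gamma=4):
    """Greedy speculative decoding with exact-match verification."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft proposes gamma tokens autoregressively (cheap).
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(out + proposal))
        # 2) The target scores all gamma + 1 prefixes. A real system does
        #    this in one batched forward pass, counted here as one pass.
        target_passes += 1
        verified = [target_next(out + proposal[:i]) for i in range(gamma + 1)]
        # 3) Keep the longest prefix the target agrees with, then append the
        #    target's own token at the first disagreement (or a bonus token
        #    if every proposed token was accepted).
        n_ok = 0
        while n_ok < gamma and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(proposal[:n_ok])
        out.append(verified[n_ok])
    return out[: len(prompt) + n_new], target_passes

prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 20)
spec_out, passes = speculative_decode(prompt, 20)
print("identical output:", spec_out == baseline)
print("target passes   :", passes, "instead of 20")
```

Every iteration emits at least one token chosen by the target, so the worst case costs one target pass per token plus the wasted draft work; the gain comes entirely from iterations where several drafted tokens are accepted at once.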

58.5.4 Multi-stage inference pipelines

MIT HERMES (2025) formalises the multi-stage AI inference pipeline that 2026 production systems actually use: prefill on one fabric (often Blackwell GPU), token generation on another (often Groq LPU or Cerebras CS-3), retrieval and tool-use on a third (CPU + accelerators), and any vision / multimodal preprocessing on a fourth (specialized vision-encoder NPUs). HERMES's contribution is a scheduler that places each stage on its best-fit silicon and minimizes inter-stage copy cost.

58.5.5 Comparing co-design strategies

Table 58.5.1b: Training-inference co-design patterns, 2026.

Pattern	Training cost	Inference benefit	Best example
Inference-aware scaling	Higher (more tokens)	Smaller, faster inference	SmolLM2, Liquid LFM2.5
Mixture-of-experts	High (all experts trained)	Active-fraction-only inference	DeepSeek V4, Qwen3-235B
Speculative decoding	Small draft head training	2-4x throughput	Medusa, EAGLE
Quantization-aware training	Higher (longer schedule)	4-bit / 1.58-bit inference	BitNet b1.58 2B4T
Multi-stage pipeline	None (placement-only)	Right silicon per stage	HERMES scheduler

Key Insight: The scaling wall is now an inference wall

Through 2022-23 the framing was that scaling pretraining FLOPs would hit a wall (model size, available tokens, frontier dataset quality). By 2026 the wall is on the inference side: how cheaply and quickly can a trillion-parameter model serve a billion requests per day. The Cerebras / Groq / BitNet / FlashAttention-4 cluster of work is one connected response to that question. The Sardana paper's "inference-optimal scaling" is the algorithmic counterpart. Treat "frontier" in 2026-27 as a story about inference economics as much as about model capability.

Real-World Scenario: Where co-design changes a product decision

You are designing a customer-support agent for a 100,000-call-per-day product. The naive choice is "the best chat model your budget allows". The co-design choice is: a 7B fine-tune that fits in vLLM with FA-4 on a single H100 at $0.30 per million tokens, with a 70B model held in reserve for the 5% of conversations that escalate. Total cost is one tenth of the naive choice and quality on simple cases is indistinguishable. This pattern, "small fast default, big careful fallback", is the operational expression of co-design and is the right starting architecture for most 2026 products. Anthropic's Computer Use cascade (Haiku for the click-loop, Sonnet for screen reasoning, Opus 4.6 only when escalated) is the same pattern operationalized at API scale.
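The cost claim is a blended-price calculation, sketched below. The small-model price and the 5% escalation rate come from the scenario; the large-model price of $6.00 per million tokens is an assumption chosen purely for illustration and is not a quoted figure. The sketch also assumes that escalated conversations are billed only to the large model and that token counts per conversation are the same on both paths.

```python
def cascade_cost_per_million(small_price, large_price, escalation_rate):
    """Blended $ per million tokens when a fraction of traffic escalates."""
    return (1 - escalation_rate) * small_price + escalation_rate * large_price

SMALL_PRICE = 0.30       # $/M tokens, from the scenario
ESCALATION_RATE = 0.05   # from the scenario
LARGE_PRICE = 6.00       # $/M tokens, ASSUMED for illustration only

blended = cascade_cost_per_million(SMALL_PRICE, LARGE_PRICE, ESCALATION_RATE)
ratio = blended / LARGE_PRICE
print(f"blended price        : ${blended:.3f} per million tokens")
print(f"fraction of naive    : {ratio:.1%}")

# Sensitivity: how the saving erodes as more conversations escalate.
for rate in (0.01, 0.05, 0.20, 0.50):
    cost = cascade_cost_per_million(SMALL_PRICE, LARGE_PRICE, rate)
    print(f"escalation {rate:>4.0%} -> {cost / LARGE_PRICE:.1%} of large-model-only cost")
```

Under these assumptions the cascade costs roughly a tenth of sending everything to the large model, and the sweep shows that the saving is governed almost entirely by the escalation rate, which is the number worth measuring first.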

Warning: co-design optimizations are workload-specific

Inference-aware scaling assumes high inference volume; if you only serve a model a few hundred times, the math reverses. MoE routing benefits high-batch serving; at batch-1 the routing overhead eats the savings. Speculative decoding assumes a good draft model is available; if your task is so narrow that draft and target disagree often, the speculation accept-rate collapses and so do the savings. Profile before adopting; do not cargo-cult the patterns.
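The accept-rate warning can be quantified with the closed-form analysis in Leviathan et al. (2023). Assuming each drafted token is accepted independently with probability alpha, a draft length of gamma tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per target pass, and dividing by (gamma * c + 1), where c is the draft-to-target cost ratio, gives the expected wall-clock speedup. The values of gamma and c below are illustrative assumptions, not measurements.

```python
def expected_tokens_per_target_pass(accept_rate, gamma):
    """Expected tokens emitted per target pass, assuming i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return gamma + 1.0
    return (1 - accept_rate ** (gamma + 1)) / (1 - accept_rate)

def expected_speedup(accept_rate, gamma, cost_ratio):
    """Expected wall-clock speedup; cost_ratio is draft cost / target cost."""
    tokens = expected_tokens_per_target_pass(accept_rate, gamma)
    return tokens / (gamma * cost_ratio + 1)

GAMMA = 4          # drafted tokens per round (assumed)
COST_RATIO = 0.1   # draft pass costs 10% of a target pass (assumed)

for alpha in (0.2, 0.5, 0.8, 0.95):
    tokens = expected_tokens_per_target_pass(alpha, GAMMA)
    speedup = expected_speedup(alpha, GAMMA, COST_RATIO)
    print(f"accept rate {alpha:.2f}: {tokens:.2f} tokens/pass, speedup {speedup:.2f}x")
```

At low acceptance rates the speedup falls below 1, meaning the drafting overhead makes decoding slower than running the target alone; that is the collapse the warning describes, and it is why the accept rate should be measured on the actual workload before committing to the pattern.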

Research Frontier: Looking ahead

The unresolved co-design question of 2027 is the one this chapter has been building toward: can frontier-model training and serving be decoupled enough that decentralized training (Section 58.2) feeds centralized inference (Section 58.1), or do they remain co-located by economics? The answer determines whether the next decade of LLM infrastructure looks like cloud computing (centralized) or like the modern internet (distributed). Chapter 64 picks up the broader trajectory question: where on the 2027-2033 spectrum does AGI actually land, and what does the labor market look like along the way.

Key Takeaways

Inference-aware scaling shifts the optimum toward smaller models on more tokens (SmolLM2, Liquid LFM2.5, Llama-3.1 8B).
MoE pays training cost (all experts) for inference savings (sparse activation): DeepSeek V4, Qwen3-235B.
Speculative decoding with co-trained draft heads (Medusa, EAGLE) gives 2-4x throughput at no quality cost.
The right co-design pattern is workload-specific; profile before adopting any of them.

Self-Check

Q1. Under what condition does inference-aware scaling reverse direction (favoring larger models on fewer tokens)?

When total inference volume is low (research checkpoints, internal models with few users). The Sardana framework is amortization-driven; without high serving volume, training-side Chinchilla math dominates again.

Q2. Why does MoE's training/inference asymmetry not translate to all hardware?

MoE inference requires fast expert-routing memory access. Edge NPUs with limited per-expert bandwidth do not realize MoE's savings; the routing overhead and on-chip movement eat the wins. MoE pays off on HBM-equipped GPUs at high batch, not on phones.

What's Next?

In the next chapter, Chapter 59: Distributed Training Fundamentals, we continue building on the material from this chapter.

Further Reading

Sardana et al., "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws" (2024).

Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023).

Cai et al., "Medusa" (2024): co-trained draft heads.

Li et al., "EAGLE" (2024): the next-step speculative-decoding architecture.

Subramanian et al., "HERMES: Heterogeneous Inference Pipelines" (MIT, 2025).

Friendli, "MoE Models Comparison" (2026 reference).