Section 61.4

Models

"Dense, sparse, or somewhere awkwardly in between: pick the base checkpoint whose license your lawyer hasn't redlined yet."

FrontierFrontier, Base-Checkpoint-Shopping AI Agent
Big Picture

The model landscape for systems-at-scale work splits five ways. First, dense open-weight base models (Llama-3.1 / 3.2 / 4, Mistral-Large / Medium, Qwen2.5 / 3, Yi-Large, Falcon-180B, DBRX): the foundation checkpoints that production fine-tuning and self-hosted serving start from. Second, MoE (mixture-of-experts) checkpoints (DeepSeek-V3 / R1, Mixtral 8x22B, Snowflake Arctic, Grok-1.5): the sparse-activation architecture that has come to dominate the cost-per-FLOP frontier. Third, long-context models (Gemini 2.5 with multi-million context, Claude 4.5 with 1M, Yi-1.5 200K, Llama-4 long-context variants): the checkpoints engineered specifically for very long sequences. Fourth, small / distillation-target models (Llama-3.2 1B / 3B, Phi-3.5, Gemma-2 / 3 2B, SmolLM2): the checkpoints aimed at edge deployment that are also the right starting point for distillation. Fifth, the proprietary frontier (GPT-5, Claude Opus / Sonnet 4.5, Gemini 2.5 Pro, Grok-4): API-only models that anchor the frontier but cannot be self-hosted. The choice between these axes is driven by serving-cost economics (active parameter count), capability ceiling, license terms, and ecosystem support.

At training scale, the model choice has different implications than at agent scale. An agent picks a model and consumes it via API; a scale practitioner often starts from a base checkpoint and continues pretraining, fine-tunes, distills, or rebuilds from scratch. The relevant attributes therefore extend beyond benchmark performance: license terms (can you train on the weights and distribute the trained checkpoint?), architecture details (is the tokenizer compatible with your domain? what is the position embedding?), ecosystem support (does your training framework have a reference recipe? does your inference framework have an optimized kernel?), and provenance disclosure (what was the pretraining data, allowing or preventing certain downstream use cases).

61.4.1 Open-weight dense base models

The 2024-2026 dense open-weight base models form the foundation of most self-hosted and fine-tuned production deployments. The four families that dominate: Meta's Llama, Alibaba's Qwen, Mistral, and the increasingly active long tail (Yi, Falcon, Phi-large, DBRX).

61.4.2 Mixture-of-experts (MoE) open-weight models

MoE has emerged as the dominant frontier architecture in 2024-2026 because it decouples the model's total parameter count (which determines storage and pretraining FLOPs) from the active parameters per token (which determines inference cost). A 671B-total / 37B-active model has the inference cost of a 37B dense and the capability of something between a 70B and a 200B dense. This has reshaped the open-weight landscape.

Underneath the marketing names, every modern MoE layer follows the same gated routing equation. Given a token representation $x \in \mathbb{R}^{d}$, a learned router $W_r \in \mathbb{R}^{N_{\text{experts}} \times d}$ produces logits, the top-$k$ experts (typically $k=2$ for Mixtral or $k=8$ for DeepSeek-V3) are selected, and the output is the weighted sum of their expert FFNs:

$$ g(x) = \mathrm{softmax}\big(\mathrm{TopK}(W_r x,\,k)\big), \qquad y = \sum_{i \in \mathcal{T}_k(x)} g_i(x)\,\mathrm{FFN}_i(x). $$

Here $\mathcal{T}_k(x)$ is the set of indices of the $k$ largest logits. Only those $k$ FFNs are activated per token, which is why the active parameter count is roughly $(k / N_{\text{experts}}) \cdot P_{\text{ffn}}$ rather than the full $P_{\text{ffn}}$. The diagram below traces a single token's path through a two-of-eight router; the same pattern scales to DeepSeek-V3's 256 experts.

Top-2-of-8 MoE routing on a single token. The router emits softmax weights, the top-$k$ experts are activated, and their outputs are combined into $y$. The other six experts contribute nothing to this token, so the active FLOP count is $2/8$ of a dense layer of equivalent size.
Figure 61.4.1: Top-2-of-8 MoE routing on a single token. The router emits softmax weights, the top-$k$ experts are activated, and their outputs are combined into $y$. The other six experts contribute nothing to this token, so the active FLOP count is $2/8$ of a dense layer of equivalent size.
import torch, torch.nn as nn, torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal top-k MoE FFN layer, the building block under Mixtral / DeepSeek."""
    def __init__(self, d_model=512, d_ffn=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(),
                          nn.Linear(d_ffn, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (B, T, d_model)
        logits = self.router(x)                              # (B, T, N)
        topk_w, topk_i = logits.topk(self.k, dim=-1)         # (B, T, k)
        gates = F.softmax(topk_w, dim=-1)                    # renormalized
        y = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_i[..., slot]                          # (B, T)
            gate = gates[..., slot].unsqueeze(-1)            # (B, T, 1)
            # Dispatch each token to its chosen expert; in production this
            # is a fused all-to-all kernel (Megablocks / Tutel).
            for e_id, expert in enumerate(self.experts):
                mask = (idx == e_id).unsqueeze(-1)
                if mask.any():
                    y = y + mask * gate * expert(x)
        return y

Code Fragment 61.4.2: Reference top-$k$ MoE FFN. Production stacks replace the inner Python loop with the Megablocks block-sparse kernel or the DeepSpeed-MoE all-to-all dispatcher; the math is identical, the kernel just avoids materializing zero rows.

Numeric Example: Mixtral 8x7B Active Parameters

Mixtral 8x7B has $N_{\text{experts}} = 8$ FFNs of about 5.6B parameters each plus 4B of shared attention and embeddings, for a total of roughly $4 + 8 \times 5.6 \approx 49$B parameters. With $k = 2$ activated per token, the per-token active count is $4 + 2 \times 5.6 \approx 15$B. That is why Mixtral's inference latency on an A100 sits near a 13B dense Llama-2 while its capability tracks a 30B-to-70B dense. The same accounting applied to DeepSeek-V3 ($N = 256$ routed plus 1 shared, $k = 8$): $P_{\text{total}} = 671$B, $P_{\text{active}} = 37$B, ratio $37/671 \approx 5.5\%$, which is the sparsity that lets a frontier-class checkpoint serve at mid-tier dense cost.

61.4.3 Long-context models

Context length has become a primary capability axis in 2024-2026, with the frontier crossing the 1M-token threshold and 10M+ becoming feasible. The relevant models:

61.4.4 Small models and distillation targets

The 2024-2026 small-model class (under 4B parameters) has become a serious deployment category, driven by edge deployment, mobile inference, and distillation pipelines. Models in this class often serve as distillation targets (a larger teacher producing trained data for the smaller student) rather than being trained from scratch.

61.4.5 Frontier proprietary (API-only) models

The proprietary frontier in 2026 is dominated by Anthropic Claude, OpenAI GPT, Google Gemini, and xAI Grok, all available only via API. Their scale-architecture details are partially disclosed in technical reports and observed behavior.

61.4.6 Comparing the 2026 model families

Table 61.4.1a: 61.4.1 Open-weight base models (mid-2026), serving footprint and license.
Model Architecture Active / Total Context License
Llama-3.1 405B Dense 405B / 405B 128K Llama (MAU restriction)
Llama-3.1 70B Dense 70B / 70B 128K Llama
Llama-4 Maverick MoE 17B / 400B 1M Llama
Llama-4 Scout MoE 17B / 109B 10M Llama
DeepSeek-V3 MoE 37B / 671B 128K DeepSeek (commercial OK)
Mixtral 8x22B MoE 39B / 141B 64K Apache 2.0
Mistral Large 2 Dense 123B / 123B 128K Mistral Research (commercial: paid)
Qwen2.5 72B Dense 72B / 72B 128K Qwen (commercial OK)
Qwen3 72B Dense 72B / 72B 128K Apache 2.0
Yi-1.5 34B (200K) Dense 34B / 34B 200K Yi (commercial OK)
Falcon-180B Dense 180B / 180B 2K (limited) Falcon-180B TII License
DBRX MoE 36B / 132B 32K Databricks Open Model

61.4.7 Serving footprint and hardware fit

Choosing a model for a self-hosted production deployment is heavily constrained by the hardware footprint required to serve it. The 2026 rough cost-and-hardware sizing:

Key Insight
MoE memory: total parameters live, active parameters compute

The most important MoE concept for serving sizing: all the experts must be in GPU memory (or fast-tiered storage), even though only a few activate per token. DeepSeek-V3's 671B total parameters require ~1.3TB of FP16 memory across the experts, regardless of how few are active per token. The active-parameter advantage is in compute (and therefore latency and throughput per query), not in memory. The 2025-26 trend toward expert offloading (keeping cold experts on CPU RAM or NVMe and paging in on demand) is partial relief but introduces latency tradeoffs. When sizing MoE deployments, plan for total-parameter memory; when planning throughput, plan for active-parameter compute.

Library Shortcut
vLLM as the default self-hosted inference engine

Once you have picked a model from the table above, vllm (Kwon et al., Berkeley, 2023+) is the canonical serving runtime. PagedAttention solves KV cache fragmentation, continuous batching keeps the GPU saturated across heterogeneous request lengths, and the OpenAI-compatible HTTP server means existing clients work unmodified. Move to SGLang when you need prefix-cache-heavy chat workloads or RadixAttention; reserve TensorRT-LLM for NVIDIA enterprise deployments where peak per-GPU throughput is the gating metric.

Show code
pip install "vllm>=0.6.0"
# Option A: one-line OpenAI-compatible server
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 --quantization fp8 --max-model-len 32768

# Option B: programmatic use in Python
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=4, quantization="fp8")
outputs = llm.generate(
    ["Summarize the news of the day."],
    SamplingParams(temperature=0.7, max_tokens=512),
)
print(outputs[0].outputs[0].text)
Code Fragment 61.4.1b: Add --enable-prefix-caching for RAG and multi-turn workloads where the system prompt repeats; the prefix cache pays back its KV memory cost within a few requests.

61.4.8 Mapping the 2026 model landscape

Mid-2026 LLM model landscape grouped by serving footprint into frontier proprietary, frontier open-weight, mid-size open-weight, multimodal, and reasoning specialist categories with representative releases per group
Figure 61.4.2a: The 2026 LLM model landscape grouped by serving footprint, from edge-class small models through mid-sized open weights up to flagship closed and frontier models that need multi-node hosts.

61.4.9 Distillation pipelines and model derivation

A second-order use of these checkpoints is as teachers in distillation pipelines: large model generates supervised training data, smaller model is trained on it, the resulting small model often substantially outperforms a small model trained directly on the original corpus. The 2024-2026 reference patterns:

61.4.10 Licensing and commercial use

The 2026 open-weight licensing landscape has three rough tiers:

The 2026 standard practice is to keep at least two open-weight checkpoints in active production support (e.g., Llama-3.1 70B and Qwen2.5 72B) so that a downstream license change does not strand the deployment.

61.4.11 Multimodal base models

The 2024-2026 multimodal landscape is increasingly tied to the language-model landscape, with most frontier and open-weight families releasing vision-language variants.

61.4.12 Fine-tuning readiness and ecosystem maturity

The 2026 ecosystem readiness of each base model determines how easily a production team can adapt it. The major dimensions:

The ecosystem-maturity-versus-capability tradeoff is real: a newer, more capable model with weaker tooling may net to slower delivery than an older, slightly-less-capable model with mature tooling. Production teams typically pick along the Pareto frontier (Llama-3.1 70B and Qwen2.5 72B are typical 2026 picks because they are simultaneously capable and well-tooled).

61.4.13 Cost curves and the 2026 model economics

The 2026 model economics from the deployment side reveal an interesting pattern. API-only frontier models (Claude Opus 4.5, GPT-5, Gemini 2.5 Pro) have per-1M-token costs in the rough $3 to $20 range for input, $15 to $75 for output. Self-hosted open-weight models on rented infrastructure can be substantially cheaper at high utilization: Llama-3.1 70B on a single H100 80GB at $2/hour serving 1M tokens of mixed input/output works out to perhaps $0.50 to $1.50 per 1M tokens at high batch utilization. DeepSeek-V3 self-hosted is similar to Llama 70B per-query but with substantially higher minimum capacity (the 671B-total parameters require multi-GPU infrastructure that idles expensively at low utilization).

The crossover utilization where self-hosted becomes cheaper than API depends sharply on the model size and the hardware. The rough 2026 rule of thumb:

Plan the capacity model before picking the model. The model that wins on a benchmark may not win on $/token at your actual usage profile.

61.4.14 Model evaluation checklist

When picking a base model for production fine-tuning or deployment, the questions that matter most:

Real-World Scenario
Choosing between Llama-3.1 70B and DeepSeek-V3 in 2026

A 2026 production team was deciding between Llama-3.1 70B (dense, 70B active) and DeepSeek-V3 (MoE, 37B active / 671B total) for self-hosted chat-assistant deployment. The evaluation: (a) DeepSeek-V3 had higher per-query capability on the team's benchmarks; (b) DeepSeek-V3 required substantially more total GPU memory (8x H200 versus 2x H100); (c) DeepSeek-V3 had per-query inference cost competitive with the much smaller Llama 70B once amortized over the larger memory footprint. The deciding factor was query volume: at low query volume, the memory-cost overhead of DeepSeek-V3 dominated (idle GPUs are expensive), and Llama 70B was cheaper per dollar of total infrastructure; at high query volume, the per-query throughput advantage of DeepSeek-V3 paid back. The team chose Llama 70B for v1 (low initial query volume) with a planned migration to DeepSeek when query volume justified it. The lesson generalizes: MoE economics depend on utilization, and a model with lower per-query active parameters but higher total parameters is the right choice only when you can keep the experts hot.

Warning: Open-weight models are not unmaintained

A subtle 2026 issue is that "open weights" does not mean "free of vendor maintenance dependencies." Even fully open models depend on the originating lab continuing to release updated versions, vulnerability fixes (e.g., safety-related re-training), and ecosystem integrations. A team that picks Yi-1.5 in 2024 and finds that 01.AI's release cadence has slowed in 2026 is effectively maintaining the model themselves. The 2026 production-safer practice is to track each open-weight model's release cadence as a vendor health metric, and to keep at least one alternative in active production-readiness so a downstream slowdown does not strand you. The closed frontier APIs have their own version of this risk: model deprecation announcements (OpenAI in particular has deprecated specific GPT-3 and GPT-3.5 endpoints with limited notice) require contingency planning.

What's Next?

In the next section, Section 61.5: External Reading and Communities, we build on the material covered here.

Further Reading
Dubey, A. et al. (2024). "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783. arxiv.org/abs/2407.21783. The Llama 3 technical report; the canonical reference for the 8B / 70B / 405B model family and its training recipe.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv preprint arXiv:2412.19437. arxiv.org/abs/2412.19437. The DeepSeek-V3 paper documenting the 671B MoE architecture, FP8 training stack, and multi-token prediction.
DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948. arxiv.org/abs/2501.12948. The DeepSeek-R1 paper; the canonical reference for RL-based reasoning training and the R1-Distill family.
Jiang, A.Q. et al. (2024). "Mixtral of Experts." arXiv preprint arXiv:2401.04088. arxiv.org/abs/2401.04088. The Mixtral 8x7B paper; the canonical reference for the sparse MoE architecture that anchored 2024 open MoE work.
Yang, A. et al. (2025). "Qwen2.5 Technical Report." arXiv preprint arXiv:2412.15115. arxiv.org/abs/2412.15115. The Qwen2.5 technical report; canonical reference for the family that anchors much of the open-weight Asian-research-community ecosystem.
Abdin, M. et al. (2024). "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." arXiv preprint arXiv:2404.14219. arxiv.org/abs/2404.14219. The Phi-3 technical report; the canonical reference for the synthetic-data-heavy small-model training approach.