Models

Section 10.10

"A quantized model is still the same model, just one that learned to mumble. The trick is making sure it mumbles the right answer."

QuantQuant, Bit-Whisperer AI Agent

The 2026 model zoo splits cleanly into three layers. The first is the closed-API frontier: the GPT-5 family (Aug 2025), the Claude 4 family (Opus 4.5 / Sonnet 4.5 / Haiku 4.5, 2025), and Gemini 2.5 Pro and its 2025-26 successors. You can call them but never run them. The second is the open-weight frontier: Llama-3.3 and the Llama 4 family (Scout / Maverick / Behemoth previewed in 2025), Qwen3, DeepSeek-V3 / R1 / V3.1, Kimi K2, GLM-4.5, GPT-OSS, and the Mistral families. You can download them and run them anywhere your hardware allows. The third is the small open-weight tier: Pythia, SmolLM2, TinyLlama, Phi-4. You can fine-tune and inspect them on a laptop. Each layer is useful for different reasons; this section catalogues all three.

The frontier is moving fast: a model that was unbeatable in January 2026 is likely a midrange option by December. The list below names the families that mattered as of the book's print date and the URLs that will keep tracking the leaderboards as the line moves.

2026 model zoo: three layers from closed API to open frontier to small open
Figure 10.10.1: The 2026 LLM model zoo organized by three deployment archetypes. The closed-API frontier (GPT-5 family with o3 and o4-mini, Claude 4 family up to Opus 4.5, Gemini 2.5 Pro with Deep Think, Grok 4) is callable but unreachable for weight inspection. The open-weight frontier (Llama-3.3 and Llama 4, Qwen3 with 235B MoE, DeepSeek-V3 and R1 and V3.1, Mistral and Magistral, Gemma 3, Kimi K2 at 1T parameters, GLM-4.6, OpenAI's GPT-OSS, Swiss-AI Apertus) is downloadable and self-hostable. The small-open and mech-interp tier (Pythia with full intermediate checkpoints, SmolLM2, Phi-4, BitNet b1.58) supports laptop-scale research, scaling-laws replication, and CPU-only inference.

10.10.1 The closed-API frontier

Fun Fact

The first GPT paper had no name on the model card, no system prompt, no chat interface, and zero users. By the time GPT-5 shipped, OpenAI was running a model lineup so dense (GPT-4o, 4.5, 5, o3, o4-mini) that engineers maintained internal spreadsheets just to remember which one to call for which task. The lesson is that frontier "models" are now product lines, and the leaderboard you're reading was probably already out of date by the time the print run hit the warehouse.

10.10.2 The open-weight frontier

10.10.3 The mech-interp and small-open tier

10.10.4 Comparing the frontier model families

Table 10.10.1a: 12.4.1 2025-26 frontier model families.
FamilyTop variant (mid-2026)AccessContextApprox $/1M in (USD)Strength
OpenAI GPTGPT-5 family / o3API only200K-1M$2.50-$15 input, more for thinking tokensReasoning, code, multimodal
Anthropic ClaudeClaude Opus 4.5API only200K-1M$15 input / $75 output (Opus); cheaper for Sonnet / HaikuAgentic, long-form writing
Google GeminiGemini 2.5 ProAPI only1M-2M$1.25-$10 inputMultimodal, long-context
Meta LlamaLlama 4 familyOpen weights (LCL)128K-1M+$0 (self-host); $0.50-$3 via providersMost-fine-tuned base
Alibaba QwenQwen3-235BOpen weights (Apache 2.0)128K-1M$0 (self-host); $0.30-$2 via providersMultilingual, MoE, hybrid reasoning
DeepSeekDeepSeek-V3.1 / R1Open weights (MIT)128K$0 (self-host); $0.14-$1.10 via DeepSeek APICost-efficient reasoning frontier
Key Insight
Hybrid reasoning toggles are the 2025-26 frontier pattern

The defining architectural trend of 2025 is the unification of "fast" and "thinking" modes under a single model. Qwen3 exposes enable_thinking=True on the chat template; Claude 3.7+ and Claude 4 ship an extended-thinking parameter; GPT-5 ships an adaptive "think harder" tier where the model chooses how long to reason per query; Gemini 2.5 Pro adds Deep Think. The practical consequence: prompting in 2026 means deciding whether to let the model think, not just what to ask. Thinking tokens cost more (sometimes 3-5x the completion-token price on closed APIs) and add latency. Default to thinking off for retrieval and summarization; default to thinking on for math, code, and multi-step planning.

Key Insight
Why you should learn on open models, not API models

You cannot inspect, fine-tune, or quantize a closed-API model. You can only call it. Every meaningful technique in Chapters 8 through 11 (mech-interp, activation patching, embedding inspection, attention visualization) requires weights you control. The frontier closed-API models are the right tools for production in Part III; the open-weight families are the right tools for understanding here.

Library Shortcut: loading three open-weight families
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantized, fits on 24 GB. The bare load_in_4bit=True flag is
# deprecated as of transformers 4.x late 2024; use BitsAndBytesConfig.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model_id = "meta-llama/Llama-3.3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Same pattern, different family:
# model_id = "Qwen/Qwen3-7B-Instruct"
# model_id = "deepseek-ai/DeepSeek-V3.1"
Code Fragment 10.10.1b: 4-bit quantized, fits on 24 GB.

The AutoModelForCausalLM abstraction is what makes all three loadable with the same five lines. Behind it, transformers picks the right model class (LlamaForCausalLM, Qwen3ForCausalLM, DeepseekV3ForCausalLM) automatically.

10.10.5 Picking a default for Chapter exercises

For Chapters 7 to 11: Qwen3-7B-Instruct (open weight, Apache 2.0, fits in 24 GB at 4-bit) is the recommended default base. For mech-interp work in Chapter 10: Pythia-160M or Pythia-1.4B (small enough to load fully into TransformerLens, full intermediate checkpoints available). For "what does a frontier model do" comparisons: call the current GPT-5 family / Claude Opus 4.5 / Gemini 2.5 Pro through their APIs.

Real-World Scenario
DeepSeek-R1 and the open-reasoning recipe

The Jan 2025 release of DeepSeek-R1 (671B MoE, MIT-licensed) plus the GRPO training recipe is the most-replicated open recipe of 2025. The disclosed V3 base training cost (~$5.6M) and the R1-Zero "RL from verifiable rewards" pipeline together set a new floor for what an open team can reproduce. Every reasoning-fine-tuned open model since (Sky-T1, Open-R1, Bespoke-Stratos) traces its lineage to this release. The headline lesson for Part II: an open team published a frontier-grade reasoning model and the entire training recipe within the same week.

Quantization for Serving

Big Picture

This section is a practical companion to the quantization theory in Chapter 10. It focuses on the serving-specific workflow: how to load pre-quantized models in vLLM, TGI, and SGLang; how to convert models to GGUF for llama.cpp; and how to choose the right format for your deployment target.

Prerequisites

This section assumes the quantization formats covered in Section 10.1 through Section 10.3 and the inference-stack platforms in Section 10.6. The open-versus-closed licensing landscape is revisited later in the book.

See Also

For the mathematics of quantization (absmax, zero-point, per-group schemes), the algorithms behind GPTQ, AWQ, and bitsandbytes, calibration strategies, quality degradation analysis, and hands-on quantization labs, see Section 9.1: Model Quantization. This section assumes you have already quantized your model (or downloaded a pre-quantized checkpoint) and focuses on loading and serving it.

1. Serving GPTQ and AWQ Models

Both vLLM and TGI natively support GPTQ models. Point the server at a GPTQ-quantized model and specify the quantization format.

GPTQ vs AWQ: two recipes, same 4-bit serving target
Figure 10.10.2: GPTQ and AWQ pursue the same 4-bit target with different mechanics. GPTQ minimizes per-layer output error by sequentially rounding each weight column and using the inverse Hessian to compensate the rounding error onto not-yet-quantized columns. AWQ instead identifies which weight channels matter most (by measuring activation magnitudes on calibration data) and applies a per-channel scaling factor that gives those channels more of the available 4-bit range. Both achieve roughly the same perplexity at 4-bit; AWQ runs about 4x faster and is the pragmatic choice for production pipelines that quantize many models.

The GPTQ update rule that drives the column-by-column loop above is:

$$\delta_j = \frac{w_j - \text{quant}(w_j)}{[H^{-1}]_{jj}}, \qquad w_k \leftarrow w_k - \delta_j \cdot [H^{-1}]_{jk} \text{ for } k > j$$

Here $H = X^{T} X$ is the Hessian of the layer's squared error against calibration activations $X$. The first term rounds the current column, the second redistributes the rounding error onto remaining columns to keep the layer's output close to the FP16 reference. AWQ's complementary objective, in contrast, picks the scale that minimizes the activation-weighted quantization error:

$$s^{*} = \arg\min_{s} \mathbb{E}\big[\, \| \text{quant}(W \cdot \text{diag}(s)) \cdot \text{diag}(s)^{-1} \cdot x - W x \|^{2} \,\big]$$

The activation-weighting is what makes AWQ "activation-aware": channels that consistently produce large $x$ get larger $s$, hence more of the 4-bit dynamic range.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# End-to-end GPTQ quantization of Llama-3.1-8B to 4 bits with C4 calibration.
# AutoGPTQ runs the per-column Hessian compensation; transformers wraps the loop.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)

gptq_cfg = GPTQConfig(
    bits=4,                # target precision
    group_size=128,        # per-128-weight scale, standard sweet spot
    desc_act=True,         # process columns in activation order for quality
    dataset="c4",          # 128 samples of WebText-style English by default
    tokenizer=tok,
)

# This call triggers the actual quantization (~10-30 min on a single A100).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_cfg,
    device_map="auto",
)
model.save_pretrained("./llama-3.1-8b-gptq-int4")
tok.save_pretrained("./llama-3.1-8b-gptq-int4")
# The saved directory now contains INT4 weights + FP16 scales + zero-points,
# ~5.5 GB total (vs ~16 GB FP16). vLLM / TGI can load it directly:
#   vllm serve ./llama-3.1-8b-gptq-int4 --quantization gptq

Code Fragment 10.10.2: Quantizing Llama-3.1-8B to 4 bits with AutoGPTQ via Hugging Face Transformers. The desc_act=True flag processes columns in descending order of activation magnitude, which lets the Hessian compensation absorb more error in the earlier-quantized columns. The result is a single self-contained checkpoint that vLLM, TGI, and SGLang can serve with a one-flag change.

Numeric Example
GPTQ vs AWQ vs FP16 on Llama-3.1-8B (A100, WikiText-2)

A side-by-side run on a single A100, calibration = 128 C4 samples, generation = greedy, batch = 1:

  • FP16 baseline: 6.14 perplexity, 42.3 tokens/sec, 16.1 GB VRAM.
  • GPTQ 4-bit (group 128, desc_act): 6.41 perplexity (+0.27, +4.4%), 95.7 tokens/sec (+126% throughput), 4.8 GB VRAM (3.4x smaller).
  • AWQ 4-bit (GEMM kernel): 6.38 perplexity (+0.24, +3.9%), 102.3 tokens/sec (+142% throughput), 4.5 GB VRAM (3.6x smaller).

For a 70B model the same recipes typically lose less than 2% perplexity at 4-bit; the quality gap shrinks with scale because larger models have more redundancy to absorb the quantization noise. The throughput win comes from cutting weight-memory traffic by 4x; LLM decode is memory-bandwidth-bound (see Section 9.3), so 4x fewer bytes to read translates directly into roughly 2x more tokens per second.

# Serve a GPTQ model with vLLM
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --max-model-len 4096

# Serve a GPTQ model with TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantize gptq
Code Fragment 10.10.3: Serve a GPTQ model with vLLM
Tip

AWQ with the GEMM kernel version is generally recommended for serving scenarios where batch sizes exceed 1. The GEMV version is optimized for single-request (batch size 1) latency. If your workload mixes both, GEMM is the safer default. See Section 9.1 for the full AWQ quantization procedure.

2. GGUF and llama.cpp: CPU-Friendly Serving

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the popular C/C++ inference engine that runs LLMs on CPUs, Apple Silicon, and consumer GPUs. Unlike GPTQ and AWQ, which target NVIDIA GPUs and CUDA kernels, GGUF is designed for portability. It supports a wide range of quantization levels and can split computation between CPU and GPU.

GGUF supports quantization types ranging from Q2_K (roughly 2.5 bits per weight) to Q8_0 (8 bits per weight). The naming convention indicates the bit width and the quantization scheme. The most commonly used levels for serving are Q4_K_M and Q5_K_M, which balance quality and size.

GGUF Type Bits/Weight Model Size (7B) Quality Impact
Q2_K~2.5~2.8 GBNoticeable degradation; useful for experimentation only
Q3_K_M~3.5~3.5 GBModerate degradation; acceptable for simple tasks
Q4_K_M~4.5~4.4 GBMinimal degradation; recommended for most use cases
Q5_K_M~5.5~5.1 GBNear-lossless; best quality-to-size ratio
Q6_K~6.5~5.9 GBVery close to FP16 quality
Q8_08.0~7.2 GBVirtually lossless
Figure 10.10.3: GGUF quantization levels for a 7B parameter model. Q4_K_M offers the best balance of size and quality for most serving scenarios.

To convert a Hugging Face model to GGUF format, use the conversion script included with llama.cpp.

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (with CUDA support for GPU offloading)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Convert a HuggingFace model to GGUF
python convert_hf_to_gguf.py \
    /path/to/meta-llama/Llama-3.1-8B-Instruct \
    --outfile llama-3.1-8b.gguf \
    --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize llama-3.1-8b.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M
Code Fragment 10.10.4: Clone llama.cpp

2.1 Running a GGUF Server

llama.cpp includes a built-in HTTP server with an OpenAI-compatible API. This makes it a viable serving option for environments without NVIDIA GPUs.

# Start the llama.cpp server with GPU offloading
./build/bin/llama-server \
    -m llama-3.1-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    --threads 8
Code Fragment 10.10.5: Start the llama.cpp server with GPU offloading

The -ngl 99 flag offloads all layers to GPU. On systems without a GPU, omit this flag and llama.cpp will run entirely on CPU, using SIMD optimizations (AVX2, AVX-512, ARM NEON) for acceptable performance on modern processors.

3. bitsandbytes for Serving

See Also

For the bitsandbytes NF4 data type, double quantization, and complete code examples, see Section 9.1: Model Quantization (subsection 5).

TGI supports bitsandbytes quantization at startup with --quantize bitsandbytes-nf4, quantizing on the fly without a pre-quantized checkpoint. This is convenient for prototyping but produces 30% to 50% lower throughput than pre-quantized GPTQ or AWQ models due to the lack of optimized CUDA kernels. For production serving, prefer GPTQ or AWQ.

4. Comparison: Choosing the Right Format for Serving

The choice of quantization method depends on your deployment environment, performance requirements, and workflow. The following table provides a side-by-side comparison.

Table 10.10.2: Quantization Methods for LLM Serving (as of 2026).
Aspect GPTQ AWQ GGUF bitsandbytes
Precision4-bit, 8-bit4-bit2-bit to 8-bit4-bit, 8-bit
Calibration requiredYes (128+ samples)Yes (small dataset)NoNo
Quantization speed (Sec 9.1.2)Slow (hours for 70B)Moderate (30-60 min)Fast (minutes)Instant (on load)
Serving enginesvLLM, TGI, SGLangvLLM, TGI, SGLangllama.cpp, OllamaTGI, HF Transformers
HardwareNVIDIA GPU onlyNVIDIA GPU onlyCPU, Apple Silicon, GPUNVIDIA GPU only
ThroughputHighHighModerate (GPU), Low (CPU)Moderate
Quality (4-bit)Very goodVery goodGood (Q4_K_M)Good
Best forGPU serving at scaleGPU serving at scaleCPU/edge deploymentPrototyping, fine-tuning
Real-World Scenario
Maintaining AWQ and GGUF Copies of One Model

A common production pattern is to maintain two quantized versions of the same model: an AWQ version for your GPU-based serving cluster (using vLLM or TGI) and a Q4_K_M GGUF version for local developer testing (using llama.cpp or Ollama). The quality difference between these formats at 4-bit is typically less than 1% on standard benchmarks, ensuring consistent behavior between development and production.

Summary

Quantization is an essential technique for making LLM serving practical and affordable. GPTQ and AWQ provide the highest throughput on NVIDIA GPUs through optimized CUDA kernels, making them ideal for production serving with vLLM and TGI. GGUF and llama.cpp open the door to CPU and edge deployment with remarkable efficiency. bitsandbytes offers the lowest barrier to entry for experimentation. In the next section, we examine how to scale these serving solutions horizontally and balance load across multiple GPU instances.

What's Next?

In the next section, Section 10.11: External Reading & Communities, we build on the material covered here.

Further Reading

Foundational LLMs

Biderman, S., Schoelkopf, H., Anthony, Q. G., et al. (2023). "Pythia: A Suite for Analyzing Large Language Models." ICML 2023. arXiv:2304.01373. The most-used model suite for interpretability research.
Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Reference open-weight model with widely-available interpretability hooks.
Dubey, A., Jauhri, A., Pandey, A., et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. The detailed Llama 3 technical report covering architecture, data, and training; the canonical reference for the current open-weight frontier.
Groeneveld, D., Beltagy, I., Walsh, P., et al. (2024). "OLMo: Accelerating the Science of Language Models." ACL 2024. arXiv:2402.00838. Fully-open LLM with intermediate checkpoints; reference for training-dynamics interpretability.

Closed-Frontier Reports

OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. The reference technical report for the closed-API frontier era; documents capability evaluations even when architecture details are withheld.
Gemini Team, Google (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805. Google's flagship closed-frontier multimodal model; the canonical citation for comparing Gemini against GPT-4 and Claude.

Quantization for Serving

Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. arXiv:2208.07339. The reference paper for int8 quantization with outlier-aware decomposition; the basis of bitsandbytes 8-bit serving.
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pretrained Transformers." ICLR 2023. arXiv:2210.17323. Reference for 4-bit post-training quantization; the underlying algorithm behind most production GGUF and AWQ inference stacks.