About This Section
This section covers specific models, model families, neural network components, and architectural patterns. Each entry links to the chapter where the architecture is explained in depth.
- Activation Function
- A nonlinear function (such as ReLU, GELU, or SiLU) applied after a linear transformation in a neural network layer. Without activation functions, stacking layers would be equivalent to a single linear transformation, so they are what give networks the capacity to learn complex patterns.
- See Section 00.2 (Neural Network Building Blocks)
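A minimal pure-Python sketch of three common activations (the tanh approximation of GELU is used here; the function names are illustrative, not from any particular library):

```python
import math

def relu(x):
    # Zero for negatives, identity for positives: simple but has a hard kink at 0.
    return max(0.0, x)

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def silu(x):
    # SiLU (Swish): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))
```

Unlike ReLU, GELU and SiLU are smooth and take small nonzero values for small negative inputs.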
- Autoregressive Model
- A model that generates output one token at a time, conditioning each new token on all previously generated tokens. GPT-family models and most modern LLMs use autoregressive decoding. This left-to-right generation contrasts with masked models like BERT that see the full input at once.
- See Section 05.1 (Autoregressive Generation)
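A minimal sketch of greedy autoregressive decoding; `next_token` here is a hypothetical stand-in for a real model's argmax over next-token logits:

```python
def next_token(context):
    # Hypothetical toy "model": the next token is the last token plus one.
    return context[-1] + 1

def generate(prompt, max_new_tokens, eos=4):
    # Append one token at a time, conditioning on everything generated so far,
    # until an end-of-sequence token or the length budget is hit.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens
```

The loop structure is the same in real systems; only the `next_token` call (a forward pass plus sampling) differs.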
- BERT (Bidirectional Encoder Representations from Transformers)
- A pretrained encoder-only transformer from Google (2018) that processes text bidirectionally using masked language modeling. BERT excels at classification, named-entity recognition (NER), and other understanding tasks where the model needs to see the full input before making a decision.
- See Section 07.1 (Model Families)
- Causal Language Model
- A model trained with a next-token prediction objective and a causal attention mask that prevents attending to future positions. This is the standard architecture behind GPT-style models: the model can only "look left" when predicting each token.
- See Section 06.1 (Pretraining Objectives)
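The causal mask can be sketched in a few lines; in practice the mask is applied by setting disallowed attention scores to negative infinity before the softmax (a minimal illustrative version):

```python
def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_causal_mask(scores):
    # Set scores for future positions to -inf so softmax assigns them
    # exactly zero attention weight.
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]
```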
- Encoder-Decoder Architecture
- A transformer design with separate encoder and decoder stacks: the encoder processes the full input, and the decoder generates output while attending to the encoder's representations. T5 and BART are encoder-decoder models, well suited for translation and summarization.
- See Section 04.3 (Encoder-Decoder vs. Decoder-Only)
- Feed-Forward Network (FFN)
- A two-layer neural network applied independently to each token position after the attention step in a transformer layer. Research suggests the FFN is where the model stores much of its factual knowledge, acting as a key-value memory.
- See Section 04.2 (Transformer Components)
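A toy sketch of the up-project / nonlinearity / down-project pattern, in pure Python with tiny illustrative weights (real FFNs use large learned matrices and often a gated variant such as SwiGLU):

```python
import math

def gelu(x):
    # Tanh approximation of GELU, a common FFN nonlinearity.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def matvec(w, x):
    # Plain matrix-vector product over nested lists.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w1, b1, w2, b2):
    # Up-project, apply nonlinearity, down-project; in a transformer this
    # runs independently on each token position's vector.
    h = [gelu(v + b) for v, b in zip(matvec(w1, x), b1)]
    return [v + b for v, b in zip(matvec(w2, h), b2)]
```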
- GELU (Gaussian Error Linear Unit)
- A smooth activation function that weights each input by the Gaussian cumulative distribution function of its value, serving as the default nonlinearity in the GPT and BERT families. It outperforms ReLU on language tasks in part because it stays nonzero and differentiable for small negative inputs, preserving gradient flow.
- See Section 04.2 (Transformer Components)
- GPT (Generative Pre-trained Transformer)
- A family of autoregressive, decoder-only transformers from OpenAI. From GPT-1 (2018) through GPT-4 and beyond, this series demonstrated that scaling language models yields increasingly capable general-purpose AI systems, establishing the dominant paradigm for modern LLMs.
- See Section 07.1 (Model Families)
- Grouped Query Attention (GQA)
- An attention variant that shares key-value heads across groups of query heads, reducing KV cache size while preserving most of full multi-head attention's quality. GQA is used in Llama 2, Llama 3, and Mistral to make long-context inference practical.
- See Section 04.1 (Self-Attention Mechanisms)
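The core bookkeeping in GQA is an index mapping from query heads to shared KV heads; a sketch (the 32-query-head / 8-KV-head shape below matches configurations such as Llama 3 8B):

```python
def kv_head_for_query(q_head, num_q_heads, num_kv_heads):
    # Consecutive query heads form groups; each group reads the same
    # key-value head, shrinking the KV cache by num_q_heads / num_kv_heads.
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size
```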
- Layer Normalization
- A technique that normalizes activations across the feature dimension of each token, keeping values in a consistent range to stabilize training. Most modern transformers use pre-layer normalization (applying norm before attention) with the more efficient RMSNorm variant.
- See Section 04.2 (Transformer Components)
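Layer normalization of a single token's feature vector can be sketched in pure Python (`gamma` and `beta` are the conventional names for the learned scale and shift):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Subtract the mean, divide by the standard deviation (plus a small
    # epsilon for stability), then apply learned scale and shift.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]
```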
- LLM (Large Language Model)
- A neural network with billions of parameters trained on vast text corpora using self-supervised learning. LLMs exhibit emergent capabilities such as in-context learning, reasoning, and instruction following. Prominent examples include GPT-4, Claude, Gemini, and Llama.
- See Chapter 07 (The Modern LLM Landscape)
- Mixture of Experts (MoE)
- An architecture that replaces a single feed-forward layer with multiple "expert" sub-networks and a gating mechanism that routes each token to a small subset. MoE scales total parameter count while keeping per-token inference cost low. Mixtral and DeepSeek-V3 use this approach.
- See Section 07.3 (Open-Source Models)
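A toy sketch of top-k routing for a single token, with hypothetical scalar "experts"; in a real layer the gate scores come from a learned router and each expert is a full FFN:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_route(token, experts, gate_scores, top_k=2):
    # Pick the top_k experts by gate probability, renormalize their
    # weights, and mix only the selected experts' outputs.
    probs = softmax(gate_scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return sum(probs[i] / total * experts[i](token) for i in top)
```

Only `top_k` experts run per token, which is why total parameters can grow without a matching growth in per-token compute.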
- Multi-Head Attention (MHA)
- An attention mechanism that runs multiple parallel attention computations (heads), each with its own learned projections, then concatenates the results. Multiple heads let the model simultaneously attend to different types of relationships (syntax, semantics, coreference).
- See Section 04.1 (Self-Attention Mechanisms)
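A compact pure-Python sketch of the split / attend-per-head / concatenate pattern (the learned input and output projections of a real layer are omitted for brevity):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

def attend_one_head(q, k, v):
    # Scaled dot-product attention for one head: q, k, v are lists of
    # per-position vectors over a short sequence.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wi * vj[t] for wi, vj in zip(w, v))
                    for t in range(len(v[0]))])
    return out

def multi_head(q, k, v, num_heads):
    # Split each vector into num_heads chunks, attend per head in
    # parallel, then concatenate the per-head results.
    d = len(q[0])
    h = d // num_heads
    split = lambda xs, i: [x[i * h:(i + 1) * h] for x in xs]
    heads = [attend_one_head(split(q, i), split(k, i), split(v, i))
             for i in range(num_heads)]
    return [[val for head in heads for val in head[pos]]
            for pos in range(len(q))]
```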
- Positional Encoding
- A mechanism that injects sequence order information into transformer inputs, since self-attention treats tokens as an unordered set. Methods range from the original sinusoidal encodings to modern approaches like RoPE (Rotary Position Embedding).
- See Section 04.4 (Positional Encodings)
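The original sinusoidal scheme can be sketched directly from the paper's formula, with each dimension pair using one frequency:

```python
import math

def sinusoidal_encoding(pos, d_model):
    # Each (sin, cos) dimension pair uses a different wavelength, so every
    # position receives a unique pattern that can be added to token embeddings.
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe[:d_model]
```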
- Reasoning Model
- A model trained to perform extended step-by-step reasoning, often using a "thinking" phase before producing a final answer. Examples include OpenAI's o1/o3 and DeepSeek-R1. These models trade additional inference compute for substantially improved performance on math, science, and logic tasks.
- See Section 07.4 (Reasoning and Frontier Models)
- RMSNorm (Root Mean Square Normalization)
- A simplified normalization that scales activations by their root mean square, omitting the mean-centering step of standard layer norm. RMSNorm is used in Llama, Mistral, and many modern architectures because it is faster while performing equally well.
- See Section 04.2 (Transformer Components)
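A sketch of RMSNorm for one token's feature vector; note the absence of the mean subtraction that standard layer norm performs:

```python
import math

def rms_norm(x, gamma=None, eps=1e-6):
    # Divide by the root mean square of the vector and apply a learned
    # scale; no mean is computed or subtracted.
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    gamma = gamma or [1.0] * n
    return [g * v / rms for g, v in zip(gamma, x)]
```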
- RoPE (Rotary Position Embedding)
- A position encoding method that applies rotation matrices to query and key vectors, encoding relative positions through rotation angles. RoPE naturally decays attention between distant tokens and supports context length extension techniques like YaRN.
- See Section 04.4 (Positional Encodings)
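A minimal sketch of RoPE over consecutive dimension pairs, along with its key property: the dot product of a rotated query and key depends only on their relative positions:

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate each consecutive (even, odd) dimension pair by an angle
    # proportional to the position; the frequency falls off across pairs.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by `pos * freq`, the inner product of `rope(q, m)` and `rope(k, n)` is a function of `m - n` alone, which is what makes the encoding relative.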
- Softmax
- A function that converts a vector of raw scores (logits) into a probability distribution where each value is between 0 and 1 and all values sum to 1. Softmax is used both in the attention mechanism (to produce attention weights) and as the final layer that produces next-token probabilities.
- See Section 04.1 (Self-Attention Mechanisms)
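A numerically stable implementation subtracts the maximum logit first; the result is unchanged because softmax is shift-invariant:

```python
import math

def softmax(logits):
    # Subtracting the max prevents exp() overflow for large logits
    # without changing the output distribution.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```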
- SwiGLU
- A gated activation function combining the Swish (SiLU) activation with a gated linear unit (GLU). SwiGLU consistently outperforms GELU and ReLU in transformer FFN layers and is the default choice in the Llama, Mistral, and Gemma model families.
- See Section 04.2 (Transformer Components)
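A toy elementwise sketch: SwiGLU multiplies a SiLU-activated "gate" projection with a linear "up" projection. Real layers use full weight matrices; per-dimension scalar weights are used here purely for illustration:

```python
import math

def silu(x):
    # Swish/SiLU: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    # silu(gate projection) elementwise-times (up projection); the down
    # projection that follows in a real FFN is omitted here.
    return [silu(g * v) * (u * v) for v, g, u in zip(x, w_gate, w_up)]
```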
- Transformer
- The neural network architecture introduced in the 2017 paper "Attention Is All You Need." Transformers use self-attention and feed-forward layers to process sequences in parallel, replacing the sequential computation of RNNs. Virtually all modern LLMs are built on the transformer.
- See Chapter 04 (The Transformer Architecture)