Part 3: Working with LLMs
Chapter 10: Working with LLM APIs

Model Pruning and Sparsity

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."

Antoine de Saint-Exupéry
Big Picture

You have a 70-billion-parameter model that performs beautifully on your task, but it requires four A100 GPUs to serve and costs $0.03 per request. Your budget demands single-GPU deployment at under $0.005 per request. Quantization alone (Section 9.1) gets you partway there by shrinking each weight from 16 bits to 4 bits. Pruning gets you the rest of the way by eliminating weights entirely. While Section 9.5 introduced the theory behind pruning and sparsity, this section is about doing it: choosing the right pruning method for your deployment target, running the tools, validating quality, and combining pruning with quantization and distillation to hit aggressive cost and latency targets. By the end, you will have a practical playbook for shipping sparse models in production.

Prerequisites

This section builds on the quantization techniques from Section 9.1: Model Quantization and the inference optimization concepts in Section 9.4: Serving Infrastructure. Familiarity with transformer weight matrices from Section 4.2 will help you understand where pruning operates. The theoretical foundations of pruning are covered in Section 9.5; this section focuses on practical application, tooling, and deployment workflows.

1. Why Pruning Matters for LLM Deployment

Large language models are expensive to serve because they are large. A 70B parameter model at FP16 precision occupies roughly 140 GB of GPU memory, requiring multi-GPU setups just to load the weights. Every forward pass multiplies input activations through these weight matrices, so the number of nonzero weights directly determines both memory footprint and compute cost. Pruning attacks this problem at the source: it sets a fraction of weights to zero, reducing the effective model size without changing the architecture.
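A quick back-of-the-envelope calculation makes the scaling concrete (a sketch; the byte counts are simply the standard per-parameter sizes for each precision):

```python
# Approximate weight-memory footprint of a model at different precisions.
def weight_memory_gb(num_params, bytes_per_param):
    return num_params * bytes_per_param / 1e9

params_70b = 70e9
print(weight_memory_gb(params_70b, 2))          # FP16 (2 bytes/param): 140.0 GB
print(weight_memory_gb(params_70b, 2) * 0.5)    # 50% sparse FP16 (ideal): 70.0 GB
print(weight_memory_gb(params_70b, 0.5) * 0.5)  # 50% sparse + INT4: 17.5 GB
```

The last line assumes ideal storage of the sparse pattern; as Section 5 discusses, realized savings depend on the sparse format and hardware support.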

The practical appeal of pruning is that it operates on a fully trained model. Unlike distillation, which requires training a new smaller model from scratch, pruning preserves the original model's architecture and most of its learned representations. This makes it particularly attractive when you have a fine-tuned model (Section 14.1) that performs well on your task and you need to compress it for deployment without retraining.

The sparsity opportunity. Research consistently shows that 50% to 80% of weights in large transformers can be removed with minimal quality degradation. The reason is that pretraining creates heavily over-parameterized models, where many weights encode redundant or rarely used knowledge. For a specific deployment task (say, customer service chat), the fraction of truly essential weights is even smaller. This means that task-specific pruning can often achieve higher sparsity ratios than general-purpose pruning while maintaining quality on the target domain.

Fun Note

The "lottery ticket hypothesis" (Frankle and Carbin, 2019) suggests that inside every large neural network is a much smaller network that could have been trained from scratch to the same accuracy. Pruning is essentially searching for that winning ticket after the fact. The irony: we spend millions training a massive model, only to discover that most of it was unnecessary all along.

Pruning also complements the API cost optimization strategies from Section 10.3. If you self-host a pruned model behind your own API endpoint using frameworks like vLLM or TGI (Section 9.4), you reduce per-request compute costs directly, potentially making self-hosting competitive with commercial API pricing for high-volume workloads.

Common Mistake: Confusing Sparsity with Speedup

A model with 50% of its weights set to zero is not automatically 50% faster. Unstructured sparsity (random zeros scattered throughout the weight matrices) provides no speedup on standard GPU hardware because dense matrix multiplication kernels cannot skip individual zeros. To translate sparsity into wall-clock speedups, you need either (1) structured sparsity patterns such as NVIDIA's 2:4 format that have hardware support, or (2) specialized sparse inference engines. Many teams invest effort in pruning, celebrate the "50% sparse" metric, then discover zero latency improvement at deployment time. Verify speedup on your target hardware before committing to a pruning strategy.
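A tiny NumPy check illustrates the trap: zeroing half the entries of a dense array changes neither its storage size nor the work a dense matmul performs.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

# Magnitude pruning: zero out the 50% smallest-magnitude entries
threshold = np.median(np.abs(w))
w_sparse = np.where(np.abs(w) < threshold, 0.0, w)

sparsity = (w_sparse == 0).mean()
print(f"Sparsity: {sparsity:.1%}")   # ~50%
print(w.nbytes == w_sparse.nbytes)   # True: identical dense storage
# A dense matmul with w_sparse still performs 1024^3 multiply-adds;
# the zeros are multiplied like any other value.
```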

2. Unstructured Pruning

Unstructured pruning removes individual weights anywhere in the model's weight matrices. The resulting sparse matrices have zeros scattered throughout, with no requirement that the zeros form regular patterns. This flexibility allows unstructured pruning to achieve very high sparsity (often 70% to 90%) because the algorithm can selectively remove whichever weights matter least, regardless of their position.

2.1 Magnitude Pruning

The simplest and oldest pruning strategy is magnitude pruning: sort all weights by their absolute value and set the smallest ones to zero. The intuition is straightforward: weights near zero contribute little to the output of any matrix multiplication. Despite its simplicity, magnitude pruning remains a competitive baseline for moderate sparsity levels (up to 50%).

# Magnitude pruning with PyTorch: zero out the smallest 50% of weights
# in every Linear layer, then make the pruning permanent.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Apply magnitude pruning to all Linear layers
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # 50% sparsity

# Check sparsity of a specific layer
layer = model.model.layers[0].self_attn.q_proj
zeros = (layer.weight == 0).sum().item()
total = layer.weight.numel()
print(f"Sparsity: {zeros / total:.1%}")  # Should be ~50.0%

# Make pruning permanent (remove the reparameterization)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, "weight")

Sparsity: 50.0%
Code Fragment 10.5.1: Magnitude pruning with PyTorch: zero out the smallest 50% of weights

The limitation of magnitude pruning is that it ignores context. A weight with small absolute value might still be critical if it sits on a high-activation pathway. This is why more sophisticated methods have emerged.

2.2 Movement Pruning

Movement pruning (Sanh et al., 2020) takes a different approach: instead of looking at weight magnitudes, it tracks which weights are moving toward zero during fine-tuning. Weights that the optimization process is "trying to eliminate" are pruned, while weights that are growing in magnitude are retained. This produces better results than magnitude pruning when combined with task-specific fine-tuning, because the pruning decision aligns with the model's adaptation to the target task.
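The movement score can be sketched in a few lines. Assuming the standard first-order criterion, the running score for each weight is the negative product of the weight and its gradient, accumulated over fine-tuning steps: weights whose gradients are driving them toward zero accumulate low scores and are pruned. A NumPy simulation of a single accumulation step, with made-up values:

```python
import numpy as np

weights = np.array([0.5, -0.4, 0.05, -0.6])
grads   = np.array([0.9,  0.3, -0.2, -0.8])  # dL/dw from one fine-tuning step

# Movement score: S += -w * dL/dw. Gradient descent moves w by -lr * grad,
# so a negative score means optimization is pushing the weight toward zero.
scores = -weights * grads
print(scores)  # [-0.45  0.12  0.01 -0.48]

# Prune the 50% of weights with the lowest scores
k = len(weights) // 2
prune_idx = np.argsort(scores)[:k]
mask = np.ones_like(weights, dtype=bool)
mask[prune_idx] = False
print(weights * mask)  # w0 and w3 are pruned despite their large magnitudes
```

Note the contrast with magnitude pruning: the two largest weights are removed here because the optimizer is shrinking them, while the tiny w2 survives because it is growing.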

2.3 SparseGPT

SparseGPT (Frantar & Alistarh, 2023) was a breakthrough for LLM pruning because it can prune massive models in a single pass without any retraining. The algorithm works layer by layer, solving an optimization problem that minimizes the reconstruction error introduced by pruning. For each layer, SparseGPT computes the optimal set of weights to remove and adjusts the remaining weights to compensate for the removed ones. The key insight is that this layer-wise approach scales linearly with model size, making it feasible to prune models with hundreds of billions of parameters in hours rather than weeks.

# Using SparseGPT via the SparseML library
# Install: pip install sparseml
from sparseml.transformers import SparseAutoModelForCausalLM
from sparseml.transformers import oneshot

# One-shot pruning with SparseGPT
oneshot(
    model="meta-llama/Llama-3.2-1B",
    dataset="open_platypus",
    recipe="recipe.yaml",  # Specifies 50% unstructured sparsity
    output_dir="./pruned-model",
    num_calibration_samples=512,
)

# The recipe.yaml would contain:
# sparsity_modifiers:
#   SparseGPTModifier:
#     sparsity: 0.5
#     sequential_update: true
#     targets: ["re:model.layers.\\d+.self_attn", "re:model.layers.\\d+.mlp"]
Code Fragment 10.5.2: Using SparseGPT via the SparseML library
Key Insight

SparseGPT and Wanda represent a paradigm shift in pruning: they require no retraining at all. Traditional pruning methods followed a prune-then-retrain cycle that was prohibitively expensive for billion-parameter models. These one-shot methods need only a small calibration dataset (typically 128 to 512 samples) to guide their pruning decisions. This means you can prune a 70B model on a single GPU in a few hours, making pruning accessible to teams without massive compute budgets. The tradeoff is that one-shot methods typically achieve slightly lower quality than prune-and-retrain approaches at the same sparsity level, but the practical accessibility usually outweighs this gap.

3. Structured Pruning

Structured pruning removes entire structural components of the model: full rows or columns from weight matrices, entire attention heads, or even whole transformer layers. Unlike unstructured pruning, which produces irregular sparsity patterns that require specialized sparse matrix libraries, structured pruning produces a genuinely smaller dense model that runs efficiently on standard hardware with no special kernel support.

3.1 Attention Head Pruning

Multi-head attention (Section 4.2) distributes computation across multiple parallel attention heads. Research has shown that many of these heads are redundant: removing 20% to 40% of attention heads often has minimal impact on downstream performance. Michel et al. (2019) demonstrated that in some BERT models, a single attention head per layer suffices for most tasks.

For LLMs, head pruning is typically guided by an importance score computed over a calibration dataset. The score measures how much each head's output contributes to the model's final predictions. Heads with the lowest importance scores are removed, and the model architecture is updated accordingly.

# Gradient-based attention head importance scoring for structured pruning.
# Accumulates gradient magnitudes across Q, K, V projections per head
# to produce a per-layer importance matrix guiding removal decisions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Compute head importance via gradient-based scoring
def compute_head_importance(model, dataloader, num_heads=32, num_layers=16):
    """Score each attention head by its gradient-weighted contribution.

    Each batch must include a "labels" key so that outputs.loss is defined.
    Assumes all of Q, K, V use num_heads heads; for GQA models (including
    the Llama 3.2 family), score K and V with num_key_value_heads instead.
    """
    head_importance = torch.zeros(num_layers, num_heads)

    model.eval()
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        for layer_idx in range(num_layers):
            attn_module = model.model.layers[layer_idx].self_attn
            # Accumulate gradient magnitudes for Q, K, V projections
            for proj in [attn_module.q_proj, attn_module.k_proj, attn_module.v_proj]:
                grad = proj.weight.grad
                if grad is not None:
                    # Reshape to (num_heads, head_dim, input_dim) and sum
                    head_dim = grad.shape[0] // num_heads
                    grad_per_head = grad.view(num_heads, head_dim, -1)
                    head_importance[layer_idx] += grad_per_head.abs().sum(dim=(1, 2))

        model.zero_grad()

    return head_importance / len(dataloader)

# After computing importance, prune the lowest-scoring heads
# by zeroing their Q, K, V, and O projection rows/columns
Code Fragment 10.5.3: Gradient-based attention head importance scoring for structured pruning. The function accumulates gradient magnitudes across Q, K, and V projections for each head, producing a per-layer importance matrix that guides which heads to remove with minimal impact on model quality.

3.2 Layer Pruning

Layer pruning removes entire transformer layers from the model stack. This is the most aggressive form of structured pruning: removing a layer eliminates all its parameters (self-attention, feed-forward network, layer norms) in one operation. A 32-layer model pruned to 24 layers is 25% smaller and roughly 25% faster, with no sparse kernel overhead.

The challenge is selecting which layers to remove. Shallow layers (close to the input) tend to encode syntactic features, while deep layers (close to the output) encode more abstract, task-specific representations. Middle layers are often the most redundant. Studies on LLaMA models have shown that removing 8 out of 32 layers from the middle of the stack preserves over 90% of benchmark performance, while removing layers from the first or last quarter causes severe degradation.
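A minimal sketch of middle-layer removal (the index arithmetic is ours; for a Hugging Face model you would apply the same slicing to model.model.layers as an nn.ModuleList and update config.num_hidden_layers to match):

```python
def drop_middle_layers(layers, num_drop):
    """Remove num_drop layers from the middle of a layer stack."""
    n = len(layers)
    start = (n - num_drop) // 2
    return layers[:start] + layers[start + num_drop:]

# 32-layer stack pruned to 24 layers, mirroring the LLaMA result above
kept = drop_middle_layers(list(range(32)), 8)
print(len(kept))            # 24
print(kept[:3], kept[-3:])  # first and last layers preserved
```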

3.3 Width Pruning

Width pruning reduces the hidden dimension of the model by removing entire neurons (columns) from feed-forward layers or reducing the embedding dimension. This produces a uniformly narrower model. Width pruning is less common than head or layer pruning for LLMs because it requires careful handling of residual connections, but it can be effective when combined with knowledge distillation to recover lost quality.
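The mechanics for a feed-forward block can be sketched with NumPy. Assuming a two-matrix MLP (up-projection then down-projection), each intermediate neuron corresponds to one row of the up matrix and one column of the down matrix; scoring neurons by the product of those norms and removing both pieces together keeps the block consistent. A toy sketch, not a production recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16
w_up = rng.standard_normal((intermediate, hidden))    # hidden -> intermediate
w_down = rng.standard_normal((hidden, intermediate))  # intermediate -> hidden

# Score each intermediate neuron by its combined input/output norm
scores = np.linalg.norm(w_up, axis=1) * np.linalg.norm(w_down, axis=0)

# Keep the top 50% of neurons; both matrices shrink together
keep = np.sort(np.argsort(scores)[intermediate // 2:])
w_up_pruned, w_down_pruned = w_up[keep], w_down[:, keep]
print(w_up_pruned.shape, w_down_pruned.shape)  # (8, 8) and (8, 8)
```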

Key Insight

The fundamental tradeoff between unstructured and structured pruning is sparsity ratio versus hardware efficiency. Unstructured pruning at 70% sparsity removes more parameters than structured pruning at 30%, but the remaining parameters form an irregular pattern that standard GPUs cannot exploit efficiently. Structured pruning at 30% produces a genuinely smaller model that achieves the full theoretical speedup on any hardware. For deployment on commodity hardware without sparse kernel support, structured pruning often delivers better real-world speedups despite removing fewer total parameters. For deployment on NVIDIA Ampere or Hopper GPUs with sparse tensor core support, the 2:4 semi-structured sparsity pattern offers a compelling middle ground.

4. The Wanda Method

Wanda (Sun et al., 2024), which stands for Pruning by Weights and Activations, introduced an elegantly simple idea: instead of using only weight magnitudes to decide what to prune, multiply each weight by the corresponding input activation norm. A weight might be small in absolute terms, but if it sits on a pathway with large activations, removing it would significantly change the layer's output. Conversely, a large weight on a pathway with near-zero activations contributes nothing and can be safely pruned.

The Wanda importance score for weight wij in a linear layer is:

$\text{score}(w_{ij}) = |w_{ij}| \cdot \|X_j\|_2$

where $X_j$ is the j-th column of the input activation matrix collected over the calibration set. This is computed once per layer using a small calibration dataset (128 examples typically suffice), making Wanda nearly as fast as simple magnitude pruning while achieving quality comparable to the much more expensive SparseGPT.

A small numeric example shows how activation context changes pruning decisions:

# Wanda score: numeric walkthrough for 4 weights
import numpy as np

weights = np.array([0.01, -0.8, 0.3, -0.05])
activation_norms = np.array([50.0, 0.1, 2.0, 40.0])

scores = np.abs(weights) * activation_norms
print("weights: ", weights)
print("activation norms:", activation_norms)
print("Wanda scores: ", scores)
# Wanda scores: [0.5, 0.08, 0.6, 2.0]
# Magnitude pruning would drop w0 (0.01) and w3 (-0.05) first.
# Wanda drops w1 (-0.8) instead because its activation norm is tiny.
Code Fragment 10.5.4: Wanda score: numeric walkthrough for 4 weights

Code Fragment 10.5.5 implements Wanda pruning from scratch, showing the importance computation and threshold-based masking for a single linear layer.

# Wanda pruning implementation sketch
# Reference: https://github.com/locuslab/wanda

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def wanda_prune_layer(weight, activation_norms, sparsity_ratio=0.5):
    """
    Prune a linear layer using the Wanda criterion.

    Args:
        weight: (out_features, in_features) weight matrix
        activation_norms: (in_features,) L2 norm of input activations
        sparsity_ratio: fraction of weights to prune
    """
    # Compute importance: |weight| * activation_norm per input feature
    importance = weight.abs() * activation_norms.unsqueeze(0)

    # Find the threshold for the desired sparsity
    num_prune = int(sparsity_ratio * weight.numel())
    threshold = torch.kthvalue(importance.flatten(), num_prune).values

    # Create mask: keep weights above threshold
    mask = importance > threshold
    return weight * mask.float(), mask

# Collect activation norms over calibration data
def collect_activation_norms(model, calibration_loader, layer_idx):
    """Hook into a specific layer to collect input activation norms."""
    norms = []

    def hook_fn(module, input, output):
        # input[0] shape: (batch, seq_len, hidden_dim)
        norms.append(input[0].float().norm(dim=(0, 1)))

    layer = model.model.layers[layer_idx].mlp.gate_proj
    handle = layer.register_forward_hook(hook_fn)

    with torch.no_grad():
        for batch in calibration_loader:
            model(**batch)

    handle.remove()
    return torch.stack(norms).mean(dim=0)
Code Fragment 10.5.5: Wanda pruning implementation sketch

For production use, the llmcompressor library wraps Wanda (and SparseGPT) into a single high-level call:

# Library shortcut: Wanda pruning via llmcompressor (pip install llmcompressor)
from llmcompressor.modifiers.pruning import WandaPruningModifier
from llmcompressor import oneshot

recipe = WandaPruningModifier(sparsity=0.5, block_size=128)
oneshot(
    model="meta-llama/Llama-3.2-1B",
    recipe=recipe,
    dataset="ultrachat-200k",
    output_dir="./llama-1b-wanda-50",
)
Code Fragment 10.5.6: Library shortcut: Wanda pruning via llmcompressor. The entire from-scratch implementation above collapses into a single oneshot call. The library handles calibration data loading, activation collection, and per-layer threshold computation internally.
Real-World Scenario: Pruning a Llama 3 Model for Single-GPU Deployment

Who: Tomás, an MLOps engineer at an e-commerce company.

Situation: He had fine-tuned Llama-3.1-8B for a customer support chatbot and needed to deploy it on a single L4 GPU (24 GB VRAM) to keep serving costs within the team's $2,000/month infrastructure budget.

Problem: At FP16, the model required approximately 16 GB just for weights, leaving minimal room for the KV cache (Section 9.2) during inference. Under concurrent user load, the GPU would run out of memory after just 3 simultaneous conversations.

Decision: Tomás applied Wanda pruning at 50% unstructured sparsity, then quantized the pruned model to INT4 (Section 9.1), reducing the effective model size to approximately 4 GB and freeing 20 GB for the KV cache, batch processing, and operating system overhead.

Result: On the company's customer support evaluation benchmark, the pruned and quantized model retained 94% of the original's quality score while reducing per-request latency by 40% and doubling maximum throughput. The single L4 GPU now handled 12 concurrent conversations comfortably.

Lesson: Combining pruning with quantization delivers compounding memory savings that make single-GPU deployment feasible for models that would otherwise require multi-GPU setups. Always benchmark quality after compression to verify the tradeoff is acceptable for your use case.

5. Sparse Model Inference

Pruning a model is only half the battle. To realize actual speedups, the inference engine must exploit the sparsity pattern. A sparse weight matrix stored in dense format takes exactly the same amount of memory and compute as the original. Three approaches exist for translating sparsity into real performance gains.

5.1 Sparse Matrix Formats and Kernels

Unstructured sparse matrices are stored in compressed formats like CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column) that skip zero values entirely. Libraries like torch.sparse and specialized CUDA kernels can perform matrix multiplications using these formats. However, on GPUs, the overhead of indirect indexing in sparse formats often negates the benefit of skipping zero multiplications, unless sparsity exceeds 90% to 95%. This is why unstructured sparsity at moderate levels (50% to 80%) frequently fails to deliver wall-clock speedups on current hardware, despite reducing the theoretical compute by half or more.
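The storage arithmetic behind this is easy to check. CSR stores one value plus one column index per nonzero, plus a row-pointer array, so at 50% sparsity the "compressed" format can be as large as the dense original. A back-of-the-envelope sketch with FP32 values and int32 indices:

```python
def csr_bytes(rows, cols, sparsity, value_bytes=4, index_bytes=4):
    """Approximate CSR storage: data + column indices + row pointers."""
    nnz = int(rows * cols * (1 - sparsity))
    return nnz * value_bytes + nnz * index_bytes + (rows + 1) * index_bytes

dense = 1024 * 1024 * 4  # dense FP32 matrix: ~4.19 MB
print(csr_bytes(1024, 1024, 0.50) / dense)  # ~1.0: no saving at 50% sparsity
print(csr_bytes(1024, 1024, 0.95) / dense)  # ~0.1: real saving at 95% sparsity
```

The compute story is similar: the indirect indexing that CSR kernels must perform costs more than the skipped multiplications until sparsity is very high.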

5.2 NVIDIA Sparse Tensor Cores and 2:4 Sparsity

NVIDIA's Ampere (A100), Ada Lovelace (L40, RTX 4090), and Hopper (H100) GPU architectures include sparse tensor cores that natively accelerate a specific sparsity pattern: 2:4 structured sparsity. In this pattern, exactly 2 out of every 4 consecutive weights must be zero. The hardware stores only the 2 nonzero values plus a small index, achieving a genuine 2x speedup for matrix multiplications with no software overhead.

# Convert all Linear layers to NVIDIA 2:4 semi-structured sparsity.
# In each group of 4 weights, the 2 smallest are zeroed for hardware acceleration.
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from transformers import AutoModelForCausalLM

# Enable the fast path for semi-structured sparsity
SparseSemiStructuredTensor._FORCE_CUTLASS = True

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Apply 2:4 sparsity pattern to all linear layers
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        # Prune to 2:4 pattern (keep 2 largest per group of 4)
        weight = module.weight.data
        # Reshape into groups of 4 along the input dimension
        w_groups = weight.view(-1, weight.shape[1] // 4, 4)
        # Find the 2 smallest values in each group
        _, indices = w_groups.abs().topk(2, dim=-1, largest=False)
        mask = torch.ones_like(w_groups, dtype=torch.bool)
        mask.scatter_(-1, indices, False)
        w_groups.masked_fill_(~mask, 0)  # in-place: also zeroes `weight`

        # Convert to semi-structured sparse format for hardware acceleration
        module.weight = torch.nn.Parameter(
            to_sparse_semi_structured(module.weight.data)
        )
Code Fragment 10.5.7: Converting a model to NVIDIA 2:4 semi-structured sparsity. For each group of 4 consecutive weights, the two smallest are zeroed, then to_sparse_semi_structured converts the result into a hardware-accelerated format that delivers a genuine 2x speedup on Ampere and Hopper GPUs.

The 2:4 pattern is particularly powerful because it is the only sparsity pattern that delivers guaranteed hardware speedups across all batch sizes and sequence lengths on supported GPUs. For teams deploying on NVIDIA hardware, 2:4 sparsity is often the most practical pruning strategy.

5.3 DeepSparse and Specialized Runtimes

For CPU deployment, Neural Magic's DeepSparse runtime is specifically designed to exploit unstructured sparsity. It converts sparse weight matrices into an optimized execution plan that skips zero computations on CPUs, where the memory access patterns make sparse computation more favorable than on GPUs. DeepSparse can deliver near-GPU throughput for highly sparse models (90%+) running on commodity CPUs, making it an attractive option for teams without GPU infrastructure.

6. Combining Pruning with Quantization and Distillation

The three compression techniques (pruning, quantization, and distillation) are largely orthogonal and can be stacked for aggressive compression. The order in which you apply them matters significantly.

6.1 The Recommended Pipeline

Step 1: Prune first. Apply SparseGPT or Wanda to achieve the target sparsity. Pruning works best on full-precision weights where the importance signal is cleanest.

Step 2: Quantize the sparse model. After pruning, quantize the remaining nonzero weights to INT4 or INT8 using GPTQ or AWQ (Section 9.1). The calibration data for quantization should reflect the pruned model's behavior, not the original dense model's.

Step 3: Optionally, distill. If quality after pruning and quantization is insufficient, use the original dense model as a teacher to fine-tune the compressed student. Even a few hundred steps of distillation can recover 2 to 5 percentage points on benchmarks.
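Steps 1 and 2 can be sketched end to end on a toy weight matrix. The Wanda-style mask and the symmetric absmax INT8 quantizer below are simplified stand-ins for the real tools (SparseGPT/Wanda and GPTQ/AWQ), but they show the mechanics of stacking: prune first, then fit the quantization scale to the surviving weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
act_norms = np.abs(rng.standard_normal(64)).astype(np.float32)

# Step 1: prune 50% by a Wanda-style score |w| * activation norm
score = np.abs(w) * act_norms[None, :]
mask = score > np.median(score)
w_pruned = w * mask

# Step 2: symmetric INT8 quantization, scale fit to the *pruned* weights
scale = np.abs(w_pruned).max() / 127.0
w_int8 = np.round(w_pruned / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

sparsity = (w_pruned == 0).mean()
err = np.linalg.norm(w_dequant - w_pruned) / np.linalg.norm(w_pruned)
print(f"Sparsity: {sparsity:.1%}, INT8 reconstruction error: {err:.3%}")
```

Note that the zeros survive quantization exactly (they map to integer 0), so the sparsity pattern is preserved through Step 2.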

Dense Model (70B FP16 = 140 GB) → Prune → Sparse Model (50% sparse = 70 GB*) → Quantize → Sparse + INT4 (~18 GB effective) → Distill → Production Model (quality recovered). *Storage savings depend on sparse format; compute savings depend on hardware support. Typical total compression: 8x to 16x reduction in effective model size.
Figure 10.5.1: The three-stage compression pipeline. Each stage is orthogonal, and the order (prune, quantize, distill) matters for best results.

6.2 Sparse Quantized Formats

The combination of 2:4 sparsity and INT8 quantization is particularly well supported on NVIDIA hardware, achieving a combined 4x speedup (2x from sparsity times 2x from quantization) with a single fused kernel. For even more aggressive compression, some teams apply 2:4 sparsity at INT4 precision, though this requires careful quality validation.

Key Insight

When combining pruning with quantization, the order matters because each technique distorts the weight distribution. Pruning removes weights that the quantization calibration step would otherwise factor into its range and scale calculations. If you quantize first and then prune, the quantization parameters are computed on a distribution that includes weights that will later be zeroed, leading to suboptimal quantization of the surviving weights. Pruning first allows quantization to adapt its parameters to the sparse model's actual weight distribution, preserving more quality. Think of it like tailoring a suit: you cut the fabric (prune) before hemming the edges (quantize), not the other way around.

7. Practical Guidelines

Choosing the right pruning approach depends on your deployment hardware, latency requirements, and acceptable quality loss. The following decision framework covers the most common scenarios.

7.1 Which Method When

| Scenario | Recommended Method | Target Sparsity | Expected Quality |
| --- | --- | --- | --- |
| NVIDIA Ampere/Hopper GPU | 2:4 semi-structured (Wanda or SparseGPT) | 50% (fixed) | 95% to 98% of dense |
| CPU deployment with DeepSparse | Unstructured (SparseGPT) | 70% to 90% | 90% to 95% of dense |
| Standard GPU, no sparse support | Structured (layer or head pruning) | 20% to 40% | 92% to 97% of dense |
| Maximum compression needed | Prune (50%) + Quantize (INT4) + Distill | 50% + 4x quant | 90% to 95% of dense |
| Task-specific deployment | Movement pruning during fine-tuning | 60% to 80% | 93% to 97% on target task |

7.2 Sparsity Targets and Quality Tradeoffs

The 50% sweet spot. Across most methods and models, 50% sparsity is the point where you get meaningful compression with minimal quality loss. Below 50%, the compression savings are often not worth the engineering effort. Above 70%, quality degradation becomes noticeable on challenging benchmarks, though task-specific performance may still be acceptable.

Layer-wise sparsity allocation. Not all layers are equally pruneable. Embedding layers and the final language modeling head should generally be left dense (or pruned very conservatively), as they have outsized impact on model quality. Middle transformer layers tolerate higher sparsity than the first and last few layers. Tools like SparseGPT and Wanda support per-layer sparsity targets to exploit this observation.
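A non-uniform schedule is easy to express in code. The ramp below (denser ends, sparser middle) is our own illustrative allocation, not a published recipe; real tools accept an equivalent per-layer target map.

```python
def sparsity_schedule(num_layers, low=0.3, high=0.6):
    """Assign higher sparsity to middle layers, lower to first/last layers."""
    schedule = {}
    mid = (num_layers - 1) / 2
    for i in range(num_layers):
        # Linear ramp: peak sparsity at the middle, tapering toward the ends
        frac = 1.0 - abs(i - mid) / mid
        schedule[i] = round(low + (high - low) * frac, 3)
    return schedule

sched = sparsity_schedule(16)
print(sched[0], sched[8], sched[15])  # ends stay denser than the middle
```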

Calibration data matters. The choice of calibration data for one-shot methods (SparseGPT, Wanda) significantly affects which weights are identified as important. Use calibration data that resembles your deployment distribution. If your application is code generation, calibrate on code samples, not Wikipedia articles. Even 128 well-chosen calibration examples can outperform 1024 poorly chosen ones.

7.3 Validation Strategy

After pruning, always validate on a held-out evaluation set that represents your production workload. Standard benchmarks (MMLU, HellaSwag) provide a general quality signal, but task-specific evaluation (Section 25.1) is essential because pruning can affect different capabilities unevenly. A model that retains 95% of its general benchmark score might have lost 20% on a specific subtask that matters for your application.

Key Takeaways

  • Pruning removes weights from an already-trained model; around 50% sparsity is typically achievable with minimal quality loss, and one-shot methods (SparseGPT, Wanda) need only a small calibration set rather than retraining.
  • Sparsity becomes speedup only when the hardware or runtime can exploit it: use 2:4 semi-structured sparsity on supported NVIDIA GPUs, structured pruning on commodity hardware, and specialized runtimes like DeepSparse for CPU deployment.
  • When stacking compression techniques, prune first, then quantize, then optionally distill to recover quality.
  • Always validate pruned models on a task-specific evaluation set; general benchmarks can hide uneven capability loss.
Research Frontier

Activation-aware structured pruning. Recent work is bridging the gap between unstructured and structured pruning by using activation statistics to guide structured decisions. Methods like SliceGPT (Ashkboos et al., 2024) remove entire rows and columns from weight matrices based on their contribution to the activation covariance, achieving structured compression ratios previously possible only with unstructured methods.

Another active direction is combining pruning with Mixture-of-Experts (Section 7.2) architectures, where experts that are rarely activated by production traffic are pruned entirely, reducing model size without affecting the routing of commonly used experts. Hardware-aware pruning, where the pruning algorithm directly optimizes for measured latency on the target deployment hardware rather than theoretical FLOP reduction, is also gaining traction as the gap between theoretical and realized speedups remains a persistent challenge.

8. Exercises

Exercise
Conceptual

Structured vs. unstructured tradeoffs. A colleague argues that unstructured pruning at 80% sparsity is always better than structured pruning at 30% because it removes more parameters. Explain why this reasoning is flawed. In your answer, discuss: (a) the role of hardware support in translating parameter reduction into wall-clock speedups, (b) the storage format overhead of sparse matrices, and (c) a scenario where 30% structured pruning delivers better real-world latency than 80% unstructured pruning.

Exercise
Coding

Implement Wanda from scratch. Using PyTorch, implement the Wanda pruning criterion for a single nn.Linear layer. Your implementation should: (a) register a forward hook to collect input activation norms over a calibration batch, (b) compute the importance score |w| * ||x|| for each weight, (c) create a binary mask that zeros the lowest-scoring 50% of weights, and (d) verify that the pruned layer's output on the calibration data deviates by less than 5% (relative L2 error) from the original. Compare your results with simple magnitude pruning at the same sparsity level.

Exercise
Analysis

Pruning sensitivity analysis. Load a small language model (e.g., GPT-2 or Llama-3.2-1B) and compute the per-layer sensitivity to pruning. For each transformer layer, apply magnitude pruning at 50% sparsity to only that layer (keeping all other layers dense) and measure the perplexity increase on a validation set. Plot the per-layer sensitivity and identify: (a) which layers are most sensitive to pruning, (b) whether there is a pattern (early, middle, or late layers), and (c) how you would use this information to design a non-uniform sparsity schedule.

Exercise
Coding

2:4 sparsity conversion. Write a function that converts a dense weight matrix to the 2:4 sparsity pattern by selecting the two largest-magnitude weights in each group of four consecutive elements along the input dimension. Apply this function to all linear layers in a small transformer model and measure: (a) the perplexity before and after conversion, (b) the actual inference speedup using torch.sparse.to_sparse_semi_structured on a CUDA device, and (c) how the speedup varies with batch size. If you lack GPU access, simulate the pattern and report only the quality metrics.

Exercise
Analysis

Compression pipeline comparison. Take a small language model and compare three compression pipelines on both quality (perplexity) and size: (a) quantization only (INT4 via GPTQ), (b) pruning only (50% Wanda), and (c) pruning (50% Wanda) followed by quantization (INT4). For each pipeline, measure the model size on disk, the perplexity on a held-out validation set, and the memory footprint during inference. Determine which pipeline achieves the best quality-per-byte tradeoff and explain why the combined approach does or does not outperform the individual techniques.

Lab: Build a Data Curation Pipeline

Duration: ~60 minutes · Level: Intermediate

Objective

Use the HuggingFace datasets library to load a text corpus, apply quality filters, perform deduplication (first with exact hashing, then with MinHash via datasketch), and tokenize the result into a training-ready dataset.

What You'll Practice

  • Loading and streaming large datasets with the datasets library
  • Implementing quality filters (length, language, perplexity heuristics)
  • Exact deduplication with Python hash sets
  • Approximate deduplication with MinHash and LSH
  • Tokenizing filtered data for model training

Setup

Install the required libraries. No GPU is needed for this lab.

pip install datasets transformers datasketch
Code Fragment 10.5.8: Install dependencies

Steps

Step 1: Load and inspect the dataset

Load a small text corpus and examine its structure, then apply basic quality filters.

from datasets import load_dataset

# Load a subset of a public corpus
ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(f"Original size: {len(ds):,} examples")
print(f"Sample: {ds[100]['text'][:200]}")

# Quality filters: remove empty lines and very short documents
def quality_filter(example):
    text = example["text"].strip()
    if len(text) < 50:  # too short
        return False
    if text.startswith("="):  # section headers only
        return False
    return True

filtered = ds.filter(quality_filter, num_proc=4)
print(f"After quality filter: {len(filtered):,} examples")
print(f"Removed: {len(ds) - len(filtered):,} ({(len(ds)-len(filtered))/len(ds)*100:.1f}%)")
Code Fragment 10.5.9: Load a subset of a public corpus
Hint

WikiText-103 contains many empty lines and section headers that are not useful for training. Filtering these out typically removes 60-70% of the rows while retaining the majority of actual content.

Step 2: Exact deduplication with hashing

Remove exact duplicate texts using a simple hash set, then measure how many duplicates existed.

import hashlib

seen_hashes = set()
duplicates = 0

def dedup_exact(example):
    global duplicates
    text_hash = hashlib.md5(example["text"].encode()).hexdigest()
    if text_hash in seen_hashes:
        duplicates += 1
        return False
    seen_hashes.add(text_hash)
    return True

deduped_exact = filtered.filter(dedup_exact, num_proc=1) # single-proc for hash set
print(f"After exact dedup: {len(deduped_exact):,} examples")
print(f"Exact duplicates removed: {duplicates:,}")
After exact dedup: 669,473 examples
Exact duplicates removed: 1,368
Code Fragment 10.5.10: Implementation of dedup_exact
Hint

Exact deduplication catches only identical copies. Real corpora often contain near-duplicates (texts that differ by a few words), which require fuzzy matching.

Step 3: Near-duplicate detection with MinHash

Use MinHash signatures and Locality-Sensitive Hashing (LSH) from the datasketch library to find and remove near-duplicates.

from datasketch import MinHash, MinHashLSH

# Build LSH index
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def compute_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode("utf-8"))
    return m

# Index a subset (full corpus would take longer)
subset = deduped_exact.select(range(min(5000, len(deduped_exact))))
near_dupes = set()

for i, example in enumerate(subset):
    mh = compute_minhash(example["text"])
    # Check for near-duplicates already in the index
    candidates = lsh.query(mh)
    if candidates:
        near_dupes.add(i)
    else:
        lsh.insert(str(i), mh)

print(f"Near-duplicates found: {len(near_dupes)} out of {len(subset)}")
print(f"Near-duplicate rate: {len(near_dupes)/len(subset)*100:.2f}%")

# Remove near-duplicates
keep_indices = [i for i in range(len(subset)) if i not in near_dupes]
clean_data = subset.select(keep_indices)
print(f"Clean dataset size: {len(clean_data):,}")
Near-duplicates found: 237 out of 5000
Near-duplicate rate: 4.74%
Clean dataset size: 4,763
Code Fragment 10.5.11: Build LSH index
Hint

MinHash LSH with threshold=0.8 flags documents that share roughly 80% of their word-level shingles. Lowering the threshold catches more near-duplicates but increases false positives. Production pipelines typically use thresholds between 0.7 and 0.85.
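To build intuition for what the threshold measures, the following self-contained sketch (no datasketch required) estimates word-level Jaccard similarity by counting matching MinHash signature slots. Approximating the hash permutations by salting a single MD5 hash is a simplification for illustration, not how datasketch implements them.

```python
import hashlib

def minhash_sig(text, num_perm=64):
    # One signature slot per "permutation", approximated by salting the hash.
    words = set(text.lower().split())
    return [min(int(hashlib.md5(f"{p}:{w}".encode()).hexdigest(), 16)
                for w in words)
            for p in range(num_perm)]

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
sig_a, sig_b = minhash_sig(a), minhash_sig(b)

# Fraction of matching slots estimates the Jaccard similarity of the word sets.
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
true_jaccard = (len(set(a.split()) & set(b.split()))
                / len(set(a.split()) | set(b.split())))
print(f"estimate={estimate:.2f}, true={true_jaccard:.2f}")
```

With only 64 permutations the estimate is noisy; datasketch's default of 128 (used in Step 3) roughly halves the variance, and the LSH index turns this pairwise comparison into a sublinear lookup.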

Step 4: Tokenize for training

Tokenize the cleaned dataset and pack sequences into fixed-length chunks ready for language model training.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 512

def tokenize_and_chunk(examples):
    tokens = tokenizer(examples["text"], truncation=False)["input_ids"]
    # Concatenate all tokens, then split into fixed-size blocks
    all_tokens = [t for doc in tokens for t in doc]
    chunks = [all_tokens[i:i + block_size]
              for i in range(0, len(all_tokens) - block_size + 1, block_size)]
    return {"input_ids": chunks}

tokenized = clean_data.map(
    tokenize_and_chunk, batched=True, remove_columns=clean_data.column_names
)
print(f"Training chunks: {len(tokenized):,} (each {block_size} tokens)")
print(f"Total tokens: {len(tokenized) * block_size:,}")
Training chunks: 4,218 (each 512 tokens)
Total tokens: 2,159,616
Code Fragment 10.5.12: Implementation of tokenize_and_chunk
Hint

Concatenating all documents and then chunking into fixed blocks is the standard approach for pretraining data preparation. It avoids wasting compute on padding tokens. For fine-tuning, you would instead keep documents separate and pad to the longest in each batch.
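A quick back-of-the-envelope comparison, using made-up document lengths, shows why packing wastes far fewer tokens than per-document padding:

```python
# Hypothetical document lengths, in tokens.
doc_lens = [120, 430, 75, 512, 260, 33]
block = 512

# Packing: concatenate everything, cut into full blocks; only the tail is lost
# (the chunking code in Step 4 drops the final partial block the same way).
total = sum(doc_lens)
packed_waste = total % block

# Padding: one sequence per document, each padded up to the block size.
padded_waste = sum(block - min(length, block) for length in doc_lens)

print(f"packed waste: {packed_waste} tokens, padded waste: {padded_waste} tokens")
```

Here padding burns 1,642 of 3,072 sequence slots on pad tokens, while packing loses only the 406-token remainder, which is why pretraining pipelines almost always pack.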

Expected Output

  • Quality filtering removes 50-70% of rows (mostly empty lines and headers)
  • Exact dedup catches a small number of identical duplicates
  • MinHash near-dedup identifies additional 1-5% near-duplicate documents
  • Final tokenized dataset of fixed-length chunks ready for training

Stretch Goals

  • Add a language detection filter using langdetect or fasttext to keep only English text
  • Compute perplexity scores with a small model and filter documents above a threshold
  • Save the processed dataset to disk with dataset.save_to_disk() and reload it to verify persistence

What's Next?

Model pruning is one piece of the deployment optimization toolkit. Combined with quantization, speculative decoding, and the serving infrastructure from Section 9.4, pruning enables deployment configurations that were impractical with dense models.

In Chapter 11: Prompt Engineering, we shift from optimizing model internals to optimizing how you communicate with LLMs, covering techniques from zero-shot prompting to chain-of-thought reasoning. For teams considering fine-tuning compressed models, Chapter 15 on PEFT covers parameter-efficient techniques like LoRA that compose naturally with pruned models, and Chapter 16 on distillation provides the recovery strategy when aggressive pruning degrades quality beyond acceptable thresholds.

References and Further Reading
Foundational Pruning Methods

Frantar, E. & Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. ICML 2023.

Introduced the one-shot pruning paradigm for LLMs, demonstrating that billion-parameter models can be pruned to 50% or more sparsity without retraining. The layer-wise reconstruction approach scales linearly with model size.

Paper

Sun, M. et al. (2024). A Simple and Effective Pruning Approach for Large Language Models. ICLR 2024.

The Wanda paper. Shows that multiplying weight magnitudes by input activation norms produces pruning decisions comparable to SparseGPT at a fraction of the computational cost. An essential read for practical LLM pruning.

Paper
Structured Pruning

Ashkboos, S. et al. (2024). SliceGPT: Compress Large Language Models by Deleting Rows and Columns. ICLR 2024.

Demonstrates structured pruning of LLMs by removing entire rows and columns from weight matrices, achieving significant compression without sparse kernel requirements.

Paper

Sanh, V. et al. (2020). Movement Pruning: Adaptive Sparsity during Fine-Tuning. NeurIPS 2020.

Introduces movement pruning, which prunes weights based on their movement during fine-tuning rather than their magnitude. Achieves superior results for task-specific pruning compared to magnitude-based methods.

Paper
Hardware-Accelerated Sparsity

NVIDIA. (2021). Accelerating Inference with Sparsity Using Ampere and TensorRT. NVIDIA Developer Blog.

NVIDIA's guide to using 2:4 structured sparsity on Ampere GPUs. Covers the hardware mechanism, the sparse tensor core architecture, and practical guidance for achieving 2x speedups with the semi-structured pattern.

Documentation

PyTorch. (2025). Sparse Tensor Documentation. PyTorch Official Docs.

Official PyTorch documentation for sparse tensor operations, including semi-structured sparsity support via to_sparse_semi_structured. Essential reference for implementing the code examples in this section.

Documentation