Section 31.7: Edge and On-Device LLM Deployment

The fastest API call is the one you never make. The most private data is the data that never leaves the device.
Every engineer who has waited for a cloud inference round trip on a spotty connection

Big Picture

Not every LLM inference should travel to the cloud. Privacy constraints, unreliable connectivity, latency requirements, and per-query API costs at scale all create demand for on-device inference. This section surveys the leading edge deployment frameworks (llama.cpp, Ollama, MLX, ExecuTorch), the quantization techniques that make large models fit on consumer hardware, and the battery and thermal constraints that shape mobile inference strategies. The inference optimization techniques from Chapter 9 provide the theoretical foundation, while this section focuses on the practical tools and tradeoffs for deploying models at the edge.

Prerequisites

This section builds on inference optimization from Chapter 9, deployment architecture from Section 31.1, and quantization fundamentals from Chapter 16.

Figure: A tiny cartoon robot sits on a smartphone, running a miniaturized version of a large brain that has been compressed to fit inside the small device; the large original brain is visible in the background inside a data center. Caption: Edge deployment moves the model to the user's hardware, eliminating network latency, removing per-query API costs, and keeping sensitive data on the device.

1. Why Edge Deployment Matters

Not every LLM inference should travel to the cloud. When a physician uses an AI assistant in a hospital without reliable internet, when a mobile app needs sub-100ms autocomplete without per-query API costs, or when a defense contractor cannot send data to a third-party server, on-device inference is not a nice-to-have; it is a requirement. Edge deployment moves the model to the user's hardware, eliminating network latency, removing per-query API costs, enabling offline operation, and keeping sensitive data on the device.

The economics are compelling at scale. An application serving 10 million daily queries at $0.002 per query spends $20,000 per day on API costs. If a quantized 3B-parameter model running on the user's device can handle 80% of those queries with acceptable quality, the API bill falls to about $4,000 per day, a saving of roughly $16,000 per day. The trade-off is clear: smaller models with lower quality versus larger cloud models with higher quality, and the art of edge deployment is finding the right balance for your use case.
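The arithmetic in this paragraph can be sketched as a small function. The function name and its parameters are illustrative, and the model ignores on-device costs such as app size, battery use, and engineering effort.

```python
def daily_api_cost(
    queries_per_day: int,
    cost_per_query: float,
    on_device_share: float = 0.0,
) -> float:
    """Daily cloud API spend when a share of queries is served on-device."""
    if not 0.0 <= on_device_share <= 1.0:
        raise ValueError("on_device_share must be between 0 and 1")
    cloud_queries = queries_per_day * (1.0 - on_device_share)
    return cloud_queries * cost_per_query


# Figures from the text: 10 million queries per day at $0.002 each.
baseline = daily_api_cost(10_000_000, 0.002)
tiered = daily_api_cost(10_000_000, 0.002, on_device_share=0.8)

print(f"Cloud only:    ${baseline:,.0f} per day")
print(f"80% on-device: ${tiered:,.0f} per day")
print(f"Daily saving:  ${baseline - tiered:,.0f}")
```

Varying on_device_share shows how sensitive the saving is to the fraction of queries the small model can actually handle, which is the number worth measuring first.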

Key Insight

Edge deployment is not about replacing cloud models. The most effective production architectures use a tiered approach: a small on-device model handles simple queries (autocomplete, classification, formatting) with zero latency and zero cost, while complex queries (multi-step reasoning, long-context synthesis) are routed to cloud models. The on-device model also serves as a fallback when the network is unavailable, providing degraded but functional service rather than a blank screen.
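A minimal sketch of the fallback behavior described here, assuming both models are exposed as plain Python callables. The task labels, the function names, and the choice of exceptions that count as "network unavailable" are all assumptions made for illustration.

```python
from typing import Callable

# Task labels treated as "simple"; taken from the examples in the text.
SIMPLE_TASKS = {"autocomplete", "classification", "formatting"}


def answer(
    task: str,
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
) -> tuple[str, str]:
    """Return (tier, text), preferring the device for simple tasks."""
    if task in SIMPLE_TASKS:
        return "device", local_model(prompt)
    try:
        return "cloud", cloud_model(prompt)
    except (ConnectionError, TimeoutError):
        # Network unavailable: degrade to the on-device model instead of failing.
        return "device-fallback", local_model(prompt)


# Stand-in models for demonstration only.
def fake_local(prompt: str) -> str:
    return f"[local] {prompt}"


def fake_cloud_offline(prompt: str) -> str:
    raise ConnectionError("no network")


print(answer("autocomplete", "Dear Dr.", fake_local, fake_cloud_offline))
print(answer("reasoning", "Plan a 3-step migration", fake_local, fake_cloud_offline))
```

Returning the tier alongside the text lets the application tell the user when a response came from the degraded path.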

Fun Fact

Apple's on-device language model for iOS 18 runs a 3B-parameter model that fits in 1.5 GB of memory after quantization. It handles autocomplete, message summarization, and notification prioritization without sending a single token to the cloud. Your phone is now running a language model that would have been considered state-of-the-art in 2021, and it does so while you are checking your email in airplane mode.

Use Case Matrix

Use Case	Primary Driver	Typical Model Size	Target Hardware
Medical records assistant	Privacy (HIPAA)	3B to 8B	Workstation GPU
Mobile keyboard autocomplete	Latency, cost	0.5B to 1B	Phone NPU/GPU
Offline field assistant	No connectivity	1B to 3B	Laptop CPU
Smart home device	Latency, privacy	0.1B to 0.5B	ARM SoC
Enterprise document processing	Data sovereignty	8B to 70B	On-premise GPU cluster

2. llama.cpp: Universal C/C++ Inference

llama.cpp, created by Georgi Gerganov, is the foundational project for running LLMs on consumer hardware. Written in pure C/C++ with no Python dependencies, it compiles and runs on virtually any platform: Linux, macOS, Windows, Android, iOS, and even Raspberry Pi. The project introduced GGUF (commonly expanded as GPT-Generated Unified Format), a single-file model format that stores weights, tokenizer, and metadata together and has become the de facto standard for distributing quantized models. llama.cpp supports dozens of model architectures (Llama, Mistral, Phi, Qwen, Gemma, and many others) and provides both a CLI and a built-in HTTP server compatible with the OpenAI API format.

# Build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # For NVIDIA GPUs; omit for CPU-only
cmake --build build --config Release -j $(nproc)

# Download a GGUF model (example: Llama 3.2 3B at Q4_K_M quantization)
# Models are available on Hugging Face in GGUF format
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Run interactive chat
./build/bin/llama-cli \
 -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
 --chat-template llama3 \
 -c 4096 \
 -ngl 99 # Offload all layers to GPU

# Start an OpenAI-compatible API server
./build/bin/llama-server \
 -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
 --host 0.0.0.0 --port 8080 \
 -c 4096 -ngl 99

Sample model output (truncated):

Quantization in LLMs reduces the numerical precision of model weights from 16-bit or 32-bit floating point to lower bit widths like 8-bit or 4-bit integers. This dramatically reduces memory requirements and speeds up inference, often with minimal impact on output quality for well-calibrated quantization methods.

Edge deployment refers to running AI models directly on end-user devices such as smartphones, laptops, or IoT hardware rather than in cloud data centers. This approach reduces latency, improves privacy by keeping data on-device, and eliminates dependency on network connectivity...

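Code Fragment 31.7.1: Building llama.cpp from source, downloading a GGUF model, and running it either as an interactive chat or as an OpenAI-compatible HTTP server. The -ngl 99 flag offloads all layers to the GPU.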
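Once llama-server is running, any HTTP client can talk to it. The sketch below uses only the Python standard library and assumes the server from the commands above is listening on localhost:8080 and serving the OpenAI-style /v1/chat/completions route; the helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Explain edge deployment in one sentence.") returns the generated text. Separating request construction from sending keeps the payload easy to inspect without a live server.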

GGUF Quantization Levels

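GGUF models come in various quantization levels that trade quality for memory and speed. The name encodes the nominal bit width and the quantization method (for example, Q4_K_M is a 4-bit K-quant in its medium mix). The effective bits per weight listed below are higher than the nominal width because each block of weights also stores scaling metadata, and the K-quant mixes keep some tensors at higher precision. Understanding these trade-offs is essential for choosing the right model variant for your hardware constraints.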

Quantization	Bits per Weight	Size (7B model)	Quality Impact	Best For
Q8_0	8.5	~7.2 GB	Near-lossless	Maximum quality, ample RAM
Q6_K	6.6	~5.5 GB	Very small loss	High quality, moderate RAM
Q5_K_M	5.7	~4.8 GB	Small loss	Good balance for most uses
Q4_K_M	4.8	~4.1 GB	Moderate loss	Most popular; fits 8GB VRAM
Q3_K_M	3.9	~3.3 GB	Noticeable loss	Tight memory constraints
Q2_K	3.4	~2.8 GB	Significant loss	Extreme constraints only
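The sizes in the table follow from a simple relation: file size is roughly parameter count times bits per weight, divided by 8. The sketch below applies that relation; the function names and the 1.5 GB allowance for KV cache and runtime buffers are assumptions, and estimates will differ slightly from the table because "7B" models rarely have exactly 7 billion parameters.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_memory(
    n_params: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead_gb: float = 1.5,  # assumed allowance for KV cache and buffers
) -> bool:
    """Rough check that weights plus runtime overhead fit in memory."""
    return estimate_weight_size_gb(n_params, bits_per_weight) + overhead_gb <= memory_gb


# Bits-per-weight values taken from the table above, applied to 7e9 parameters.
for name, bpw in [("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    size = estimate_weight_size_gb(7e9, bpw)
    ok = fits_in_memory(7e9, bpw, memory_gb=8.0)
    print(f"{name}: ~{size:.1f} GB of weights, fits in 8 GB: {ok}")
```

The overhead term grows with context length, so a model that fits at a 4K context may not fit at 32K.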

Quantization Quality Cliff

Quality degradation from quantization is not linear. Models typically maintain near-full quality down to Q5_K_M (5 to 6 bits), show modest degradation at Q4_K_M (4 to 5 bits), and then experience a steeper quality drop below 4 bits. The exact cliff depends on model architecture and size: larger models (70B+) tolerate aggressive quantization better than smaller models (3B) because they have more redundancy. Always benchmark your specific use case at multiple quantization levels rather than relying on general guidelines.

3. Ollama: Developer-Friendly Local Model Management

Ollama wraps llama.cpp (and other backends) in a user-friendly interface inspired by Docker. Instead of downloading GGUF files manually and managing command-line flags, Ollama provides a pull/run workflow that handles model downloads, GPU detection, memory management, and API serving automatically. It supports macOS, Linux, and Windows, and serves a REST API on port 11434 by default, including an OpenAI-compatible endpoint under /v1.

# Install Ollama (Linux; on macOS and Windows, use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain edge deployment in one paragraph."

# List models that have been downloaded locally
ollama list

# Run with a specific quantization
ollama pull llama3.2:3b-instruct-q5_K_M
ollama run llama3.2:3b-instruct-q5_K_M

Code Fragment 31.7.2: Ollama installation, model pulling, and listing workflow. The pull/run commands mirror Docker semantics, and the quantization tag (q5_K_M) selects a specific quality/size tradeoff at download time.

Custom Modelfiles

Ollama's Modelfile system allows you to create custom model configurations with specific system prompts, parameters, and templates. This is useful for packaging a fine-tuned or customized model as a reusable unit that team members can pull and run identically.

# Modelfile: a custom medical assistant configuration
FROM llama3.2:3b-instruct-q5_K_M

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"

SYSTEM """You are a medical terminology assistant running on a hospital
workstation. You help clinicians look up drug interactions, medical
terminology, and clinical guidelines. You always include a disclaimer
that your outputs are for reference only and do not constitute medical
advice. All data stays on this device; no information is sent externally."""

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

Code Fragment 31.7.3: A Modelfile that customizes a Llama 3.2 model for medical terminology lookups. The PARAMETER directives set a low temperature (0.3) for factual consistency, and the SYSTEM block injects a domain-specific persona with a mandatory disclaimer.

# Build and run the custom model
ollama create medical-assistant -f Modelfile
ollama run medical-assistant "What are the contraindications for metformin?"

Code Fragment 31.7.4: Building and running the custom Modelfile. The ollama create command packages the base model, parameters, and system prompt into a single named unit that any team member can launch with ollama run.

Programmatic Access

This snippet queries a locally running Ollama model from Python through the OpenAI-compatible endpoint, first with a blocking request and then with streaming.

"""Using Ollama's API from Python (OpenAI-compatible endpoint)."""

from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost:11434.
# The client requires an api_key value, but Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantization in LLMs?"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)

# Streaming works identically to the OpenAI API
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain edge deployment."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no text (for example, the final chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

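Code Fragment 31.7.5: Querying Ollama through its OpenAI-compatible endpoint on localhost:11434 with the openai Python client. Only the base_url changes relative to a cloud deployment, and streaming uses the same interface as the OpenAI API.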

4. MLX: Optimized Inference on Apple Silicon

Apple's MLX framework is designed specifically for Apple Silicon (M1 through M4 chips), exploiting the unified memory architecture in which the CPU and GPU share the same memory, so arrays never need to be copied between them. (MLX runs on the CPU and GPU; it does not target the Neural Engine.) For Mac users, MLX is often faster than llama.cpp because its Metal kernels are written specifically for Apple's GPU architecture, although llama.cpp also ships a Metal backend and the gap depends on the model and chip. The companion library mlx-lm provides a high-level interface for text generation with Hugging Face model compatibility.

# Install MLX and mlx-lm
pip install mlx mlx-lm

# Run a model directly from Hugging Face
mlx_lm.generate \
 --model mlx-community/Llama-3.2-3B-Instruct-4bit \
 --prompt "Explain the benefits of on-device inference." \
 --max-tokens 200

# Convert a Hugging Face model to MLX format with quantization
mlx_lm.convert \
 --hf-path meta-llama/Llama-3.2-3B-Instruct \
 --mlx-path ./llama-3.2-3b-mlx-4bit \
 --quantize \
 --q-bits 4 \
 --q-group-size 64

Sample output from the Python example in Code Fragment 31.7.7 (truncated; throughput depends on hardware):

Prompt: 12 tokens, 1148.3 tokens/s
Generation: 300 tokens, 87.2 tokens/s

MLX offers several advantages for on-device inference on Apple Silicon. First, it leverages the unified memory architecture, allowing the CPU and GPU to share the same memory without costly data transfers. Second, it uses Metal shaders optimized for Apple's GPU cores, achieving higher throughput than generic GGML-based solutions. Third, it supports lazy evaluation and graph compilation, which reduces overhead for repeated inference calls...

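Code Fragment 31.7.6: Installing MLX and mlx-lm, generating text from a pre-quantized Hugging Face model, and converting a model to 4-bit MLX format.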

"""MLX text generation with streaming."""

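from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load a quantized model (downloads from HF if not cached)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Format the prompt with the model's chat template
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What are the advantages of MLX?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Recent mlx-lm releases take sampling settings through a sampler object;
# older releases accepted temp=0.7 directly as a keyword argument.
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=300,
    sampler=make_sampler(temp=0.7),
    verbose=True,  # Prints tokens/sec performance
)
print(response)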

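Code Fragment 31.7.7: Loading a 4-bit model with mlx-lm, applying the chat template, and generating a response. With verbose=True, the call also reports prompt and generation throughput in tokens per second.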

MLX's unified memory model means that a MacBook Pro with 36GB of RAM can run a 4-bit quantized 30B model entirely in memory without the CPU-to-GPU transfer bottleneck that limits performance on discrete GPU systems. For development and prototyping workflows on Apple hardware, MLX provides the fastest path from model selection to running inference.

5. ExecuTorch: PyTorch Models on Mobile and Edge

Meta's ExecuTorch is the production runtime for deploying PyTorch models on mobile phones, IoT devices, and other resource-constrained hardware. Unlike llama.cpp (which requires models in GGUF format) or MLX (which targets Apple Silicon), ExecuTorch takes standard PyTorch models and exports them to an optimized format (.pte) that runs on Android, iOS, and embedded Linux with hardware-specific acceleration. ExecuTorch supports Qualcomm Hexagon DSPs, Apple Core ML, MediaTek APUs, and ARM CPU backends.

"""Export a model for ExecuTorch deployment (simplified workflow)."""

import torch
from executorch.exir import to_edge
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small model suitable for mobile
model_name = "microsoft/phi-2"  # 2.7B parameters
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
)
model.eval()  # Export in inference mode
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Capture the model graph with torch.export
example_input = tokenizer("Hello", return_tensors="pt")
traced = torch.export.export(
    model,
    (example_input["input_ids"],),
    dynamic_shapes={
        "input_ids": {1: torch.export.Dim("seq_len", min=1, max=512)}
    },
)

# Convert to ExecuTorch edge format
edge_program = to_edge(traced)

# Delegate to hardware-specific backends by passing a partitioner to
# to_backend(). Partitioner classes live under executorch.backends;
# exact import paths and constructor arguments vary by release.
# For CPU (XNNPACK): edge_program = edge_program.to_backend(XnnpackPartitioner())
# For Qualcomm: edge_program = edge_program.to_backend(QnnPartitioner(compiler_specs))
# For Core ML: edge_program = edge_program.to_backend(CoreMLPartitioner())

# Export the final .pte file
et_program = edge_program.to_executorch()
with open("phi2_mobile.pte", "wb") as f:
    f.write(et_program.buffer)

print("Exported model size:", len(et_program.buffer) / 1e6, "MB")

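Exported model size: 1427.3 MB

Code Fragment 31.7.8: A simplified ExecuTorch export: load Phi-2, capture its graph with torch.export, lower it with to_edge, and write a .pte file. The size shown is illustrative. Float16 weights for a 2.7B-parameter model occupy roughly 5.4 GB, so a file of about 1.4 GB implies 4-bit weight quantization before export, a step this simplified script omits.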

ExecuTorch's primary advantage over llama.cpp for mobile deployment is its integration with the PyTorch ecosystem. If your model is already in PyTorch (as most research models are), ExecuTorch provides a direct export path that stays within the PyTorch toolchain, with no intermediate conversion to a third-party format such as GGUF. It also supports hardware-specific optimizations through a delegate system, where computation-heavy operations are offloaded to specialized accelerators (NPUs, DSPs) that llama.cpp generally does not target.

Choosing Your Edge Runtime

The choice between llama.cpp, Ollama, MLX, and ExecuTorch depends on your target platform and deployment constraints. llama.cpp is the universal choice: it runs everywhere and supports the widest range of models. Ollama wraps llama.cpp for developer convenience and is ideal for local development and prototyping. MLX is the best option for Apple Silicon Macs, offering superior performance through Metal optimization. ExecuTorch is the right choice when you need to deploy on mobile phones or IoT devices with hardware-specific acceleration. Many production systems use multiple runtimes: Ollama for developer machines, ExecuTorch for the mobile app, and a cloud API as the fallback.

6. Battery and Thermal Constraints on Mobile

Running LLM inference on a mobile device introduces constraints that do not exist in server environments. Battery drain is the most visible: sustained LLM inference can consume 3 to 5 watts on a modern smartphone, draining the battery by up to roughly 1% per minute of continuous generation, depending on battery capacity. Thermal throttling is equally important; most mobile SoCs reduce clock speeds after 30 to 60 seconds of sustained compute to prevent overheating, which degrades generation speed mid-response.

Practical mitigations include: (1) using speculative decoding with a tiny draft model to reduce the number of full-model forward passes, (2) capping generation length to prevent extended inference sessions, (3) batching requests when possible to amortize model loading overhead, (4) monitoring device temperature and gracefully degrading to shorter responses or cloud fallback when thermal limits approach, and (5) using the smallest model variant that meets quality requirements. The quality difference between a Q4_K_M and Q5_K_M quantization on a 3B model is often imperceptible to users, but the memory and power savings can extend battery life by 15 to 20%.
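Mitigations (2) and (4) can be combined into a small policy function that decides, per request, where to run and how many tokens to allow. This is a sketch only: the DeviceState fields, the temperature and battery thresholds, and the token caps are hypothetical values that would need tuning per device, and reading the actual temperature and battery level is platform-specific.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    battery_pct: float
    temperature_c: float
    on_network: bool


def plan_generation(
    state: DeviceState,
    requested_tokens: int,
    *,
    hot_c: float = 40.0,        # assumed "getting warm" threshold
    critical_c: float = 45.0,   # assumed "about to throttle" threshold
    low_battery_pct: float = 20.0,
) -> tuple[str, int]:
    """Return (target, max_tokens) for the next request."""
    constrained = (
        state.temperature_c >= critical_c or state.battery_pct < low_battery_pct
    )
    if constrained and state.on_network:
        # Hand the work to the cloud rather than stress the device further.
        return "cloud", requested_tokens
    if constrained or state.temperature_c >= hot_c:
        # Stay local but keep the response short.
        return "device", min(requested_tokens, 64)
    return "device", min(requested_tokens, 256)


print(plan_generation(DeviceState(80, 30, True), 500))
print(plan_generation(DeviceState(80, 42, True), 500))
print(plan_generation(DeviceState(10, 30, False), 500))
```

Keeping the policy a pure function of the device state makes it easy to unit test without real hardware.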

Lab: Quantization Quality vs. Latency Benchmark

Step 1: Set Up the Environment

This snippet verifies that Ollama is installed and pulls two quantization levels of the same model for the benchmark.

# Ensure Ollama is installed and running
ollama --version

# Pull two quantization levels of the same model
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull llama3.2:3b-instruct-q8_0

Sample output from the benchmark script in Step 2 (illustrative; results vary by hardware):

Benchmarking llama3.2:3b-instruct-q4_K_M...
 Average: 42.3 tokens/sec

Benchmarking llama3.2:3b-instruct-q8_0...
 Average: 31.7 tokens/sec

Code Fragment 31.7.9: Verifying the Ollama installation and pulling the Q4_K_M and Q8_0 variants of Llama 3.2 3B Instruct.

Step 2: Run the Benchmark

This snippet executes the benchmark suite and collects the results.

"""Benchmark two quantization levels on the same prompts."""

import time
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPTS = [
    "Explain the concept of quantization in neural networks in 3 sentences.",
    "Write a Python function that computes the Fibonacci sequence iteratively.",
    "Summarize the key differences between TCP and UDP protocols.",
    "What are the main causes of the French Revolution? List 5 factors.",
    "Translate this to formal English: 'gonna grab some food brb'",
    "Write a SQL query to find the top 10 customers by total order value.",
    "Explain photosynthesis to a 10-year-old in simple terms.",
    "What are three common logical fallacies? Give an example of each.",
    "Write a bash one-liner to count the number of .py files recursively.",
    "Compare and contrast microservices and monolithic architectures.",
]

MODELS = ["llama3.2:3b-instruct-q4_K_M", "llama3.2:3b-instruct-q8_0"]


def benchmark_model(model_name: str, prompts: list[str]) -> dict:
    # Warm-up request so one-time model loading is not counted
    # against the first prompt
    client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Hi"}],
        max_tokens=1,
    )

    results = []
    for prompt in prompts:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0.0,  # Deterministic for comparison
        )
        elapsed = time.perf_counter() - start
        output = response.choices[0].message.content
        token_count = response.usage.completion_tokens

        results.append({
            "prompt": prompt[:60],
            "output": output,
            "tokens": token_count,
            "time_s": round(elapsed, 2),
            # End-to-end rate: elapsed time includes prompt processing
            "tok_per_s": round(token_count / elapsed, 1),
        })

    avg_tps = sum(r["tok_per_s"] for r in results) / len(results)
    return {
        "model": model_name,
        "avg_tokens_per_sec": avg_tps,
        "results": results,
    }


# Run benchmarks
for model in MODELS:
    print(f"\nBenchmarking {model}...")
    report = benchmark_model(model, PROMPTS)
    print(f" Average: {report['avg_tokens_per_sec']:.1f} tokens/sec")
    with open(f"benchmark_{model.replace(':', '_')}.json", "w") as f:
        json.dump(report, f, indent=2)

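Code Fragment 31.7.10: The benchmark_model function sends each prompt to a model through Ollama's OpenAI-compatible endpoint, records completion tokens and wall-clock time, and writes a per-model JSON report.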

Step 3: Evaluate Quality

Compare the outputs from both quantization levels side by side. For each of the 10 prompts, rate the Q4_K_M output on a 1 to 5 scale relative to the Q8_0 output: 5 means identical quality, 4 means minor differences that do not affect usefulness, 3 means noticeable degradation, 2 means significant quality loss, and 1 means the output is unusable. Compute the average quality score and report it alongside the latency numbers. Typical results for a 3B model show average quality scores of 4.2 to 4.7, confirming that Q4_K_M is viable for most applications.
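Once the ratings are collected, the summary can be computed in a few lines. In the sketch below the ratings list is a placeholder to be replaced with your own scores, the throughput figures are the illustrative ones from the sample benchmark output, and the summarize function and its field names are hypothetical.

```python
from statistics import mean


def summarize(ratings: list[int], tps_q4: float, tps_q8: float) -> dict:
    """Combine 1-to-5 quality ratings with measured throughput."""
    if any(r not in range(1, 6) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return {
        "avg_quality": round(mean(ratings), 2),
        "share_rated_4_plus": sum(r >= 4 for r in ratings) / len(ratings),
        "q4_speedup_over_q8": round(tps_q4 / tps_q8, 2),
    }


# Placeholder ratings for the 10 prompts; substitute your own judgments.
ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
summary = summarize(ratings, tps_q4=42.3, tps_q8=31.7)
print(summary)
```

Reporting the share of prompts rated 4 or higher alongside the mean helps expose cases where a few badly degraded answers hide behind a good average.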

Exercise 31.7.1: Three-Way Quantization Comparison Project

Extend the lab benchmark to include Q5_K_M as a third quantization level. Plot tokens/sec vs. quality score for all three levels. Is Q5_K_M the best compromise, or does Q4_K_M offer sufficient quality at meaningfully better speed?

Answer Sketch

Q5_K_M typically sits midway between Q4_K_M and Q8_0 on both metrics. The quality difference between Q4_K_M and Q5_K_M is usually small (0.1 to 0.3 points on the 5-point scale), while the speed difference is also modest (5 to 15% faster for Q4_K_M). For applications where every millisecond matters (autocomplete, real-time suggestions), Q4_K_M is preferred. For applications where quality is paramount but memory is limited (medical reference, legal document review), Q5_K_M provides a better balance.

Exercise 31.7.2: Tiered Deployment Architecture Design

Design a tiered inference system for a mobile application that uses a Q4_K_M model on-device for simple queries and routes complex queries to a cloud API. Define the routing criteria, implement a complexity classifier, and measure the cost savings compared to sending all queries to the cloud.

Answer Sketch

Route queries to the on-device model when: (1) the query is under 50 tokens, (2) the task is classification, extraction, or short-form generation, and (3) the device battery is above 20%. Route to the cloud when: the query requires multi-step reasoning, long-form generation (over 500 tokens), or references context beyond the on-device model's knowledge. A simple keyword/length classifier can achieve 85%+ routing accuracy. At 70% on-device routing, total API costs drop by approximately 70%, with user-perceived quality dropping less than 5% (measured by blind comparison).
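The routing criteria above could be prototyped as a simple rule-based classifier. This is a sketch under loose assumptions: whitespace splitting stands in for a real tokenizer, the keyword list is a hypothetical signal for multi-step reasoning, and the caller is assumed to supply an estimate of the expected output length.

```python
# Hypothetical phrases suggesting multi-step reasoning; tune these for your app.
REASONING_HINTS = ("step by step", "prove", "compare", "analyze", "plan")


def route(query: str, expected_output_tokens: int, battery_pct: float) -> str:
    """Return 'device' or 'cloud' using the answer-sketch criteria."""
    approx_tokens = len(query.split())  # crude whitespace proxy for token count
    needs_reasoning = any(hint in query.lower() for hint in REASONING_HINTS)
    on_device_ok = (
        approx_tokens < 50
        and expected_output_tokens <= 500
        and not needs_reasoning
        and battery_pct > 20
    )
    return "device" if on_device_ok else "cloud"


def on_device_share(decisions: list[str]) -> float:
    """Fraction of requests that stayed on the device."""
    return decisions.count("device") / len(decisions)


# (query, expected output tokens, battery percent)
sample = [
    ("Classify this review as positive or negative: great phone", 5, 80),
    ("Compare three database designs and plan a migration step by step", 800, 80),
    ("Extract the date from: meeting on 3 May", 10, 15),
]
decisions = [route(q, n, b) for q, n, b in sample]
print(decisions, on_device_share(decisions))
```

Running the router over a log of real queries gives the on-device share, which feeds directly into the cost estimate. Substring matching is deliberately naive and will misfire on words that merely contain a hint, so a small learned classifier is a natural next step.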

Exercise 31.7.3: MLX vs. llama.cpp on Apple Silicon Project

If you have access to an Apple Silicon Mac, benchmark the same model using both MLX and llama.cpp. Compare tokens/sec, time to first token, memory usage, and power consumption (using the powermetrics tool). Which runtime is faster for your hardware?

Answer Sketch

On M1/M2 Macs, MLX typically outperforms llama.cpp by 10 to 30% in tokens/sec for 4-bit quantized models, with the gap widening for larger models that benefit more from Metal's unified memory access patterns. llama.cpp may show better time-to-first-token for small models due to lower initialization overhead. Power consumption is similar because both frameworks saturate the same GPU cores; the difference in tokens/sec means MLX completes the same work using less total energy per response.
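The energy argument at the end of this sketch is simple arithmetic: energy is average power multiplied by generation time, and generation time is tokens divided by tokens per second. The numbers below are hypothetical placeholders; substitute the average power reported by powermetrics and your own measured throughput.

```python
def energy_per_response_j(
    avg_power_w: float, tokens: int, tokens_per_s: float
) -> float:
    """Energy in joules to generate a response at a steady power draw."""
    generation_time_s = tokens / tokens_per_s
    return avg_power_w * generation_time_s


# Hypothetical measurements: equal power draw, one runtime 25% faster.
slower = energy_per_response_j(avg_power_w=20.0, tokens=300, tokens_per_s=60.0)
faster = energy_per_response_j(avg_power_w=20.0, tokens=300, tokens_per_s=75.0)

print(f"Slower runtime: {slower:.0f} J per response")
print(f"Faster runtime: {faster:.0f} J per response")
print(f"Energy saved:   {1 - faster / slower:.0%}")
```

At equal power, energy per response scales inversely with throughput, so comparing joules per response is more informative than comparing watts.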

Self-Check Questions

What is the key architectural advantage of llama.cpp that makes it run on such a wide range of hardware (CPUs, GPUs, mobile devices)?
Ollama exposes an OpenAI-compatible API. Why is API compatibility important for edge deployment, and how does it simplify the transition from cloud to local inference?
MLX exploits Apple Silicon's unified memory architecture. Explain why this gives MLX a performance advantage over llama.cpp on Mac hardware for large models.
In the quantization benchmark lab, you compared Q4_K_M and Q8_0 variants. What quality/latency trade-offs would you expect, and how would you decide which to deploy in production?

Key Takeaways

Edge deployment brings LLM inference to settings where cloud connectivity is unreliable, latency budgets are tight, or data must remain on the device.
llama.cpp provides universal C/C++ inference with GGUF quantization, running models from laptops to Raspberry Pi devices.
Ollama wraps llama.cpp in a developer-friendly interface with model management, an OpenAI-compatible API, and one-command setup.
MLX delivers optimized inference on Apple Silicon, leveraging the unified memory architecture for efficient model loading.
ExecuTorch extends PyTorch to mobile and edge devices with ahead-of-time compilation and hardware-specific delegates.
Battery and thermal constraints on mobile devices require adaptive inference strategies that reduce model activity when resources are scarce.

What Comes Next

With edge deployment patterns established, the next part of the book addresses Safety & Strategy, beginning with Chapter 32: Safety, Ethics, and Regulation. Privacy and data sovereignty requirements for on-device models connect directly to the regulatory frameworks covered there.

References & Further Reading

Inference Runtimes

Gerganov, G. (2023). "llama.cpp: LLM Inference in C/C++." github.com/ggerganov/llama.cpp

The foundational project for running LLMs on consumer hardware. Introduced the GGUF format and supports dozens of model architectures across all major platforms. The most widely used local inference engine.


Ollama. (2024). "Ollama: Get Up and Running with Large Language Models." ollama.com

Developer-friendly wrapper around llama.cpp providing Docker-like model management, automatic GPU detection, and an OpenAI-compatible API. The easiest path to running models locally.


Apple Machine Learning Research. (2024). "MLX: An Array Framework for Apple Silicon." github.com/ml-explore/mlx

Apple's machine learning framework optimized for unified memory on Apple Silicon. Provides NumPy-like APIs with automatic differentiation and lazy evaluation, with specialized support for transformer inference.


Meta. (2024). "ExecuTorch: End-to-End Solution for Enabling On-Device AI." pytorch.org/executorch

Meta's production runtime for deploying PyTorch models on mobile and edge devices. Supports hardware-specific delegates for Qualcomm, Apple, and MediaTek accelerators.


Quantization

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models." arXiv:2305.14314

Introduces 4-bit NormalFloat quantization and demonstrates that quantized models can be fine-tuned with LoRA adapters. Foundational work for combining quantization with adaptation on consumer hardware.


Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." arXiv:2210.17323

Presents the GPTQ algorithm for one-shot weight quantization using approximate second-order information. One of the two dominant quantization methods (alongside AWQ) for creating deployment-ready quantized models.


Lin, J., Tang, J., Tang, H., et al. (2024). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978

Introduces activation-aware quantization that preserves important weights based on activation magnitudes. Achieves better quality than uniform quantization at the same bit width, particularly at 4-bit levels.
