This section is a practical companion to the quantization theory in Chapter 9. It focuses on the serving-specific workflow: how to load pre-quantized models in vLLM, TGI, and SGLang; how to convert models to GGUF for llama.cpp; and how to choose the right format for your deployment target.
For the mathematics of quantization (absmax, zero-point, per-group schemes), the algorithms behind GPTQ, AWQ, and bitsandbytes, calibration strategies, quality degradation analysis, and hands-on quantization labs, see Section 9.1: Model Quantization. This section assumes you have already quantized your model (or downloaded a pre-quantized checkpoint) and focuses on loading and serving it.
1. Serving GPTQ and AWQ Models
Both vLLM and TGI natively support GPTQ and AWQ checkpoints. Point the server at a quantized model and specify the quantization format.
# Serve a GPTQ model with vLLM
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
--quantization gptq \
--tensor-parallel-size 2 \
--max-model-len 4096
# Serve a GPTQ model with TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
-v /data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-70B-Chat-GPTQ \
--quantize gptq
AWQ checkpoints come in two kernel variants. The GEMM version is generally recommended for serving scenarios where batch sizes exceed 1; the GEMV version is optimized for single-request (batch size 1) latency. If your workload mixes both, GEMM is the safer default. See Section 9.1 for the full AWQ quantization procedure.
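Serving an AWQ checkpoint looks just like the GPTQ invocation above, with a different quantization flag. A sketch with vLLM (the model ID below is illustrative; substitute your own AWQ checkpoint):

```shell
# Serve an AWQ model with vLLM (model ID is illustrative)
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096
```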
2. GGUF and llama.cpp: CPU-Friendly Serving
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the popular C/C++ inference engine that runs LLMs on CPUs, Apple Silicon, and consumer GPUs. Unlike GPTQ and AWQ, which target NVIDIA GPUs and CUDA kernels, GGUF is designed for portability. It supports a wide range of quantization levels and can split computation between CPU and GPU.
GGUF supports quantization types ranging from Q2_K (roughly 2.5 bits per weight) to Q8_0 (8 bits per weight). In the naming convention, the number after Q is the nominal bit width, the _K suffix marks the newer "k-quant" scheme, and a trailing _S, _M, or _L indicates how many tensors are kept at higher precision. The most commonly used levels for serving are Q4_K_M and Q5_K_M, which balance quality and size.
| GGUF Type | Bits/Weight | Model Size (7B) | Quality Impact |
|---|---|---|---|
| Q2_K | ~2.5 | ~2.8 GB | Noticeable degradation; useful for experimentation only |
| Q3_K_M | ~3.5 | ~3.5 GB | Moderate degradation; acceptable for simple tasks |
| Q4_K_M | ~4.5 | ~4.4 GB | Minimal degradation; recommended for most use cases |
| Q5_K_M | ~5.5 | ~5.1 GB | Near-lossless; best quality-to-size ratio |
| Q6_K | ~6.5 | ~5.9 GB | Very close to FP16 quality |
| Q8_0 | 8.0 | ~7.2 GB | Virtually lossless |
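As a sanity check on the table, a first-order size estimate is simply parameters × bits-per-weight ÷ 8; actual GGUF files run somewhat larger because k-quants keep some tensors (such as embeddings and the output layer) at higher precision. A minimal sketch using awk for the floating-point arithmetic:

```shell
# First-order GGUF size estimate: params (billions) x bits/weight / 8
estimate_gguf_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gguf_gb 7 4.5   # Q4_K_M on a 7B model: prints 3.9 (GB, before overhead)
estimate_gguf_gb 7 8.0   # Q8_0: prints 7.0
```

The gap between these estimates and the table (4.4 GB and 7.2 GB) is the mixed-precision and metadata overhead.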
To convert a HuggingFace model to GGUF format, use the conversion script included with llama.cpp (it requires the Python packages listed in the repository's requirements.txt).
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build (with CUDA support for GPU offloading)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Convert a HuggingFace model to GGUF
python convert_hf_to_gguf.py \
/path/to/meta-llama/Llama-3.1-8B-Instruct \
--outfile llama-3.1-8b.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize llama-3.1-8b.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M
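Before wiring the quantized file into a server, it is worth a quick local smoke test with the llama.cpp CLI to confirm the model loads and generates sensible text:

```shell
# Smoke-test the quantized model: load it and generate a few tokens
./build/bin/llama-cli \
  -m llama-3.1-8b-Q4_K_M.gguf \
  -p "The capital of France is" \
  -n 16
```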
2.1 Running a GGUF Server
llama.cpp includes a built-in HTTP server with an OpenAI-compatible API. This makes it a viable serving option for environments without NVIDIA GPUs.
# Start the llama.cpp server with GPU offloading
./build/bin/llama-server \
-m llama-3.1-8b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 4096 \
--threads 8
The -ngl 99 flag offloads all layers to GPU. On systems without a GPU, omit this flag
and llama.cpp will run entirely on CPU, using SIMD optimizations (AVX2, AVX-512, ARM NEON) for
acceptable performance on modern processors.
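Because the server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal request with curl, assuming the server invocation above (host, port, and model name as configured):

```shell
# Query the llama.cpp server via its OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-Q4_K_M",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```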
3. bitsandbytes for Serving
For the bitsandbytes NF4 data type, double quantization, and complete code examples, see Section 9.1: Model Quantization (subsection 5).
TGI supports bitsandbytes quantization at startup with --quantize bitsandbytes-nf4,
quantizing on the fly without a pre-quantized checkpoint. This is convenient for prototyping but
produces 30% to 50% lower throughput than pre-quantized GPTQ or AWQ models due to the lack of
optimized CUDA kernels. For production serving, prefer GPTQ or AWQ.
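On-the-fly NF4 loading in TGI is the same docker invocation shown earlier with a different --quantize value; no pre-quantized checkpoint is needed (the model ID below is illustrative):

```shell
# Load an FP16 checkpoint and quantize to NF4 at startup (prototyping only)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --quantize bitsandbytes-nf4
```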
4. Comparison: Choosing the Right Format for Serving
The choice of quantization method depends on your deployment environment, performance requirements, and workflow. The following table provides a side-by-side comparison.
| Aspect | GPTQ | AWQ | GGUF | bitsandbytes |
|---|---|---|---|---|
| Precision | 4-bit, 8-bit | 4-bit | 2-bit to 8-bit | 4-bit, 8-bit |
| Calibration required | Yes (128+ samples) | Yes (small dataset) | No | No |
| Quantization speed | Slow (hours for 70B) | Moderate (30-60 min) | Fast (minutes) | Instant (on load) |
| Serving engines | vLLM, TGI, SGLang | vLLM, TGI, SGLang | llama.cpp, Ollama | TGI, HF Transformers |
| Hardware | NVIDIA GPU only | NVIDIA GPU only | CPU, Apple Silicon, GPU | NVIDIA GPU only |
| Throughput | High | High | Moderate (GPU), Low (CPU) | Moderate |
| Quality (4-bit) | Very good | Very good | Good (Q4_K_M) | Good |
| Best for | GPU serving at scale | GPU serving at scale | CPU/edge deployment | Prototyping, fine-tuning |
A common production pattern is to maintain two quantized versions of the same model: an AWQ version for your GPU-based serving cluster (using vLLM or TGI) and a Q4_K_M GGUF version for local developer testing (using llama.cpp or Ollama). The quality difference between these formats at 4-bit is typically less than 1% on standard benchmarks, ensuring consistent behavior between development and production.
Summary
Quantization is an essential technique for making LLM serving practical and affordable. GPTQ and AWQ provide the highest throughput on NVIDIA GPUs through optimized CUDA kernels, making them ideal for production serving with vLLM and TGI. GGUF and llama.cpp open the door to CPU and edge deployment with remarkable efficiency. bitsandbytes offers the lowest barrier to entry for experimentation. In the next section, we examine how to scale these serving solutions horizontally and balance load across multiple GPU instances.