Building Conversational AI with LLMs and Agents
Appendix S: Inference Serving: vLLM, TGI, and SGLang

Quantization for Serving: GPTQ, AWQ, and GGUF

Big Picture

This section is a practical companion to the quantization theory in Chapter 9. It focuses on the serving-specific workflow: how to load pre-quantized models in vLLM, TGI, and SGLang; how to convert models to GGUF for llama.cpp; and how to choose the right format for your deployment target.

Covered in Detail

For the mathematics of quantization (absmax, zero-point, per-group schemes), the algorithms behind GPTQ, AWQ, and bitsandbytes, calibration strategies, quality degradation analysis, and hands-on quantization labs, see Section 9.1: Model Quantization. This section assumes you have already quantized your model (or downloaded a pre-quantized checkpoint) and focuses on loading and serving it.

1. Serving GPTQ and AWQ Models

Both vLLM and TGI natively support GPTQ and AWQ models. Point the server at a pre-quantized checkpoint and specify the quantization format.

# Serve a GPTQ model with vLLM
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --max-model-len 4096

# Serve a GPTQ model with TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantize gptq
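Once either server is running, it can be queried through an OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server listens on localhost:8000 (vLLM's default; adjust the base URL for the TGI example above) and exposes /v1/chat/completions:

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=128, temperature=0.7):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def query_server(base_url, payload):
    """POST the payload to the server's chat completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


payload = build_chat_request("TheBloke/Llama-2-70B-Chat-GPTQ", "Hello!")
# query_server("http://localhost:8000", payload)  # requires a running server
```

The same client works unchanged against any of the engines in this section that expose the OpenAI-compatible API, which is what makes format swaps between GPTQ, AWQ, and GGUF backends transparent to application code.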
Tip

AWQ with the GEMM kernel version is generally recommended for serving scenarios where batch sizes exceed 1. The GEMV version is optimized for single-request (batch size 1) latency. If your workload mixes both, GEMM is the safer default. See Section 9.1 for the full AWQ quantization procedure.
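Serving an AWQ checkpoint looks nearly identical to the GPTQ example; only the quantization flag and the checkpoint change. A sketch, assuming a pre-quantized AWQ repository such as TheBloke/Llama-2-70B-Chat-AWQ (the repository name is illustrative):

```shell
# Serve an AWQ model with vLLM
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 4096
```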

2. GGUF and llama.cpp: CPU-Friendly Serving

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the popular C/C++ inference engine that runs LLMs on CPUs, Apple Silicon, and consumer GPUs. Unlike GPTQ and AWQ, which target NVIDIA GPUs and CUDA kernels, GGUF is designed for portability. It supports a wide range of quantization levels and can split computation between CPU and GPU.

GGUF supports quantization types ranging from Q2_K (roughly 2.5 bits per weight) to Q8_0 (8 bits per weight). The naming convention indicates the bit width and the quantization scheme. The most commonly used levels for serving are Q4_K_M and Q5_K_M, which balance quality and size.

| GGUF Type | Bits/Weight | Model Size (7B) | Quality Impact |
|-----------|-------------|-----------------|----------------|
| Q2_K | ~2.5 | ~2.8 GB | Noticeable degradation; useful for experimentation only |
| Q3_K_M | ~3.5 | ~3.5 GB | Moderate degradation; acceptable for simple tasks |
| Q4_K_M | ~4.5 | ~4.4 GB | Minimal degradation; recommended for most use cases |
| Q5_K_M | ~5.5 | ~5.1 GB | Near-lossless; best quality-to-size ratio |
| Q6_K | ~6.5 | ~5.9 GB | Very close to FP16 quality |
| Q8_0 | 8.0 | ~7.2 GB | Virtually lossless |
Figure S.4.1: GGUF quantization levels for a 7B parameter model. Q4_K_M offers the best balance of size and quality for most serving scenarios.
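The sizes in the table above can be roughly reproduced from bits per weight alone. A back-of-the-envelope sketch (ignoring GGUF metadata and tensors kept at higher precision, which is why real files run slightly larger than this estimate):

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size: parameter count times bits, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7B model at three common quantization levels
for name, bpw in [("Q2_K", 2.5), ("Q4_K_M", 4.5), ("Q8_0", 8.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB")
```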

To convert a HuggingFace model to GGUF format, use the conversion script included with llama.cpp.

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (with CUDA support for GPU offloading)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Convert a HuggingFace model to GGUF
python convert_hf_to_gguf.py \
    /path/to/meta-llama/Llama-3.1-8B-Instruct \
    --outfile llama-3.1-8b.gguf \
    --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize llama-3.1-8b.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M

2.1 Running a GGUF Server

llama.cpp includes a built-in HTTP server with an OpenAI-compatible API. This makes it a viable serving option for environments without NVIDIA GPUs.

# Start the llama.cpp server with GPU offloading
./build/bin/llama-server \
    -m llama-3.1-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    --threads 8

The -ngl 99 flag offloads all layers to GPU. On systems without a GPU, omit this flag and llama.cpp will run entirely on CPU, using SIMD optimizations (AVX2, AVX-512, ARM NEON) for acceptable performance on modern processors.
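When VRAM is limited, -ngl can be set below the full layer count to offload only part of the model. A hypothetical helper for picking that value, assuming a uniform per-layer size (real layers vary, so treat the result as a starting point and tune from there):

```python
def layers_to_offload(model_size_gb: float, n_layers: int,
                      vram_budget_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equal layer sizes."""
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))


# A ~4.4 GB Q4_K_M model with 32 layers and 3 GB of free VRAM
print(layers_to_offload(4.4, 32, 3.0))  # pass this value to -ngl
```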

3. bitsandbytes for Serving

Covered in Detail

For the bitsandbytes NF4 data type, double quantization, and complete code examples, see Section 9.1: Model Quantization (subsection 5).

TGI supports bitsandbytes quantization at startup with --quantize bitsandbytes-nf4, quantizing on the fly without a pre-quantized checkpoint. This is convenient for prototyping but produces 30% to 50% lower throughput than pre-quantized GPTQ or AWQ models due to the lack of optimized CUDA kernels. For production serving, prefer GPTQ or AWQ.
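The invocation mirrors the GPTQ example from earlier, swapping only the quantize flag; no pre-quantized checkpoint is needed because TGI quantizes the FP16 weights as it loads them. A sketch (the model id is illustrative):

```shell
# On-the-fly NF4 quantization with TGI -- convenient, but lower throughput
# than a pre-quantized GPTQ or AWQ checkpoint
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --quantize bitsandbytes-nf4
```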

4. Comparison: Choosing the Right Format for Serving

The choice of quantization method depends on your deployment environment, performance requirements, and workflow. The following table provides a side-by-side comparison.

Quantization Methods for LLM Serving
| Aspect | GPTQ | AWQ | GGUF | bitsandbytes |
|--------|------|-----|------|--------------|
| Precision | 4-bit, 8-bit | 4-bit | 2-bit to 8-bit | 4-bit, 8-bit |
| Calibration required | Yes (128+ samples) | Yes (small dataset) | No | No |
| Quantization speed | Slow (hours for 70B) | Moderate (30-60 min) | Fast (minutes) | Instant (on load) |
| Serving engines | vLLM, TGI, SGLang | vLLM, TGI, SGLang | llama.cpp, Ollama | TGI, HF Transformers |
| Hardware | NVIDIA GPU only | NVIDIA GPU only | CPU, Apple Silicon, GPU | NVIDIA GPU only |
| Throughput | High | High | Moderate (GPU), Low (CPU) | Moderate |
| Quality (4-bit) | Very good | Very good | Good (Q4_K_M) | Good |
| Best for | GPU serving at scale | GPU serving at scale | CPU/edge deployment | Prototyping, fine-tuning |
Practical Example

A common production pattern is to maintain two quantized versions of the same model: an AWQ version for your GPU-based serving cluster (using vLLM or TGI) and a Q4_K_M GGUF version for local developer testing (using llama.cpp or Ollama). The quality difference between these formats at 4-bit is typically less than 1% on standard benchmarks, ensuring consistent behavior between development and production.

Summary

Quantization is an essential technique for making LLM serving practical and affordable. GPTQ and AWQ provide the highest throughput on NVIDIA GPUs through optimized CUDA kernels, making them ideal for production serving with vLLM and TGI. GGUF and llama.cpp open the door to CPU and edge deployment with remarkable efficiency. bitsandbytes offers the lowest barrier to entry for experimentation. In the next section, we examine how to scale these serving solutions horizontally and balance load across multiple GPU instances.