Different workloads stress different GPU characteristics. The following guide maps common LLM tasks to recommended hardware configurations.
## Inference (Serving a Model)
Inference is typically memory-bandwidth-bound. The priority is: (1) enough VRAM to hold the model weights plus the KV cache, and (2) high memory bandwidth for fast token generation.
| Model Size | Precision | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | 4-bit (GPTQ/AWQ) | ~6 GB | Any consumer GPU (RTX 3060 12GB+), L40S |
| 7B-8B | BF16 | ~16 GB | RTX 4090, L40S, A100 |
| 13B-14B | 4-bit | ~10 GB | RTX 4080+, L40S |
| 70B | 4-bit | ~40 GB | 1x A100 80GB, 1x H100, 1x MI300X |
| 70B | BF16 | ~140 GB | 2x A100, 1x H200, 1x MI300X |
| 405B (Llama 3.1) | FP8 | ~405 GB | 8x H100, 4x H200, 3x MI300X |
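As a back-of-envelope sketch of the "VRAM Needed" column, total memory is roughly the weights plus the KV cache. The byte counts and KV-cache formula below are standard rule-of-thumb assumptions, and the Llama-3-8B-like shape is illustrative, not an official spec:

```python
# Rough inference VRAM estimate: quantized weights + KV cache.
# Assumed KV-cache formula: 2 (K and V) * layers * kv_heads * head_dim
#   * bytes_per_value * context_len * batch_size

def weights_gb(params_b: float, bits_per_param: float) -> float:
    """Model weights in GB for a given quantization width."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int = 1,
                bytes_per_val: int = 2) -> float:
    """KV cache in GB (BF16 values by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * context_len * batch / 1e9

# Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head dim 128
w = weights_gb(8, 4)                # 4-bit weights -> ~4 GB
kv = kv_cache_gb(32, 8, 128, 8192)  # 8k context -> ~1 GB
print(f"weights ~{w:.1f} GB, kv ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```

Note that grouped-query attention (few `kv_heads`) is what keeps the KV cache small here; older multi-head models need substantially more cache per token of context.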
## Fine-Tuning (LoRA / QLoRA)
Fine-tuning requires enough memory for the model weights, gradients, optimizer states, and activations. LoRA freezes the base model and trains only small low-rank adapter matrices, so gradients and optimizer states are needed only for the adapters; QLoRA reduces memory further by also quantizing the frozen base model to 4-bit.
| Model Size | Method | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | QLoRA (4-bit) | ~10 GB | RTX 3060 12GB, RTX 4070 |
| 7B-8B | Full fine-tune (BF16) | ~60 GB (with 8-bit optimizer) | 1x A100 80GB, 1x H100 |
| 70B | QLoRA (4-bit) | ~48 GB | 1x A100 80GB, 1x H100, 2x RTX 4090 (tight) |
| 70B | Full fine-tune (BF16) | ~500 GB | 8x A100 80GB, 4x H200 |
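The gap between the full-fine-tune and QLoRA rows follows from a per-parameter accounting. The sketch below uses the usual rule-of-thumb byte counts (BF16 weights and gradients, FP32 master weights plus two FP32 Adam moments), excludes activation memory, and is an assumption rather than a measurement; practical full-fine-tune setups often land lower via 8-bit optimizers and gradient checkpointing:

```python
# Rough per-parameter memory accounting for fine-tuning (activations excluded).
# Assumed defaults for mixed-precision AdamW: 2 B (BF16 weights) + 2 B (grads)
#   + 12 B (FP32 master weights + two FP32 Adam moments) = 16 bytes/param.

def finetune_gb(params_b: float, weight_bytes: float = 2,
                grad_bytes: float = 2, optim_bytes: float = 12) -> float:
    """Estimated training memory in GB for params_b billion parameters."""
    return params_b * (weight_bytes + grad_bytes + optim_bytes)

print(finetune_gb(8))                         # naive AdamW: 128.0 GB
print(finetune_gb(8, optim_bytes=4 + 1 + 1))  # 8-bit Adam moments: 80.0 GB
# QLoRA: frozen 4-bit base (0.5 B/param) with no gradients or optimizer
# states for it; the small LoRA adapter adds comparatively little on top.
print(finetune_gb(70, weight_bytes=0.5, grad_bytes=0, optim_bytes=0))  # 35.0 GB
```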
## Pretraining from Scratch
Pretraining is compute-bound: raw TFLOPS and multi-GPU scaling efficiency matter most. Large-scale pretraining also requires massive storage throughput and reliable networking.
| Model Size | Training Tokens | Approx. GPU-Hours (H100) | Hardware Recommendation |
|---|---|---|---|
| 1B | 100B tokens | ~500 H100-hours | 8x H100 node (2-3 days) |
| 7B | 1T tokens | ~30,000 H100-hours | 64-128x H100 cluster (1-3 weeks) |
| 70B | 15T tokens | ~5,000,000 H100-hours | 2,000+ H100 cluster (months) |
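Estimates of this kind can be reproduced with the standard approximation FLOPs ≈ 6 × N × D (N parameters, D training tokens). The H100 peak throughput and the 40% MFU below are illustrative assumptions, so real figures vary by a factor of two or more with precision, interconnect, and utilization:

```python
# Pretraining cost sketch using the 6*N*D FLOPs approximation.

def gpu_hours(params_b: float, tokens_t: float,
              peak_tflops: float = 989, mfu: float = 0.40) -> float:
    """GPU-hours for params_b billion parameters over tokens_t trillion tokens.
    peak_tflops defaults to an H100's dense BF16 peak; an MFU (model FLOPs
    utilization) of ~40% is typical of a well-tuned large-scale run."""
    flops = 6 * (params_b * 1e9) * (tokens_t * 1e12)
    return flops / (peak_tflops * 1e12 * mfu * 3600)

print(f"{gpu_hours(1, 0.1):,.0f}")   # 1B model, 100B tokens: a few hundred hours
print(f"{gpu_hours(7, 1):,.0f}")     # 7B model, 1T tokens: tens of thousands
print(f"{gpu_hours(70, 15):,.0f}")   # 70B model, 15T tokens: millions
```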
For most practitioners, VRAM is the primary constraint. If a model does not fit in GPU memory, quantization (see Section 9.1) is often a better first step than buying a larger GPU.