Different workloads stress different GPU characteristics. The following guide maps common LLM tasks to recommended hardware configurations.
## Inference (Serving a Model)
Inference is typically memory-bandwidth-bound. The priority is: (1) enough VRAM to hold the model weights plus the KV cache, and (2) high memory bandwidth for fast token generation.
| Model Size | Precision | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | 4-bit (GPTQ/AWQ) | ~6 GB | Any consumer GPU (RTX 3060 12GB+), L40S |
| 7B-8B | BF16 | ~16 GB | RTX 4090, L40S, A100 |
| 13B-14B | 4-bit | ~10 GB | RTX 4080+, L40S |
| 70B | 4-bit | ~40 GB | 1x A100 80GB, 1x H100, 1x MI300X |
| 70B | BF16 | ~140 GB | 2x A100, 1x H200, 1x MI300X |
| 405B (Llama 3.1) | FP8 | ~405 GB | 8x H100, 4x H200, 3x MI300X |
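As a back-of-envelope sketch of the "VRAM Needed" column, total memory is roughly the weights plus the KV cache. The byte counts and KV-cache formula below are standard rule-of-thumb assumptions, and the Llama-3-8B-like shape is illustrative, not an official spec:

```python
# Rough inference VRAM estimate: quantized weights + KV cache.
# Assumed KV-cache formula: 2 (K and V) * layers * kv_heads * head_dim
#   * bytes_per_value * context_len * batch_size

def weights_gb(params_b: float, bits_per_param: float) -> float:
    """Model weights in GB for a given quantization width."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int = 1,
                bytes_per_val: int = 2) -> float:
    """KV cache in GB (BF16 values by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * context_len * batch / 1e9

# Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head dim 128
w = weights_gb(8, 4)                # 4-bit weights -> ~4 GB
kv = kv_cache_gb(32, 8, 128, 8192)  # 8k context -> ~1 GB
print(f"weights ~{w:.1f} GB, kv ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```

Note that grouped-query attention (few `kv_heads`) is what keeps the KV cache small here; older multi-head models need substantially more cache per token of context.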
## Fine-Tuning (LoRA / QLoRA)
Fine-tuning requires enough memory for the model weights, gradients, optimizer states, and activations. LoRA freezes the base model and trains only small low-rank adapter matrices, so gradients and optimizer states are needed only for the adapters; QLoRA reduces memory further by also quantizing the frozen base model to 4-bit.
| Model Size | Method | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | QLoRA (4-bit) | ~10 GB | RTX 3060 12GB, RTX 4070 |
| 7B-8B | Full fine-tune (BF16) | ~60 GB (with 8-bit optimizer) | 1x A100 80GB, 1x H100 |
| 70B | QLoRA (4-bit) | ~48 GB | 1x A100 80GB, 1x H100, 2x RTX 4090 (tight) |
| 70B | Full fine-tune (BF16) | ~500 GB | 8x A100 80GB, 4x H200 |
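The gap between the full-fine-tune and QLoRA rows follows from a per-parameter accounting. The sketch below uses the usual rule-of-thumb byte counts (BF16 weights and gradients, FP32 master weights plus two FP32 Adam moments), excludes activation memory, and is an assumption rather than a measurement; practical full-fine-tune setups often land lower via 8-bit optimizers and gradient checkpointing:

```python
# Rough per-parameter memory accounting for fine-tuning (activations excluded).
# Assumed defaults for mixed-precision AdamW: 2 B (BF16 weights) + 2 B (grads)
#   + 12 B (FP32 master weights + two FP32 Adam moments) = 16 bytes/param.

def finetune_gb(params_b: float, weight_bytes: float = 2,
                grad_bytes: float = 2, optim_bytes: float = 12) -> float:
    """Estimated training memory in GB for params_b billion parameters."""
    return params_b * (weight_bytes + grad_bytes + optim_bytes)

print(finetune_gb(8))                         # naive AdamW: 128.0 GB
print(finetune_gb(8, optim_bytes=4 + 1 + 1))  # 8-bit Adam moments: 80.0 GB
# QLoRA: frozen 4-bit base (0.5 B/param) with no gradients or optimizer
# states for it; the small LoRA adapter adds comparatively little on top.
print(finetune_gb(70, weight_bytes=0.5, grad_bytes=0, optim_bytes=0))  # 35.0 GB
```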
## Pretraining from Scratch
Pretraining is compute-bound: raw TFLOPS and multi-GPU scaling efficiency matter most. Large-scale pretraining also requires massive storage throughput and reliable networking.
| Model Size | Training Tokens | Approx. GPU-Hours (H100) | Hardware Recommendation |
|---|---|---|---|
| 1B | 100B tokens | ~500 H100-hours | 8x H100 node (2-3 days) |
| 7B | 1T tokens | ~30,000 H100-hours | 64-128x H100 cluster (1-3 weeks) |
| 70B | 15T tokens | ~5,000,000 H100-hours | 2,000+ H100 cluster (months) |
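Estimates of this kind can be reproduced with the standard approximation FLOPs ≈ 6 × N × D (N parameters, D training tokens). The H100 peak throughput and the 40% MFU below are illustrative assumptions, so real figures vary by a factor of two or more with precision, interconnect, and utilization:

```python
# Pretraining cost sketch using the 6*N*D FLOPs approximation.

def gpu_hours(params_b: float, tokens_t: float,
              peak_tflops: float = 989, mfu: float = 0.40) -> float:
    """GPU-hours for params_b billion parameters over tokens_t trillion tokens.
    peak_tflops defaults to an H100's dense BF16 peak; an MFU (model FLOPs
    utilization) of ~40% is typical of a well-tuned large-scale run."""
    flops = 6 * (params_b * 1e9) * (tokens_t * 1e12)
    return flops / (peak_tflops * 1e12 * mfu * 3600)

print(f"{gpu_hours(1, 0.1):,.0f}")   # 1B model, 100B tokens: a few hundred hours
print(f"{gpu_hours(7, 1):,.0f}")     # 7B model, 1T tokens: tens of thousands
print(f"{gpu_hours(70, 15):,.0f}")   # 70B model, 15T tokens: millions
```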
For most practitioners, VRAM is the primary constraint. If a model does not fit in GPU memory, quantization (see Section 9.1) is often a better first step than buying a larger GPU.