Appendices
Appendix G: GPU Hardware and Cloud Compute

Choosing the Right GPU for Your Task

Different workloads stress different GPU characteristics. The following guide maps common LLM tasks to recommended hardware configurations.

Inference (Serving a Model)

Inference is typically memory-bandwidth-bound. The priorities are (1) enough VRAM to hold the model weights plus the KV cache, and (2) high memory bandwidth for fast token generation.
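As a rough check on the VRAM figures below, here is a minimal sizing sketch for weights plus KV cache and the bandwidth-limited decode ceiling. The 8B model shape used (32 layers, 8 KV heads, head dimension 128) and the ~1,008 GB/s RTX 4090 bandwidth figure are illustrative assumptions, not values taken from this appendix.

```python
# Back-of-the-envelope sizing for single-GPU inference (illustrative only;
# real runtimes add a few GB of framework and fragmentation overhead).

def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache: one K and one V vector per layer, per KV head, per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem) / 1e9

# Example: an 8B Llama-style model (assumed shape: 32 layers, 8 KV heads,
# head dim 128) served in BF16 with an 8K context.
weights = weight_gb(8e9, 2.0)                                  # ~16 GB
kv = kv_cache_gb(32, 8, 128, context_len=8192, batch_size=1)   # ~1 GB
print(f"weights ≈ {weights:.1f} GB, KV cache ≈ {kv:.1f} GB")

# Decode speed ceiling: each generated token re-reads every weight once,
# so tokens/s <= memory bandwidth / weight bytes (at batch size 1).
rtx_4090_bandwidth_gb_s = 1008.0   # assumed spec-sheet figure
print(f"≈{rtx_4090_bandwidth_gb_s / weights:.0f} tokens/s upper bound")
```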

Inference (Serving a Model) Comparison

| Model Size | Precision | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | 4-bit (GPTQ/AWQ) | ~6 GB | Any consumer GPU (RTX 3060 12GB+), L40S |
| 7B-8B | BF16 | ~16 GB | RTX 4090, L40S, A100 |
| 13B-14B | 4-bit | ~10 GB | RTX 4080+, L40S |
| 70B | 4-bit | ~40 GB | 1x A100 80GB, 1x H100, 1x MI300X |
| 70B | BF16 | ~140 GB | 2x A100 80GB, 1x H200 (tight), 1x MI300X |
| 405B (Llama 3.1) | FP8 | ~405 GB | 8x A100 80GB, 4x H200, 3x MI300X |

Fine-Tuning (LoRA / QLoRA)

Fine-tuning requires enough memory for the model weights, optimizer states, gradients, and activations. QLoRA reduces this dramatically by quantizing the base model to 4-bit.
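These requirements can be approximated with a per-parameter byte count, as in the sketch below. The assumptions are mine, not from the table: BF16 gradients and BF16 Adam moments for the full fine-tune (FP32 master weights and moments would roughly double it), a 4-bit frozen base for QLoRA, and an illustrative ~200M trainable LoRA parameters for a 70B model; activations are excluded.

```python
# Per-parameter training-memory estimates (activations excluded; they depend
# on batch size, sequence length, and gradient checkpointing).

def full_finetune_gb(n_params: float) -> float:
    """BF16 weights (2) + BF16 gradients (2) + Adam moments in BF16 (2+2).
    Keeping FP32 master weights and FP32 moments roughly doubles this."""
    return n_params * (2 + 2 + 2 + 2) / 1e9

def qlora_gb(n_params: float, lora_params: float) -> float:
    """4-bit frozen base model (~0.5 bytes/param) plus trainable LoRA
    adapters, which carry their own gradients and optimizer states
    (~16 bytes per adapter parameter)."""
    return (n_params * 0.5 + lora_params * 16) / 1e9

print(f"8B full fine-tune: ~{full_finetune_gb(8e9):.0f} GB before activations")
print(f"70B QLoRA:         ~{qlora_gb(70e9, 2e8):.0f} GB before activations")
```

The gap between the ~38 GB QLoRA estimate here and the ~48 GB figure in the table is activations and runtime overhead.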

Fine-Tuning (LoRA / QLoRA) Comparison

| Model Size | Method | VRAM Needed | Recommended GPU(s) |
|---|---|---|---|
| 7B-8B | QLoRA (4-bit) | ~10 GB | RTX 3060 12GB, RTX 4070 |
| 7B-8B | Full fine-tune (BF16) | ~60 GB | 1x A100 80GB, 1x H100 |
| 70B | QLoRA (4-bit) | ~48 GB | 1x A100 80GB, 1x H100, 2x RTX 4090 (tight) |
| 70B | Full fine-tune (BF16) | ~500 GB | 8x A100 80GB, 4x H200 |

Pretraining from Scratch

Pretraining is compute-bound: raw TFLOPS and multi-GPU scaling efficiency matter most. Large-scale pretraining also requires massive storage throughput and reliable networking.
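A rough way to sanity-check the GPU-hour figures in the table below is the standard 6ND approximation (about 6 FLOPs per parameter per training token). The ~1 PFLOP/s effective throughput per H100 used here is an assumption, roughly FP8 training at ~50% MFU; BF16 training or lower utilization raises the estimate by 2-3x, and smaller runs usually achieve less.

```python
# Training compute ≈ 6 * N * D FLOPs (N = parameters, D = training tokens).

def gpu_hours(n_params: float, n_tokens: float,
              effective_flops_per_gpu: float = 1.0e15) -> float:
    """GPU-hours for one run, assuming ~1 PFLOP/s sustained per H100."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / effective_flops_per_gpu / 3600

# 70B on 15T tokens -- on the order of the ~1,700,000 H100-hours listed below.
hours = gpu_hours(70e9, 15e12)
print(f"~{hours:,.0f} H100-hours")

# Ideal wall-clock on a 2,048-GPU cluster (ignores restarts, stragglers, and
# data-pipeline stalls, which stretch real runs considerably):
print(f"~{hours / 2048 / 24:.0f} days on 2,048 H100s")
```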

Pretraining from Scratch Comparison

| Model Size | Training Tokens | Approx. GPU-Hours (H100) | Hardware Recommendation |
|---|---|---|---|
| 1B | 100B | ~500 | 8x H100 node (2-3 days) |
| 7B | 1T | ~10,000 | 32-64x H100 cluster (1-2 weeks) |
| 70B | 15T | ~1,700,000 | 2,000+ H100 cluster (months) |

When VRAM Is the Bottleneck

For most practitioners, VRAM is the primary constraint. If a model does not fit in GPU memory, quantization (see Section 9.1) is often a better first step than buying a larger GPU.
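As a minimal illustration of that decision, the sketch below picks the least aggressive precision whose weights fit a given VRAM budget. The nominal bytes-per-parameter values and the 2 GB headroom allowance are assumptions; real quantized checkpoints carry extra overhead for scales and zero-points, and the KV cache needs room on top of the weights.

```python
# Pick the highest precision whose weights fit a given VRAM budget.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}  # nominal values

def smallest_safe_precision(n_params: float, vram_gb: float,
                            headroom_gb: float = 2.0):
    """Return the most precise format that fits, or None if even 4-bit
    does not (then shard across GPUs or choose a larger card)."""
    for name in ("bf16", "fp8", "int4"):          # most to least precise
        if n_params * BYTES_PER_PARAM[name] / 1e9 + headroom_gb <= vram_gb:
            return name
    return None

print(smallest_safe_precision(8e9, 24))    # bf16 -> fits an RTX 4090-class card
print(smallest_safe_precision(70e9, 48))   # int4 -> matches the ~40 GB row above
print(smallest_safe_precision(70e9, 24))   # None -> quantization alone is not enough
```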