Back-of-the-envelope calculations help you budget before committing to expensive GPU rentals. The formulas below provide rough but useful estimates.
Estimating VRAM for Inference
The memory needed to load a model depends on its parameter count and the numerical precision of the weights:

VRAM (GB) ≈ Parameters × Bytes per Parameter / 10^9
Where bytes per parameter depends on precision: FP32 = 4 bytes, BF16/FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes. Add 10-20% overhead for the KV cache, activation memory, and framework buffers.
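The estimate above can be sketched as a small helper. The 15% overhead default and the 7B example are illustrative assumptions, not figures from a specific deployment:

```python
def estimate_vram_gb(n_params: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Rough VRAM needed to serve a model: weights plus 10-20%
    overhead for KV cache, activations, and framework buffers."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

# A 7B model in BF16 (2 bytes/param): ~14 GB of weights, ~16 GB total.
print(round(estimate_vram_gb(7e9, 2), 1))   # → 16.1
# The same model in INT4 (0.5 bytes/param) fits in ~4 GB.
print(round(estimate_vram_gb(7e9, 0.5), 1))  # → 4.0
```

Treat the result as a floor: long contexts inflate the KV cache well beyond the fixed overhead assumed here.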
Estimating Training Compute
The total compute required for one epoch of training can be estimated by the "6ND" rule:

Training FLOPs ≈ 6 × N × D
Where N is the number of model parameters and D is the number of training tokens. The factor of 6 accounts for the forward pass (2ND) and backward pass (4ND). To convert from total FLOPs to GPU-hours:

GPU-hours ≈ Total FLOPs / (Peak FLOPS per GPU × MFU) / 3,600
MFU (Model FLOPS Utilization) represents what fraction of theoretical peak throughput you actually achieve. Typical values are 30-50% for well-optimized distributed training setups.
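Putting the two formulas together, a minimal sketch; the ~1e15 peak BF16 FLOPS figure for the GPU and the 40% MFU are assumed example values, not measurements:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # "6ND" rule: 2ND for the forward pass + 4ND for the backward pass.
    return 6 * n_params * n_tokens

def gpu_hours(total_flops: float, peak_flops_per_gpu: float,
              mfu: float = 0.40) -> float:
    # Effective throughput = peak × MFU; divide by 3,600 s/h for hours.
    return total_flops / (peak_flops_per_gpu * mfu) / 3600

# Example: 7B parameters on 1T tokens, GPU with an assumed ~1e15 FLOPS peak.
flops = training_flops(7e9, 1e12)        # 4.2e22 FLOPs
print(round(gpu_hours(flops, 1e15)))     # → 29167 GPU-hours
```

At 40% MFU this works out to roughly 29,000 GPU-hours, which is why the MFU term dominates budget accuracy: halving MFU doubles the bill.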
Estimating Fine-Tuning Cost
For LoRA/QLoRA fine-tuning, a practical rule of thumb is to anchor on a measured baseline and scale with model size: QLoRA fine-tuning of a 7B model on 50K examples (each ~512 tokens) for 3 epochs takes approximately 2-4 hours on a single A100. Expect roughly double that for a 13B model, and roughly 5x for a 70B model.
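This scaling rule can be encoded as a simple lookup; the 3-hour baseline is the midpoint of the 2-4 hour range above, and the multipliers mirror the rough factors given in the text rather than any benchmark:

```python
def qlora_hours_estimate(model_size_b: float, base_hours: float = 3.0) -> float:
    """Rough single-A100 QLoRA time for the baseline workload
    (50K examples of ~512 tokens, 3 epochs), scaled by model size."""
    if model_size_b <= 7:    # 7B baseline: ~2-4 h
        return base_hours
    if model_size_b <= 13:   # 13B: roughly double the baseline
        return base_hours * 2
    return base_hours * 5    # 70B-class: roughly 5x the baseline

print(qlora_hours_estimate(13))  # → 6.0
```

Scale the estimate linearly if your dataset or epoch count differs from the baseline workload.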
Estimating Inference Cost per Token
For API-served models, the cost per token follows from the GPU rental rate and the achieved throughput:

Cost per Token ≈ GPU Cost per Hour / (Tokens per Second × 3,600)
Where tokens per second depends on the model, hardware, batching strategy, and quantization level. A well-optimized vLLM deployment of a 70B model on an H100 can typically achieve 1,000-3,000 output tokens per second with continuous batching across concurrent requests.
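Expressed per million tokens, which is how API pricing is usually quoted; the $3/hour rental rate and 2,000 tok/s throughput below are assumed example numbers inside the range discussed above:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Serving cost per 1M output tokens from hourly GPU cost
    and aggregate throughput across batched requests."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# Assumed: $3/h for an H100, 2,000 tok/s aggregate throughput.
print(round(cost_per_million_tokens(3.0, 2000), 3))  # → 0.417
```

Note that throughput here is the aggregate across all concurrent requests, so better batching directly lowers the per-token cost.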
For inference: buy memory bandwidth. For training: buy FLOPS. For fine-tuning: buy enough VRAM to hold your model in the cheapest possible precision. When in doubt, start with the smallest GPU that fits your model, measure actual performance, and scale up only if needed.