Back-of-the-envelope calculations help you budget before committing to expensive GPU rentals. The formulas below provide rough but useful estimates.
Estimating VRAM for Inference
The memory needed to load a model depends on its parameter count and the numerical precision of the weights:

VRAM (GB) ≈ Parameters × Bytes per Parameter / 10^9
Where bytes per parameter depends on precision: FP32 = 4 bytes, BF16/FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes. Add 10-20% overhead for the KV cache, activation memory, and framework buffers.
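The estimate above can be sketched as a small helper. The 15% overhead default and the 7B example are illustrative assumptions, not figures from a specific deployment:

```python
def estimate_vram_gb(n_params: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Rough VRAM needed to serve a model: weights plus 10-20%
    overhead for KV cache, activations, and framework buffers."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

# A 7B model in BF16 (2 bytes/param): ~14 GB of weights, ~16 GB total.
print(round(estimate_vram_gb(7e9, 2), 1))   # → 16.1
# The same model in INT4 (0.5 bytes/param) fits in ~4 GB.
print(round(estimate_vram_gb(7e9, 0.5), 1))  # → 4.0
```

Treat the result as a floor: long contexts inflate the KV cache well beyond the fixed overhead assumed here.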
Estimating Training Compute
The total compute required for one epoch of training can be estimated by the "6ND" rule:

Training FLOPs ≈ 6 × N × D
Where N is the number of model parameters and D is the number of training tokens. The factor of 6 accounts for the forward pass (2ND) and backward pass (4ND). To convert from total FLOPs to GPU-hours:

GPU-hours ≈ Total FLOPs / (Peak FLOPS per GPU × MFU) / 3,600
MFU (Model FLOPS Utilization) represents what fraction of theoretical peak throughput you actually achieve. Typical values are 30-50% for well-optimized distributed training setups.
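Putting the two formulas together, a minimal sketch; the ~1e15 peak BF16 FLOPS figure for the GPU and the 40% MFU are assumed example values, not measurements:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # "6ND" rule: 2ND for the forward pass + 4ND for the backward pass.
    return 6 * n_params * n_tokens

def gpu_hours(total_flops: float, peak_flops_per_gpu: float,
              mfu: float = 0.40) -> float:
    # Effective throughput = peak × MFU; divide by 3,600 s/h for hours.
    return total_flops / (peak_flops_per_gpu * mfu) / 3600

# Example: 7B parameters on 1T tokens, GPU with an assumed ~1e15 FLOPS peak.
flops = training_flops(7e9, 1e12)        # 4.2e22 FLOPs
print(round(gpu_hours(flops, 1e15)))     # → 29167 GPU-hours
```

At 40% MFU this works out to roughly 29,000 GPU-hours, which is why the MFU term dominates budget accuracy: halving MFU doubles the bill.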
Estimating Fine-Tuning Cost
For LoRA/QLoRA fine-tuning, a practical rule of thumb is to anchor on a measured baseline and scale with model size: QLoRA fine-tuning of a 7B model on 50K examples (each ~512 tokens) for 3 epochs takes approximately 2-4 hours on a single A100. Expect roughly double that for a 13B model, and roughly 5x for a 70B model.
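This scaling rule can be encoded as a simple lookup; the 3-hour baseline is the midpoint of the 2-4 hour range above, and the multipliers mirror the rough factors given in the text rather than any benchmark:

```python
def qlora_hours_estimate(model_size_b: float, base_hours: float = 3.0) -> float:
    """Rough single-A100 QLoRA time for the baseline workload
    (50K examples of ~512 tokens, 3 epochs), scaled by model size."""
    if model_size_b <= 7:    # 7B baseline: ~2-4 h
        return base_hours
    if model_size_b <= 13:   # 13B: roughly double the baseline
        return base_hours * 2
    return base_hours * 5    # 70B-class: roughly 5x the baseline

print(qlora_hours_estimate(13))  # → 6.0
```

Scale the estimate linearly if your dataset or epoch count differs from the baseline workload.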
Estimating Inference Cost per Token
For API-served models, the cost per token follows from the GPU rental rate and the achieved throughput:

Cost per Token ≈ GPU Cost per Hour / (Tokens per Second × 3,600)
Where tokens per second depends on the model, hardware, batching strategy, and quantization level. A well-optimized vLLM deployment of a 70B model on an H100 can typically achieve 1,000-3,000 output tokens per second with continuous batching across concurrent requests.
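Expressed per million tokens, which is how API pricing is usually quoted; the $3/hour rental rate and 2,000 tok/s throughput below are assumed example numbers inside the range discussed above:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Serving cost per 1M output tokens from hourly GPU cost
    and aggregate throughput across batched requests."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# Assumed: $3/h for an H100, 2,000 tok/s aggregate throughput.
print(round(cost_per_million_tokens(3.0, 2000), 3))  # → 0.417
```

Note that throughput here is the aggregate across all concurrent requests, so better batching directly lowers the per-token cost.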
For inference: buy memory bandwidth. For training: buy FLOPS. For fine-tuning: buy enough VRAM to hold your model in the cheapest possible precision. When in doubt, start with the smallest GPU that fits your model, measure actual performance, and scale up only if needed.