Building Conversational AI with LLMs and Agents
Appendix G

GPU Hardware and Cloud Compute

A practical guide to GPUs, cloud pricing, and cost estimation for training, fine-tuning, and serving large language models

Big Picture

This appendix is a practical reference for GPU selection, cloud pricing, and cost estimation for training, fine-tuning, and serving large language models. It covers the major accelerator families (NVIDIA A100, H100, L40S, consumer-grade cards; Google TPUs; AMD MI300X), cloud provider pricing across AWS, GCP, and Azure, a decision framework for matching hardware to task requirements, and formulas for estimating compute cost before committing to a run.

Hardware decisions are among the highest-leverage choices in an LLM project. Running fine-tuning on an undersized GPU causes out-of-memory failures or forces batch size reductions that slow training significantly. Overprovisioning cloud compute wastes budget. A working knowledge of what each GPU tier can handle, and how to estimate cost before a run, is a practical skill that pays for itself immediately.

This appendix is most valuable for ML engineers, researchers, and technical leads who are responsible for infrastructure decisions. It is also useful for anyone running hands-on experiments who wants to understand why specific hardware configurations are recommended in the tutorial chapters.

Hardware selection is most consequential when planning fine-tuning runs covered in Chapter 14 and Chapter 15 (PEFT). Pretraining compute requirements and scaling laws are discussed in Chapter 6 (Pretraining and Scaling Laws). Production serving hardware decisions connect to Chapter 31 (Production Engineering).

Prerequisites

No specific chapter prerequisites are required. A basic understanding of what a GPU does (parallel matrix computation) is assumed. The cost estimation formulas in Section G.4 use simple arithmetic; familiarity with parameter counts and token throughput concepts from Chapter 6 will help you apply them to realistic scenarios.
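To preview the kind of arithmetic Section G.4 formalizes, here is a minimal sketch that estimates training cost from a parameter count and token budget using the common C ≈ 6·N·D FLOPs approximation. The throughput, utilization, and hourly-rate figures below are illustrative assumptions, not quoted prices or the exact formulas from Section G.4.

```python
def estimate_training_cost(
    n_params: float,                 # model parameters, e.g. 7e9
    n_tokens: float,                 # training tokens, e.g. 1e9
    peak_flops: float = 312e12,      # assumed per-GPU peak (A100 BF16)
    mfu: float = 0.4,                # assumed model FLOPs utilization
    usd_per_gpu_hour: float = 2.0,   # assumed on-demand hourly rate
) -> tuple[float, float]:
    """Return (gpu_hours, usd) for a dense-model training run."""
    total_flops = 6 * n_params * n_tokens        # forward + backward pass
    seconds = total_flops / (peak_flops * mfu)   # wall-clock across GPUs
    gpu_hours = seconds / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

hours, cost = estimate_training_cost(7e9, 1e9)
print(f"~{hours:.0f} GPU-hours, ~${cost:.0f}")
```

GPU-hours scale linearly with both parameters and tokens, which is why the estimate is worth running before a job rather than after: doubling the token budget doubles the bill.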

When to Use This Appendix

Consult this appendix before starting any fine-tuning or training run to verify your hardware is sufficient and to estimate cost. Use Section G.3 when choosing between cloud instances or deciding whether to use QLoRA (smaller GPU requirement) versus full fine-tuning. Use Section G.4 before kicking off long training jobs. If you work exclusively with inference APIs (OpenAI, Anthropic, Google) and never run your own models, you can defer this appendix. For environment setup on your chosen hardware, see Appendix D.
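The QLoRA-versus-full-fine-tuning decision mentioned above comes down to a memory estimate you can do in a few lines. The sketch below uses standard rules of thumb (bf16 weights at 2 bytes/param; mixed-precision Adam adding roughly 14 bytes/param for gradients and optimizer states; 4-bit quantized weights at about 0.5 bytes/param) and ignores activations and framework overhead, so treat the result as a floor, not a budget.

```python
def min_vram_gib(n_params: float, method: str) -> float:
    """Rough lower bound on GPU memory (GiB) for fine-tuning a dense model."""
    gib = 1024**3
    if method == "full":    # bf16 weights + bf16 grads + Adam states
        bytes_per_param = 2 + 2 + 12
    elif method == "lora":  # frozen bf16 weights; adapter params negligible
        bytes_per_param = 2
    elif method == "qlora": # 4-bit frozen weights; adapter params negligible
        bytes_per_param = 0.5
    else:
        raise ValueError(f"unknown method: {method}")
    return n_params * bytes_per_param / gib

for m in ("full", "lora", "qlora"):
    print(f"7B {m}: ~{min_vram_gib(7e9, m):.0f} GiB")
```

By this estimate a 7B model needs on the order of 100 GiB for full fine-tuning (multiple large GPUs) but only a few GiB of weight memory under QLoRA, which is why the tutorial chapters can run on a single consumer-grade card.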

Sections