"The carbon footprint of training a single large model can exceed the lifetime emissions of five automobiles. Scale is not free."
A Conscientious Sage, Carbon-Counting AI Agent
Before we can shrink the carbon footprint of an LLM, we have to be able to measure it. This section sets the quantitative foundation: how to express training cost in tCO2e from first principles (FLOPs, MFU, PUE, grid carbon intensity), how inference carbon decomposes into independently measurable factors, and which standardized efficiency metrics (FLOPs per token, energy per parameter, model FLOPs utilization, tokens per kWh) let us compare a 7B and a 70B run on an even playing field. Section 55.2 then turns these measurements into mitigation strategies; Section 55.3 turns them into compliance reports.
Prerequisites
This section builds on the pretraining and scaling laws from Section 6.1 and the distributed training material in Chapter 6. Familiarity with GPU hardware (TDP, MFU) from Chapter 9 and with the EU AI Act compliance landscape from Section 53.1 provides helpful context but is not strictly required.
55.1.1 Carbon footprint of LLM training
Every floating-point operation performed during training consumes energy. The total energy consumption of a training run depends on three factors: the number of FLOPs required, the power draw of the hardware performing those FLOPs, and the overhead of the data center infrastructure (cooling, networking, storage). This overhead is captured by the Power Usage Effectiveness (PUE) metric, defined as the ratio of total facility energy to IT equipment energy. A PUE of 1.0 would mean zero overhead; hyperscale data centers typically achieve PUE values between 1.1 and 1.3, while older facilities may exceed 1.5.
The carbon intensity of the electricity grid where training occurs is the final multiplier. Training in Quebec (hydroelectric, ~20 gCO2/kWh) produces roughly one forty-fifth the emissions of the same run in West Virginia (coal-heavy, ~900 gCO2/kWh). This single decision, choosing where to train, can outweigh all other optimization efforts combined.
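The two paragraphs above chain into a single sequence of multiplications. The block below is a sketch of that chain; the symbols ($F$ for total training FLOPs, $P_{\mathrm{peak}}$ for per-GPU peak throughput, $W_{\mathrm{GPU}}$ for per-GPU power draw) are notation introduced here for illustration, and it assumes every GPU draws its rated power for the whole run:

```latex
\begin{align*}
t_{\mathrm{GPU}}   &= \frac{F}{P_{\mathrm{peak}} \cdot \mathrm{MFU}}
                   && \text{(GPU-seconds of compute)} \\
E_{\mathrm{IT}}    &= t_{\mathrm{GPU}} \cdot W_{\mathrm{GPU}}
                   && \text{(IT equipment energy)} \\
E_{\mathrm{total}} &= \mathrm{PUE} \cdot E_{\mathrm{IT}}
                   && \text{(facility energy)} \\
C_{\mathrm{train}} &= E_{\mathrm{total}} \cdot I_{\mathrm{grid}}
                   && \text{(emissions)}
\end{align*}
```

Because $I_{\mathrm{grid}}$ enters only as a final multiplier, moving an otherwise identical run between regions rescales emissions by the ratio of the two grid intensities and leaves every other term unchanged.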
Training GPT-4 consumed an estimated 50 GWh of electricity, enough to power roughly 4,600 U.S. households for an entire year. Yet the inference cost dwarfs training: serving the model to millions of daily users burns through the equivalent of the training energy budget every few weeks. In the economics of LLM carbon footprints, training is the down payment; inference is the mortgage.
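The carbon emitted by one inference request decomposes into three independently measurable factors, scaled by the number of generated tokens:
$$C_{\mathrm{req}} = \bar{T}_{\mathrm{out}} \times e_{\mathrm{tok}} \times \mathrm{PUE} \times I_{\mathrm{grid}}$$
where $e_{\mathrm{tok}}$ is the IT energy per generated token, PUE is the facility overhead, $I_{\mathrm{grid}}$ is the grid carbon intensity, and $\bar{T}_{\mathrm{out}}$ is the number of output tokens.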
For a Llama-3 70B served on H100s with INT8, take $e_{\mathrm{tok}} \approx 0.3$ mWh, PUE = 1.1, and $I_{\mathrm{grid}} = 400$ gCO2/kWh. A 500-token response then emits about $500 \times 0.3 \times 1.1 \times 400 / 1000 = 66$ mgCO2 (0.066 g). At Bing's reported 100M chat queries per day, that is roughly 6.6 tonnes/day, or 2.4 kilotonnes/year, ignoring prompt processing. Inference dominates the lifetime footprint after roughly $N_{\mathrm{break}} = C_{\mathrm{train}}/(\bar{T}_{\mathrm{out}}\,e_{\mathrm{tok}})$ requests (with $C_{\mathrm{train}}$ expressed as training energy in the same units as $e_{\mathrm{tok}}$), which Llama-3 70B crosses in under 8 weeks at typical OpenAI/Anthropic traffic. See Patterson et al., 2021 and Luccioni et al., 2024.
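The per-request decomposition and the break-even count can be checked with a few lines of Python. This is a minimal sketch, not a measurement tool: the function names are hypothetical, the per-token energy is assumed to be 0.3 mWh (0.0003 Wh), and the 289,000 kWh training energy is an illustrative placeholder rather than a reported figure.

```python
def request_emissions_g(tokens_out, e_tok_wh, pue, grid_g_per_kwh):
    """gCO2 for one request: tokens x Wh/token x PUE x gCO2/kWh."""
    energy_kwh = tokens_out * e_tok_wh * pue / 1000.0
    return energy_kwh * grid_g_per_kwh


def break_even_requests(train_energy_kwh, tokens_out, e_tok_wh):
    """Requests after which inference energy equals training energy (both IT-side)."""
    return train_energy_kwh * 1000.0 / (tokens_out * e_tok_wh)


# Assumed inputs: 500 output tokens, 0.3 mWh per token, PUE 1.1, 400 gCO2/kWh
per_request_g = request_emissions_g(500, 0.0003, 1.1, 400)
daily_tonnes = per_request_g * 100e6 / 1e6   # 100M requests per day, g -> tonnes
n_break = break_even_requests(289_000, 500, 0.0003)  # illustrative training energy

print(f"per request: {per_request_g * 1000:.0f} mgCO2")
print(f"per day:     {daily_tonnes:.1f} tonnes")
print(f"break-even:  {n_break:.3g} requests")
```

PUE and grid intensity cancel out of the break-even count when training and serving share a facility, which is why `break_even_requests` takes only energies.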
Chinchilla scaling (Hoffmann et al., 2022) chooses the parameter count $N$ and pretraining token count $D$ that minimize training loss subject to a fixed training compute budget $C = 6ND$. Sardana et al. (2024) show that this is the wrong objective once you also pay for inference. The total lifetime FLOPs are
$$C_{\mathrm{life}} = 6ND + 2NR\,\bar{T}_{\mathrm{out}}$$
where $R$ is the expected request count over the model's deployment lifetime and $\bar{T}_{\mathrm{out}}$ the average tokens-per-request. The inference-aware optimum shifts toward smaller, longer-trained models: for the same loss target, doubling $R$ reduces the optimal $N$ by roughly $2^{1/3}$. This is the formal reason Llama-3 8B and Mistral-7B over-train on far more tokens than Chinchilla recommends; the lifetime carbon footprint actually decreases when you do so. Inference-aware scaling is the link between pretraining choices and operational sustainability.
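A short worked inequality shows when the inference term takes over. It uses only the $6ND$ training cost and the $2N$ FLOPs-per-generated-token forward cost from this section; the numeric case ($D = 2 \times 10^{12}$ training tokens) is an illustrative assumption, and prompt processing is ignored:

```latex
\begin{align*}
C_{\mathrm{life}} &= \underbrace{6ND}_{\text{training}}
                   + \underbrace{2NR\,\bar{T}_{\mathrm{out}}}_{\text{inference}} \\
2NR\,\bar{T}_{\mathrm{out}} > 6ND
  &\iff R\,\bar{T}_{\mathrm{out}} > 3D \\
D = 2 \times 10^{12}
  &\implies R\,\bar{T}_{\mathrm{out}} > 6 \times 10^{12}\ \text{generated tokens}
\end{align*}
```

The crossover does not depend on $N$: under this accounting, inference FLOPs exceed training FLOPs once the model has generated three times as many tokens as it was trained on.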
Algorithm: INFERENCE-AWARE-SCALING (Sardana & Frankle, 2024)
Input:  lifetime FLOP budget C_life,
        expected request volume R, average output tokens per request T_bar,
        Chinchilla loss model L(N, D) = a/N^alpha + b/D^beta + L_irreducible
Output: optimal parameter count N*, training tokens D*

// Inference cost per token = 2N FLOPs (forward pass only)
// Solve the constrained minimization:
//   minimize    L(N, D)
//   subject to  6 N D + 2 N R T_bar <= C_life

Construct the Lagrangian:
    Lagr(N, D, lambda) = L(N, D) + lambda * (6 N D + 2 N R T_bar - C_life)
Solve dLagr/dN = 0, dLagr/dD = 0, dLagr/dlambda = 0 jointly.
Under the Chinchilla form, this yields the modified optimum:
    N* satisfies  alpha * a / N*^{alpha+1} = lambda * (6 D* + 2 R T_bar)
    D* satisfies  beta * b / D*^{beta+1}   = lambda * 6 N*
// As R grows, the inference term 2 N R T_bar dominates
// and the optimum shifts to (smaller N, larger D)
Return (N*, D*, lambda*)

The algorithm adds 2 N R T_bar (inference FLOPs) to the standard Chinchilla compute budget, then re-derives the optimum via a Lagrangian. As deployed request volume R grows, the optimum shifts toward smaller models trained on more tokens, which is why the modern 7B-class checkpoint is both cheaper and lower-carbon to operate at scale.

Empirically, for typical foundation-model deployments ($R \sim 10^{11}$ requests over 18 months at $\bar{T}_{\mathrm{out}} = 200$), the Sardana-corrected optimum is $N \approx 0.5\,N_{\mathrm{Chinchilla}}$ with $D \approx 4\,D_{\mathrm{Chinchilla}}$: a much smaller, much-longer-trained model. The carbon argument and the cost argument coincide here, which is why the trend toward 7B-class, multi-trillion-token models is both economically and environmentally driven (Sardana et al., 2024).
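The constrained minimization above can also be explored numerically. The sketch below does not solve the Lagrangian conditions in closed form. Instead it substitutes the binding budget constraint to eliminate $D$ and scans $N$ on a log grid. The function name is hypothetical, the loss coefficients are the fitted values commonly quoted from Hoffmann et al. (2022) and should be treated as assumptions, and the $10^{24}$ FLOP lifetime budget is illustrative.

```python
def inference_aware_optimum(c_life, requests, tokens_out,
                            a=406.4, b=410.7, alpha=0.34, beta=0.28,
                            l_irr=1.69, steps=4000):
    """Return (N*, D*, loss) minimizing L(N, D) under 6ND + 2NR*T_bar = c_life."""
    served_tokens = requests * tokens_out
    best = None
    for i in range(steps + 1):
        n = 10.0 ** (7.0 + 6.0 * i / steps)        # N from 1e7 to 1e13
        train_flops = c_life - 2.0 * n * served_tokens
        if train_flops <= 0.0:
            break                                   # larger N is infeasible too
        d = train_flops / (6.0 * n)                 # spend the remainder on tokens
        loss = a / n**alpha + b / d**beta + l_irr
        if best is None or loss < best[2]:
            best = (n, d, loss)
    return best


C_LIFE = 1e24  # illustrative lifetime FLOP budget
n_train_only, d_train_only, _ = inference_aware_optimum(C_LIFE, 0, 200)
n_served, d_served, _ = inference_aware_optimum(C_LIFE, 1e11, 200)

print(f"no inference: N = {n_train_only:.2e}, D = {d_train_only:.2e}")
print(f"R = 1e11:     N = {n_served:.2e}, D = {d_served:.2e}")
```

Setting `requests=0` recovers the training-only optimum, so the two calls show the direction of the shift: the served model comes out smaller and is trained on more tokens. The exact ratios depend on the assumed budget and coefficients.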
# Estimating training carbon footprint from first principles
import dataclasses


@dataclasses.dataclass
class TrainingCarbonEstimate:
    """Estimate CO2 emissions for an LLM training run."""
    total_flops: float        # Total FLOPs for the training run
    gpu_peak_tflops: float    # Peak TFLOPS of a single GPU
    gpu_utilization: float    # Typical MFU (model FLOPs utilization)
    gpu_power_watts: float    # TDP of a single GPU
    num_gpus: int             # Number of GPUs used
    pue: float                # Power Usage Effectiveness of the data center
    carbon_intensity: float   # gCO2/kWh of the electricity grid

    def compute(self) -> dict:
        # Effective TFLOPS per GPU
        effective_tflops = self.gpu_peak_tflops * self.gpu_utilization
        # Total GPU-hours needed
        total_gpu_hours = (
            self.total_flops / (effective_tflops * 1e12 * 3600)
        )
        # Wall-clock hours (parallelized across GPUs)
        wall_hours = total_gpu_hours / self.num_gpus
        # Energy consumption
        gpu_energy_kwh = (
            self.gpu_power_watts * self.num_gpus * wall_hours / 1000
        )
        total_energy_kwh = gpu_energy_kwh * self.pue
        # Carbon emissions
        co2_kg = total_energy_kwh * self.carbon_intensity / 1000
        co2_tonnes = co2_kg / 1000
        return {
            "total_gpu_hours": total_gpu_hours,
            "wall_clock_hours": wall_hours,
            "gpu_energy_kwh": gpu_energy_kwh,
            "total_energy_kwh": total_energy_kwh,
            "co2_kg": co2_kg,
            "co2_tonnes": co2_tonnes,
        }


# Example: Estimating a 70B parameter model training run
# Using Chinchilla-optimal ~1.4T tokens, ~6 FLOPs per token per param
estimate = TrainingCarbonEstimate(
    total_flops=70e9 * 1.4e12 * 6,  # ~5.88e23 FLOPs
    gpu_peak_tflops=989,            # H100 SXM BF16 peak
    gpu_utilization=0.40,           # 40% MFU is typical
    gpu_power_watts=700,            # H100 SXM TDP
    num_gpus=2048,                  # Typical large training cluster
    pue=1.1,                        # Hyperscale data center
    carbon_intensity=400,           # gCO2/kWh, US average grid
)
result = estimate.compute()
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
The same quantity can be measured rather than estimated in a few lines with CodeCarbon, which samples actual power draw during the run:
# pip install codecarbon
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(project_name="llm-training")
tracker.start()
# ... your training code here ...
emissions_kg = tracker.stop()
print(f"CO2: {emissions_kg:.4f} kg, Energy: {tracker.final_emissions_data.energy_consumed:.4f} kWh")
# Computing and comparing efficiency metrics across models
def efficiency_metrics(model: dict) -> dict:
    """Compute energy, CO2, and per-parameter efficiency for one training run."""
    gpu_energy_kwh = (
        model["gpu_tdp_w"] * model["num_gpus"]
        * model["training_hours"] / 1000
    )
    total_energy_kwh = gpu_energy_kwh * model["pue"]
    return {
        "total_energy_kwh": total_energy_kwh,
        "co2_tonnes": total_energy_kwh * model["grid_co2_g_kwh"] / 1e6,
        "flops_per_token": 6 * model["params"],
        "energy_per_param_wh": (total_energy_kwh * 1000) / model["params"],
        "tokens_per_kwh": model["tokens"] / total_energy_kwh,
    }


models = [
    {
        "name": "Llama 2 7B",
        "params": 7e9, "tokens": 2e12,
        "num_gpus": 256, "training_hours": 480,
        "gpu_tdp_w": 400, "pue": 1.1, "grid_co2_g_kwh": 400,
    },
    {
        "name": "Llama 2 70B",
        "params": 70e9, "tokens": 2e12,
        "num_gpus": 2048, "training_hours": 1720,
        "gpu_tdp_w": 400, "pue": 1.1, "grid_co2_g_kwh": 400,
    },
]

for m in models:
    mx = efficiency_metrics(m)
    print(f"\n=== {m['name']} ===")
    print(f"Total energy: {mx['total_energy_kwh']:,.0f} kWh")
    print(f"CO2 emissions: {mx['co2_tonnes']:,.1f} tonnes")
    print(f"FLOPs/token: {mx['flops_per_token']:.2e}")
    print(f"Energy/param: {mx['energy_per_param_wh']:.4f} Wh")
    print(f"Tokens/kWh: {mx['tokens_per_kwh']:,.0f}")
With the inputs above, tokens_per_kwh collapses from roughly 37 million to roughly 1.3 million between Llama-2 7B and 70B, showing that energy efficiency scales sub-linearly with capability and motivating the architectural moves in Section 55.2.

Who: Two open-weight frontier labs publishing detailed compute disclosures in 2024-2025.
Situation: DeepSeek-V3 reports a training cost of roughly $5.6M, consuming approximately 2.788M H800 GPU-hours. Meta's Llama 3.1 model card reports approximately 39.3M H100 GPU-hours and roughly 11,390 tCO2e (location-based) for the 8B, 70B, and 405B suite combined.
Estimate: Applying the formula in Code Fragment 55.1.1b with H800 power ~350W and the Chinese eastern-grid carbon intensity of ~580 gCO2/kWh: $2.788 \times 10^6\,\mathrm{h} \times 0.35\,\mathrm{kW} \times 1.1\,\mathrm{PUE} \times 580\,\mathrm{g/kWh} \approx 623\,\mathrm{tCO_2e}$. Using the US grid average of 400 gCO2/kWh instead gives ~430 tCO2e. The often-cited ~1,100 tCO2e estimate from Patterson et al., 2024 incorporates embodied hardware carbon and longer wall-clock time from validation runs.
Comparison: Llama 3's 70B-only training is estimated at ~340 tCO2e using Patterson's full accounting, but the suite total of ~11,390 tCO2e (including 405B) is roughly ten times the ~1,100 tCO2e figure for DeepSeek-V3 alone. The MoE-versus-dense architecture choice is the dominant factor (DeepSeek-V3 is a 671B-parameter MoE with only 37B active per token).
Lesson: Even with such large training disparities, both Meta and DeepSeek will see inference dominate cumulative carbon within months of launch (the same break-even logic from the Key Insight above). The decision that matters most for lifetime emissions is therefore architectural (sparsity, parameter efficiency) and deployment-side (quantization, region), not training-budget headlines. See Patterson et al., 2024 for the full lifecycle methodology.
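The GPU-hour arithmetic in the Estimate step of the case study above can be reproduced directly. This is a minimal sketch with a hypothetical helper name; the 350 W draw, PUE of 1.1, and grid intensities are the case study's own assumptions, not measured values.

```python
def gpu_hours_to_tco2e(gpu_hours, gpu_kw, pue, grid_g_per_kwh):
    """Convert reported GPU-hours to tonnes CO2e under constant-power assumptions."""
    facility_kwh = gpu_hours * gpu_kw * pue       # IT energy scaled by PUE
    return facility_kwh * grid_g_per_kwh / 1e6    # grams -> tonnes


DEEPSEEK_V3_GPU_HOURS = 2.788e6  # reported H800 GPU-hours

china_east = gpu_hours_to_tco2e(DEEPSEEK_V3_GPU_HOURS, 0.35, 1.1, 580)
us_average = gpu_hours_to_tco2e(DEEPSEEK_V3_GPU_HOURS, 0.35, 1.1, 400)

print(f"China eastern grid: {china_east:,.0f} tCO2e")
print(f"US average grid:    {us_average:,.0f} tCO2e")
```

Only operational electricity is counted here; embodied hardware carbon and auxiliary runs, which the larger published estimates include, are outside this calculation.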
Inference dominates training in cumulative emissions. A model trained once for $10 million may serve billions of inference requests over its lifetime. Patterson et al. (2022) found that for widely deployed models, inference energy can exceed training energy by 10x or more within the first year of deployment. Optimizing inference efficiency (through quantization, distillation, speculative decoding, and caching) therefore has a larger cumulative impact than optimizing training efficiency. The serving optimizations from Section 70.5 are as much about environmental responsibility as they are about cost reduction.
55.1.2 Training efficiency metrics
Comparing the environmental cost of different models requires standardized metrics. Raw energy consumption is not directly comparable between a 7B model trained on 2T tokens and a 70B model trained on 1.4T tokens. The following metrics normalize for model size and training duration, enabling meaningful comparisons.
| Metric | Formula | What It Captures | Typical Range |
|---|---|---|---|
| FLOPs per Token | $\approx 6 \times N$ (forward + backward) | Computational cost per training token | 4.2e10 (7B) to 4.2e11 (70B) |
| Energy per Parameter | $\frac{E_\text{total}}{N}$ | Amortized energy cost of each weight | 0.01 to 0.1 Wh/param |
| CO2 per Experiment | $E_\text{total} \times I_\text{grid}$, with $E_\text{total} = E_\text{IT} \times \text{PUE}$ | Total carbon for one complete run | 1 to 500+ tonnes |
| Model FLOPs Utilization (MFU) | Actual FLOP/s / Peak FLOP/s | Hardware efficiency; higher is greener | 30% to 55% |
| Tokens per kWh | Total tokens / $E_\text{total}$ | Training data throughput per unit energy | Varies by hardware generation |
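MFU and tokens per kWh from the table can both be derived from a single throughput measurement. The sketch below assumes the $6N$ FLOPs-per-token approximation (which ignores attention FLOPs). The function names and the 1.9M tokens/s cluster throughput are hypothetical, chosen only to illustrate the calculation.

```python
def model_flops_utilization(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOP/s divided by aggregate peak FLOP/s."""
    achieved_flops_per_second = 6 * params * tokens_per_second
    return achieved_flops_per_second / (num_gpus * peak_flops_per_gpu)


def tokens_per_kwh(tokens_per_second, num_gpus, gpu_watts, pue):
    """Training tokens processed per kWh of facility energy."""
    facility_kw = num_gpus * gpu_watts * pue / 1000
    return tokens_per_second * 3600 / facility_kw


# Hypothetical 70B run on 2048 GPUs at 989 TFLOPS peak and 700 W each
mfu = model_flops_utilization(70e9, 1.9e6, 2048, 989e12)
tpk = tokens_per_kwh(1.9e6, 2048, 700, 1.1)

print(f"MFU:        {mfu:.1%}")
print(f"Tokens/kWh: {tpk:,.0f}")
```

Both metrics rise together when throughput improves on fixed hardware, which is the sense in which higher MFU is greener.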
The Chinchilla scaling laws (Hoffmann et al., 2022) have an important environmental implication: compute-optimal training scales model size and training tokens in roughly equal proportion as the compute budget grows, which works out to about 20 tokens per parameter. Many early large models were undertrained relative to this optimum, meaning they consumed more training energy than necessary to reach a given performance level. Following compute-optimal scaling is therefore both a performance strategy and a green AI strategy for the training phase; once deployment volume is large, the inference-aware optimum from Section 55.1.1 takes precedence.
Now that we have a measurement vocabulary, the next step is to act on it. Section 55.2: Reducing the Footprint covers the three layers where mitigation operates (model architecture, hardware substrate, operating choices) and introduces the leaderboards and tracking tools that turn carbon accounting into a daily engineering habit.