Quantifying the Environmental Cost

Section 55.1

"The carbon footprint of training a single large model can exceed the lifetime emissions of five automobiles. Scale is not free."

A Conscientious Sage, Carbon-Counting AI Agent
Big Picture

Before we can shrink the carbon footprint of an LLM, we have to be able to measure it. This section sets the quantitative foundation: how to express training cost in tCO2e from first principles (FLOPs, MFU, PUE, grid carbon intensity), how inference carbon decomposes into independently measurable factors, and which standardized efficiency metrics (FLOPs per token, energy per parameter, model FLOPs utilization, tokens per kWh) let us compare a 7B and a 70B run on an even playing field. Section 55.2 then turns these measurements into mitigation strategies; Section 55.3 turns them into compliance reports.

Prerequisites

This section builds on the pretraining and scaling laws from Section 6.1 and the distributed training material in Chapter 6. Familiarity with GPU hardware (TDP, MFU) from Chapter 9 and with the EU AI Act compliance landscape from Section 53.1 provides helpful context but is not strictly required.

A large factory with smokestacks shaped like GPU chips producing cloud-shaped carbon footprints, while a scientist robot measures the emissions.
Figure 55.1.1: Scale is not free. Every floating-point operation consumes energy, and the EU AI Act now requires environmental impact disclosures for general-purpose AI models.

55.1.1 Carbon footprint of LLM training

Every floating-point operation performed during training consumes energy. The total energy consumption of a training run depends on three factors: the number of FLOPs required, the energy the hardware spends per FLOP (its power draw divided by the throughput it actually achieves), and the overhead of the data center infrastructure (cooling, networking, storage). This overhead is captured by the Power Usage Effectiveness (PUE) metric, defined as the ratio of total facility energy to IT equipment energy. A PUE of 1.0 would mean zero overhead; hyperscale data centers typically achieve PUE values between 1.1 and 1.3, while older facilities may exceed 1.5.

The carbon intensity of the electricity grid where training occurs is the final multiplier. For the same energy use, training in Quebec (hydroelectric, ~20 gCO2/kWh) produces roughly one forty-fifth of the emissions of training in West Virginia (coal-heavy, ~900 gCO2/kWh). This single decision, choosing where to train, can outweigh all other optimization efforts combined.
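As a quick numerical check of the two multipliers described above, the following minimal sketch converts a fixed amount of IT energy into emissions under different PUE and grid-intensity assumptions. The function name `facility_emissions_tonnes` and the 1 GWh workload are illustrative assumptions; the grid intensities are the approximate values quoted in the text.

```python
def facility_emissions_tonnes(it_energy_kwh: float, pue: float,
                              grid_g_per_kwh: float) -> float:
    """tCO2 = IT energy (kWh) x PUE x grid intensity (gCO2/kWh) / 1e6."""
    return it_energy_kwh * pue * grid_g_per_kwh / 1e6


IT_ENERGY_KWH = 1_000_000  # hypothetical workload: 1 GWh of IT-equipment energy

# Same workload, same PUE, two different grids
quebec = facility_emissions_tonnes(IT_ENERGY_KWH, pue=1.1, grid_g_per_kwh=20)
west_virginia = facility_emissions_tonnes(IT_ENERGY_KWH, pue=1.1, grid_g_per_kwh=900)

# Same grid, hyperscale vs. older facility
older_site = facility_emissions_tonnes(IT_ENERGY_KWH, pue=1.5, grid_g_per_kwh=900)

print(f"Quebec: {quebec:.0f} t, West Virginia: {west_virginia:.0f} t")
print(f"Grid ratio: {west_virginia / quebec:.0f}x, PUE ratio: {older_site / west_virginia:.2f}x")
```

Under these assumed inputs the grid choice changes emissions by a far larger factor than the PUE difference between a hyperscale and an older facility.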

Fun Fact

Training GPT-4 consumed an estimated 50 GWh of electricity, enough to power roughly 4,600 U.S. households for an entire year. Yet the inference cost dwarfs training: serving the model to millions of daily users burns through the equivalent of the training energy budget every few weeks. In the economics of LLM carbon footprints, training is the down payment; inference is the mortgage.

Key Insight: Inference Carbon as a Four-Factor Product

The carbon emitted by one inference request decomposes into four independently measurable factors:

$$\mathrm{CO_2}_{\text{inference}}\;[\mathrm{g}] \;=\; \underbrace{T_{\mathrm{out}}}_{\substack{\text{tokens}\\\text{generated}}} \;\cdot\; \underbrace{e_{\mathrm{tok}}}_{\substack{\text{energy}\\\text{per token (Wh)}}} \;\cdot\; \underbrace{\mathrm{PUE}}_{\substack{\text{data-center}\\\text{overhead}}} \;\cdot\; \underbrace{I_{\mathrm{grid}}}_{\substack{\text{grid intensity}\\\mathrm{(gCO_2/kWh)}}} \;/\; 1000.$$

For a Llama-3 70B served on H100s with INT8: $e_{\mathrm{tok}} \approx 0.3$ Wh; PUE = 1.1; $I_{\mathrm{grid}} = 400$ gCO2/kWh; a 500-token response emits about $500 \times 0.3 \times 1.1 \times 400 / 1000 = 66$ gCO2. At Bing's reported 100M chat queries per day, that is roughly 6,600 tonnes/day, or about 2.4 megatonnes/year, ignoring prompt processing. Inference dominates the lifetime footprint after roughly $N_{\mathrm{break}} = C_{\mathrm{train}}/(\bar{T}_{\mathrm{out}}\,e_{\mathrm{tok}})$ requests, where $C_{\mathrm{train}}$ is the training energy in Wh and $\bar{T}_{\mathrm{out}}$ is the average number of tokens generated per request; Llama-3 70B crosses this point in under 8 weeks at typical OpenAI/Anthropic traffic. See Patterson et al., 2021 and Luccioni et al., 2024.
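The per-request product can be checked directly in code. This is a minimal sketch: the function name `inference_co2_grams` is an assumption made for illustration, and the inputs are the example values quoted above rather than measurements.

```python
def inference_co2_grams(tokens_out: float, wh_per_token: float,
                        pue: float, grid_g_per_kwh: float) -> float:
    """gCO2 for one request: T_out * e_tok * PUE * I_grid / 1000."""
    return tokens_out * wh_per_token * pue * grid_g_per_kwh / 1000


per_request_g = inference_co2_grams(tokens_out=500, wh_per_token=0.3,
                                    pue=1.1, grid_g_per_kwh=400)

REQUESTS_PER_DAY = 100e6                                 # traffic figure quoted in the text
daily_tonnes = per_request_g * REQUESTS_PER_DAY / 1e6    # grams -> tonnes
yearly_megatonnes = daily_tonnes * 365 / 1e6             # tonnes -> megatonnes

print(f"{per_request_g:.0f} g per request")
print(f"{daily_tonnes:,.0f} t/day, {yearly_megatonnes:.1f} Mt/year")
```

Keeping each factor as a separate argument makes it easy to see which one a given mitigation changes: quantization lowers `wh_per_token`, region choice lowers `grid_g_per_kwh`, and shorter responses lower `tokens_out`.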

Key Insight
Inference-Aware Scaling (the Sardana-Frankle Correction)

Chinchilla scaling (Hoffmann et al., 2022) chooses the parameter count $N$ and pretrain tokens $D$ that minimize training-only loss subject to a fixed compute budget $C = 6ND$. Sardana et al. (2024) show that this is the wrong objective once you also pay inference cost. The total lifetime FLOPs is

$$C_{\mathrm{life}}(N, D, R) \;=\; \underbrace{6ND}_{\text{training}} \;+\; \underbrace{2\,N\,R\,\bar{T}_{\mathrm{out}}}_{\text{inference: 2N FLOPs/token}},$$

where $R$ is the expected request count over the model's deployment lifetime and $\bar{T}_{\mathrm{out}}$ is the average number of tokens per request. The inference-aware optimum shifts toward smaller, longer-trained models: for the same loss target, doubling $R$ reduces the optimal $N$ by a factor of roughly $2^{1/3}$. This is the formal reason Llama-3 8B and Mistral-7B are trained on far more tokens than Chinchilla recommends; the lifetime carbon footprint actually decreases when you do so. Inference-aware scaling is the link between pretraining choices and operational sustainability.
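To see how the two terms of the lifetime-FLOPs formula compare, the sketch below evaluates them for two configurations. The helper name `lifetime_flops` and the deployment figures ($R = 10^{11}$ requests, 200 tokens per request) are assumptions chosen for illustration.

```python
def lifetime_flops(n_params: float, train_tokens: float,
                   requests: float, avg_tokens_out: float) -> tuple[float, float]:
    """Return (training FLOPs, inference FLOPs) using 6ND and 2*N*R*T_out."""
    training = 6 * n_params * train_tokens
    inference = 2 * n_params * requests * avg_tokens_out
    return training, inference


R, T_OUT = 1e11, 200  # hypothetical deployment: requests and tokens per request

small_train, small_infer = lifetime_flops(7e9, 2e12, R, T_OUT)      # 7B on 2T tokens
large_train, large_infer = lifetime_flops(70e9, 1.4e12, R, T_OUT)   # 70B on 1.4T tokens

for name, tr, inf in [("7B", small_train, small_infer), ("70B", large_train, large_infer)]:
    print(f"{name}: train {tr:.2e}, inference {inf:.2e}, lifetime {tr + inf:.2e}")
```

With these assumed volumes the inference term exceeds the training term for both models, and it scales linearly with $N$, which is why the optimum moves toward smaller models as $R$ grows.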

Algorithm 55.1.1: Inference-Aware Compute-Optimal Allocation
Algorithm: INFERENCE-AWARE-SCALING (Sardana & Frankle, 2024)
Input:  Lifetime FLOP budget C_life,
        expected request volume R, avg tokens out T_bar,
        Chinchilla loss model L(N, D) = a/N^alpha + b/D^beta + L_irreducible
Output: optimal parameter count N*, training tokens D*

  // Inference-cost-per-token = 2N FLOPs (forward only)
  // Solve constrained minimization:
  //   minimize L(N, D)
  //   subject to 6 N D + 2 N R T_bar <= C_life

  Construct Lagrangian:
    Lagr(N, D, lambda) = L(N, D) + lambda*(6 N D + 2 N R T_bar - C_life)

  Solve dLagr/dN = 0, dLagr/dD = 0, dLagr/dlambda = 0 jointly.
  Under Chinchilla form, this yields the modified optimum:
    N* satisfies   alpha * a / N*^{alpha+1} = lambda * (6 D* + 2 R T_bar)
    D* satisfies   beta  * b / D*^{beta+1}  = lambda * 6 N*

  // As R grows, the inference term 2 N R T_bar dominates
  // and the optimum shifts to (smaller N, larger D)

  Return (N*, D*, lambda*)
Code Fragment 55.1.1a: The inference-aware scaling algorithm adds the term 2 N R T_bar (inference FLOPs) to the standard Chinchilla compute budget, then re-derives the optimum via Lagrangian. As deployed request volume R grows, the optimum shifts toward smaller models trained on more tokens, which is why the modern 7B-class checkpoint is both cheaper and lower-carbon to operate at scale.
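The constrained minimization above can also be approximated numerically. The sketch below swaps the Lagrangian solve for a plain grid search over $N$, with $D$ determined by the budget constraint. The loss coefficients are the parametric fit reported by Hoffmann et al. (2022) and are treated here as an assumption; the budget, request volume, and function names are hypothetical.

```python
# Chinchilla parametric loss fit from Hoffmann et al. (2022), used as an assumption:
# L(N, D) = A / N**ALPHA + B / D**BETA + E
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69


def loss(n_params: float, train_tokens: float) -> float:
    return A / n_params**ALPHA + B / train_tokens**BETA + E


def inference_aware_optimum(c_life, requests, avg_tokens_out, steps=4000):
    """Grid-search N; D follows from the budget 6*N*D + 2*N*R*T_bar = C_life."""
    inference_tokens = requests * avg_tokens_out
    best = None
    for i in range(steps):
        n = 10 ** (8 + 4 * i / (steps - 1))           # 1e8 .. 1e12 parameters
        d = c_life / (6 * n) - inference_tokens / 3   # tokens left for training
        if d <= 0:
            continue                                  # inference alone exceeds the budget
        candidate = (loss(n, d), n, d)
        if best is None or candidate < best:
            best = candidate
    if best is None:
        raise ValueError("budget too small for any model size on the grid")
    return best[1], best[2]


C_LIFE = 1e24  # hypothetical lifetime FLOP budget
n_train_only, d_train_only = inference_aware_optimum(C_LIFE, requests=0, avg_tokens_out=200)
n_deployed, d_deployed = inference_aware_optimum(C_LIFE, requests=1e11, avg_tokens_out=200)

print(f"R = 0:    N* = {n_train_only:.2e}, D* = {d_train_only:.2e}")
print(f"R = 1e11: N* = {n_deployed:.2e}, D* = {d_deployed:.2e}")
```

A grid search is slower than solving the first-order conditions, but it makes the qualitative behavior easy to inspect: adding request volume under the same budget moves the optimum to a smaller model trained on more tokens.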

Empirically, for typical foundation-model deployments ($R \sim 10^{11}$ requests over 18 months at $\bar{T}_{\mathrm{out}} = 200$), the Sardana-corrected optimum is $N \approx 0.5\,N_{\mathrm{Chinchilla}}$ with $D \approx 4\,D_{\mathrm{Chinchilla}}$: a much smaller, much-longer-trained model. The carbon argument and the cost argument coincide here, which is why the trend toward 7B-class, multi-trillion-token models is both economically and environmentally driven (Sardana et al., 2024).

# Estimating training carbon footprint from first principles
import dataclasses


@dataclasses.dataclass
class TrainingCarbonEstimate:
    """Estimate CO2 emissions for an LLM training run."""

    total_flops: float  # Total FLOPs for the training run
    gpu_peak_tflops: float  # Peak TFLOPS of a single GPU
    gpu_utilization: float  # Typical MFU (model FLOPs utilization)
    gpu_power_watts: float  # TDP of a single GPU
    num_gpus: int  # Number of GPUs used
    pue: float  # Power Usage Effectiveness of the data center
    carbon_intensity: float  # gCO2/kWh of the electricity grid

    def compute(self) -> dict:
        # Effective TFLOPS per GPU
        effective_tflops = self.gpu_peak_tflops * self.gpu_utilization
        # Total GPU-hours needed
        total_gpu_hours = self.total_flops / (effective_tflops * 1e12 * 3600)
        # Wall-clock hours (parallelized across GPUs)
        wall_hours = total_gpu_hours / self.num_gpus
        # Energy consumption (W * GPUs * hours / 1000 = kWh)
        gpu_energy_kwh = self.gpu_power_watts * self.num_gpus * wall_hours / 1000
        total_energy_kwh = gpu_energy_kwh * self.pue
        # Carbon emissions (kWh * g/kWh / 1000 = kg)
        co2_kg = total_energy_kwh * self.carbon_intensity / 1000
        co2_tonnes = co2_kg / 1000
        return {
            "total_gpu_hours": total_gpu_hours,
            "wall_clock_hours": wall_hours,
            "gpu_energy_kwh": gpu_energy_kwh,
            "total_energy_kwh": total_energy_kwh,
            "co2_kg": co2_kg,
            "co2_tonnes": co2_tonnes,
        }


# Example: Estimating a 70B parameter model training run
# Using Chinchilla-optimal ~1.4T tokens, ~6 FLOPs per token per param
estimate = TrainingCarbonEstimate(
    total_flops=70e9 * 1.4e12 * 6,  # ~5.88e23 FLOPs
    gpu_peak_tflops=989,  # H100 SXM BF16 peak
    gpu_utilization=0.40,  # 40% MFU is typical
    gpu_power_watts=700,  # H100 SXM TDP
    num_gpus=2048,  # Typical large training cluster
    pue=1.1,  # Hyperscale data center
    carbon_intensity=400,  # gCO2/kWh, US average grid
)
result = estimate.compute()
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
Output:
total_gpu_hours: 412,875.0
wall_clock_hours: 201.6
gpu_energy_kwh: 289,012.5
total_energy_kwh: 317,913.7
co2_kg: 127,165.5
co2_tonnes: 127.2
Code Fragment 55.1.1b: A TrainingCarbonEstimate dataclass that combines GPU-hours, per-GPU power (approximated by TDP), datacenter PUE, and regional grid carbon intensity into a single kg-CO2eq number. Stepping through it forces explicit assumptions: the same training run can vary by 10x in stated emissions depending on which factors you ignore.
Library Shortcut: CodeCarbon for Carbon Footprint Tracking

The same result in 4 lines with CodeCarbon, which measures actual power draw instead of estimating:

# pip install codecarbon
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(project_name="llm-training")
tracker.start()
# ... your training code here ...
emissions_kg = tracker.stop()
print(f"CO2: {emissions_kg:.4f} kg, Energy: {tracker.final_emissions_data.energy_consumed:.4f} kWh")
Code Fragment 55.1.1c: Same accounting in 4 lines using CodeCarbon. The tracker uses NVML and Intel RAPL to read live power-draw counters during training rather than estimating from TDP, so the numbers reflect what your specific cluster actually consumes (not what the spec sheet predicts at peak load).
# Computing and comparing efficiency metrics across models
def efficiency_metrics(model: dict) -> dict:
    """Compute energy, CO2, and per-parameter efficiency for one training run."""
    gpu_energy_kwh = (
        model["gpu_tdp_w"] * model["num_gpus"]
        * model["training_hours"] / 1000
    )
    total_energy_kwh = gpu_energy_kwh * model["pue"]
    return {
        "total_energy_kwh": total_energy_kwh,
        "co2_tonnes": total_energy_kwh * model["grid_co2_g_kwh"] / 1e6,
        "flops_per_token": 6 * model["params"],
        "energy_per_param_wh": (total_energy_kwh * 1000) / model["params"],
        "tokens_per_kwh": model["tokens"] / total_energy_kwh,
    }

models = [
    {
        "name": "Llama 2 7B",
        "params": 7e9, "tokens": 2e12,
        "num_gpus": 256, "training_hours": 480,
        "gpu_tdp_w": 400, "pue": 1.1, "grid_co2_g_kwh": 400,
    },
    {
        "name": "Llama 2 70B",
        "params": 70e9, "tokens": 2e12,
        "num_gpus": 2048, "training_hours": 1720,
        "gpu_tdp_w": 400, "pue": 1.1, "grid_co2_g_kwh": 400,
    },
]
for m in models:
    mx = efficiency_metrics(m)
    print(f"\n=== {m['name']} ===")
    print(f"Total energy: {mx['total_energy_kwh']:,.0f} kWh")
    print(f"CO2 emissions: {mx['co2_tonnes']:,.1f} tonnes")
    print(f"FLOPs/token: {mx['flops_per_token']:.2e}")
    print(f"Energy/param: {mx['energy_per_param_wh']:.4f} Wh")
    print(f"Tokens/kWh: {mx['tokens_per_kwh']:,.0f}")
Output:

=== Llama 2 7B ===
Total energy: 54,067 kWh
CO2 emissions: 21.6 tonnes
FLOPs/token: 4.20e+10
Energy/param: 0.0077 Wh
Tokens/kWh: 36,991,004

=== Llama 2 70B ===
Total energy: 1,549,926 kWh
CO2 emissions: 620.0 tonnes
FLOPs/token: 4.20e+11
Energy/param: 0.0221 Wh
Tokens/kWh: 1,290,384
Code Fragment 55.1.1d: Normalizes energy and CO2 by parameter count and token throughput so that a 7B and a 70B run can be compared fairly. The tokens_per_kwh value falls from about 37.0 million to about 1.29 million between the Llama-2 7B and 70B configurations, a roughly 29x drop for a 10x increase in parameters under these illustrative inputs. Energy efficiency per training token degrades faster than model size grows, which motivates the architectural moves in Section 55.2.
Real-World Scenario
DeepSeek-V3 vs Llama 3 70B Training Carbon

Who: Two open-weight frontier labs publishing detailed compute disclosures in 2024-2025.

Situation: DeepSeek-V3 reports a training cost of roughly $5.6M, consuming approximately 2.788M H800 GPU-hours. Meta's Llama 3.1 model card reports approximately 7.0M H100 GPU-hours for the 70B model and about 30.8M for the 405B model, with location-based emissions of roughly 2,040 and 8,930 tCO2e respectively (about 11,000 tCO2e for the pair).

Estimate: Applying the formula in Code Fragment 55.1.1b with H800 power ~350W and the Chinese eastern-grid carbon intensity of ~580 gCO2/kWh: $2.788 \times 10^6 \times 0.35\,\mathrm{kW} \times 1.1\,\mathrm{PUE} \times 580\,\mathrm{g/kWh} \approx 623\,\mathrm{tCO_2e}$. Using the US grid average instead gives ~430 tCO2e. The often-cited ~1,100 tCO2e estimate from Patterson et al., 2024 incorporates embodied hardware carbon and longer wall-clock from validation runs.

Comparison: On the model card's figures, the 70B and 405B models together (~11,000 tCO2e) emit roughly ten to eighteen times as much as DeepSeek-V3 alone, depending on whether the ~1,100 or the ~623 tCO2e estimate is used; the 70B model on its own (~2,040 tCO2e) emits about two to three times as much. The MoE-vs-dense architecture choice is the dominant factor (DeepSeek-V3 is a 671B-parameter MoE with only 37B active per token).

Lesson: Even with such large training disparities, both Meta and DeepSeek will see inference dominate cumulative carbon within months of launch (the same break-even logic from the Key Insight above). The decision that matters most for lifetime emissions is therefore architectural (sparsity, parameter efficiency) and deployment-side (quantization, region), not training-budget headlines. See Patterson et al., 2024 for the full lifecycle methodology.
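The Estimate step in this scenario is simple enough to reproduce in a few lines. The sketch below is a minimal check of that arithmetic: the function name `gpu_hours_to_tonnes` is an assumption, and the power, PUE, and grid figures are the scenario's stated assumptions rather than measured values.

```python
def gpu_hours_to_tonnes(gpu_hours: float, gpu_kw: float,
                        pue: float, grid_g_per_kwh: float) -> float:
    """tCO2e = GPU-hours x kW per GPU x PUE x gCO2/kWh / 1e6."""
    energy_kwh = gpu_hours * gpu_kw * pue
    return energy_kwh * grid_g_per_kwh / 1e6


DEEPSEEK_V3_GPU_HOURS = 2.788e6  # H800 GPU-hours quoted in the scenario

china_east = gpu_hours_to_tonnes(DEEPSEEK_V3_GPU_HOURS, gpu_kw=0.35,
                                 pue=1.1, grid_g_per_kwh=580)
us_average = gpu_hours_to_tonnes(DEEPSEEK_V3_GPU_HOURS, gpu_kw=0.35,
                                 pue=1.1, grid_g_per_kwh=400)

print(f"Eastern China grid: {china_east:.0f} tCO2e")
print(f"US average grid:    {us_average:.0f} tCO2e")
```

Because the estimate is linear in every input, doubling the assumed per-GPU power (for example, using a 700 W figure instead of 350 W) doubles the result, which is one reason published estimates for the same run differ.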

Key Insight

Inference dominates training in cumulative emissions. A model trained once for $10 million may serve billions of inference requests over its lifetime. Patterson et al. (2022) found that for widely deployed models, inference energy can exceed training energy by 10x or more within the first year of deployment. Optimizing inference efficiency (through quantization, distillation, speculative decoding, and caching) therefore has a larger cumulative impact than optimizing training efficiency. The serving optimizations from Section 70.5 are as much about environmental responsibility as they are about cost reduction.
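A rough break-even calculation shows how quickly inference can overtake training. This is a sketch under stated assumptions: the 300,000 kWh training energy, the 1M requests per day, and the function name `break_even_requests` are all hypothetical, while the 500 tokens per request and 0.3 Wh per token reuse the example values from earlier in the section. PUE is omitted because it multiplies both sides equally.

```python
def break_even_requests(train_energy_kwh: float, avg_tokens_out: float,
                        wh_per_token: float) -> float:
    """N_break = training energy / (tokens per request * energy per token)."""
    return train_energy_kwh * 1000 / (avg_tokens_out * wh_per_token)


TRAIN_ENERGY_KWH = 300_000     # hypothetical training energy (IT level)
REQUESTS_PER_DAY = 1_000_000   # hypothetical steady traffic

n_break = break_even_requests(TRAIN_ENERGY_KWH, avg_tokens_out=500, wh_per_token=0.3)
days_to_break_even = n_break / REQUESTS_PER_DAY

# Ratio of one year of inference energy to the one-off training energy
first_year_ratio = 365 * REQUESTS_PER_DAY / n_break

print(f"Break-even after {n_break:,.0f} requests ({days_to_break_even:.1f} days)")
print(f"First-year inference / training energy: {first_year_ratio:.1f}x")
```

The ratio depends entirely on traffic: at a thousandth of this request volume, the same model would take years to reach break-even, so the claim applies to widely deployed models specifically.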

55.1.2 Training efficiency metrics

Comparing the environmental cost of different models requires standardized metrics. Raw energy consumption is not directly comparable between a 7B model trained on 2T tokens and a 70B model trained on 1.4T tokens. The following metrics normalize for model size and training duration, enabling meaningful comparisons.

Table 55.1.1e: Training Efficiency Metrics (as of 2026).
| Metric | Formula | What It Captures | Typical Range |
| --- | --- | --- | --- |
| FLOPs per Token | $\approx 6 \times N$ (forward + backward) | Computational cost per training token | 4.2e10 (7B) to 4.2e11 (70B) |
| Energy per Parameter | $\frac{E_\text{total}}{N}$ | Amortized energy cost of each weight | 0.01 to 0.1 Wh/param |
| CO2 per Experiment | $E_\text{total} \times I_\text{grid}$, where $E_\text{total} = E_\text{IT} \times \text{PUE}$ | Total carbon for one complete run | 1 to 500+ tonnes |
| Model FLOPs Utilization (MFU) | Achieved FLOP/s / Peak FLOP/s | Hardware efficiency; higher is greener | 30% to 55% |
| Tokens per kWh | Total tokens / $E_\text{total}$ | Training data throughput per unit energy | Varies by hardware generation |
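Two of the metrics in the table, MFU and tokens per kWh, can be derived from observed training throughput. The sketch below assumes the $6N$ FLOPs-per-token approximation from the table; the throughput figure, cluster size, and function names are hypothetical values chosen for illustration.

```python
def model_flops_utilization(n_params: float, tokens_per_second: float,
                            num_gpus: int, peak_tflops_per_gpu: float) -> float:
    """MFU = achieved FLOP/s (6 * N * tokens/s) / aggregate peak FLOP/s."""
    achieved = 6 * n_params * tokens_per_second
    peak = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved / peak


def tokens_per_kwh(tokens_per_second: float, num_gpus: int,
                   gpu_power_watts: float, pue: float) -> float:
    """Training tokens processed per kWh of facility energy."""
    facility_kw = num_gpus * gpu_power_watts * pue / 1000
    return tokens_per_second * 3600 / facility_kw


# Hypothetical 70B run on 2048 H100s observing 1.93M tokens/s
mfu = model_flops_utilization(70e9, 1.93e6, num_gpus=2048, peak_tflops_per_gpu=989)
tpk = tokens_per_kwh(1.93e6, num_gpus=2048, gpu_power_watts=700, pue=1.1)

print(f"MFU: {mfu:.1%}")
print(f"Tokens per kWh: {tpk:,.0f}")
```

Both metrics move together: at fixed power, any throughput gain raises MFU and tokens per kWh by the same factor, which is why higher MFU is described as greener.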

The Chinchilla scaling laws (Hoffmann et al., 2022) have an important environmental implication: compute-optimal training scales model size and data volume in roughly equal proportion as the compute budget grows. Many early large models were undertrained relative to the Chinchilla optimum, meaning they consumed more training energy than necessary to achieve a given performance level. Following compute-optimal scaling is therefore both a performance strategy and a green AI strategy for training; once deployment is counted, the inference-aware correction described earlier in this section pushes the optimum further toward smaller, longer-trained models.
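A common way to apply this in practice is the rule of thumb of roughly 20 training tokens per parameter that is often derived from the Chinchilla results. The sketch below assumes that ratio and the $C = 6ND$ approximation; the function name `chinchilla_split` and the default ratio are illustrative assumptions, not values fixed by this section.

```python
import math


def chinchilla_split(compute_flops: float,
                     tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a training budget C = 6*N*D assuming D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    train_tokens = tokens_per_param * n_params
    return n_params, train_tokens


# The 5.88e23 FLOP budget used in the 70B example earlier in this section
n_opt, d_opt = chinchilla_split(5.88e23)
print(f"N = {n_opt:.2e} parameters, D = {d_opt:.2e} tokens")
```

Under this assumed ratio the budget maps back to about 70B parameters and 1.4T tokens, matching the configuration used in the training-carbon example.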

What's Next

Now that we have a measurement vocabulary, the next step is to act on it. Section 55.2: Reducing the Footprint covers the three layers where mitigation operates (model architecture, hardware substrate, operating choices) and introduces the leaderboards and tracking tools that turn carbon accounting into a daily engineering habit.

Further Reading

Foundational measurement papers

Strubell, E., Ganesh, A., and McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP." Proceedings of ACL 2019. Landmark study quantifying the carbon emissions of training large NLP models, finding that training a single Transformer can emit as much CO2 as five cars over their lifetimes. Catalyzed the green AI movement and remains a key reference for environmental impact discussions.
Patterson, D. et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350. Comprehensive measurement of carbon emissions from training large neural networks at Google, with practical recommendations for reducing environmental impact. Provides the empirical basis for many of this section's quantitative claims, and the updated 2024 version covers DeepSeek-V3 and Llama 3 lifecycle accounting.
Sardana, N., Portes, J., Doubov, S., and Frankle, J. (2024). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448. Derives the inference-aware optimum that shifts compute allocation toward smaller, longer-trained models. The formal basis for the 7B-class checkpoint trend.
Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556 (Chinchilla). Demonstrates that most large models are over-parameterized relative to their training data, showing that compute-optimal training can achieve better performance with less energy by balancing model size and data volume.
Lacoste, A. et al. (2019). "Quantifying the Carbon Emissions of Machine Learning." arXiv:1910.09700. Proposes a standardized methodology for estimating ML carbon emissions based on hardware, runtime, and energy grid carbon intensity. The methodology underpins several carbon tracking tools referenced later in this chapter.