Reducing the Footprint

Section 55.2

"Every gigawatt-hour I do not consume is one I do not have to apologize for. Counting watts is cheaper than counting carbon."

Sage: A Frugal Optimizer, Watt-Counting AI Agent
Big Picture

Diagnosing the problem is half the work; the other half is acting on it. This section walks through the three layers where LLM mitigation operates: model architecture (MoE, distillation, quantization), hardware substrate (GPU choice, data-center PUE), and operating choices (region selection, batch scheduling). It then surveys the carbon-tracking tools that turn each layer's wins into auditable numbers for an LLM training run or agent inference pipeline. We close with the rebound effect: the economic phenomenon that erases efficiency gains when product teams reinvest them into more LLM usage rather than fewer emissions.

Fun Fact: The "deploy near hydro" trick

Google Cloud's europe-north1 region (Hamina, Finland) draws over 95 percent of its annual energy from hydroelectric power, with the rest from wind. Routing inference there instead of us-central1 (Iowa, ~40 percent coal) can cut your per-token grid carbon by an order of magnitude with one config change and zero accuracy loss. The hard part is that data-sovereignty rules often force you back to the high-carbon region the moment your users are in the US, which is why "deploy near hydro" is half a trick and half a luxury.

[Figure: bar chart showing relative carbon reduction across four mitigation strategies (region selection, MoE vs dense, batch inference, quantization)]
Figure 55.2.1: Relative carbon reduction by mitigation lever. Region selection compounds with everything else, which is why it sits first in the playbook even though it is the most political to change.

Prerequisites

This section assumes the measurement framework from Section 55.1: Quantifying the Environmental Cost (FLOPs per token, PUE, grid carbon intensity, the inference-aware scaling argument). Familiarity with quantization from Section 9.3 and PEFT methods from Section 17.5 helps interpret the architectural strategies in 55.2.1.

55.2.1 Strategies for reducing environmental footprint

Mitigation operates at three layers: the model itself (sparse architectures, distillation, quantization), the hardware substrate (efficient accelerators, optimized data centers), and the operating choices (when and where to schedule training, how to retire stale checkpoints). We treat each layer in turn, beginning with architectural choices that reduce energy at the source.

55.2.1.1 Efficient Architectures

Sparse models, particularly Mixture of Experts (MoE) architectures, activate only a fraction of their total parameters for each input token. A model like Mixtral 8x7B has 47B total parameters but activates only ~13B per token, achieving performance competitive with dense 70B models at a fraction of the inference cost. This sparsity translates directly to energy savings during both training and inference. The architecture deep dive in Chapter 8 covers MoE design in detail.

# Comparing dense vs. MoE energy efficiency
def compare_architectures(
    dense_params: float,
    moe_total_params: float,
    moe_active_params: float,
    tokens: float,
    flops_per_param_token: float = 6.0,
    hardware_tflops: float = 989.0,
    mfu: float = 0.40,
    gpu_power_w: float = 700,
    pue: float = 1.1,
    co2_g_kwh: float = 400,
) -> dict:
    """Compare training energy use of dense vs MoE models.

    moe_total_params is accepted for reference only: compute, and therefore
    energy, depends on the parameters that are active per token.
    """
    energy = {}   # unformatted kWh per architecture, used for the savings figure
    results = {}
    for name, active in [("Dense", dense_params), ("MoE", moe_active_params)]:
        total_flops = active * tokens * flops_per_param_token
        effective_flops_per_s = hardware_tflops * mfu * 1e12
        gpu_hours = total_flops / effective_flops_per_s / 3600
        energy_kwh = gpu_power_w * gpu_hours / 1000 * pue
        co2_kg = energy_kwh * co2_g_kwh / 1000
        energy[name] = energy_kwh
        results[name] = {
            "active_params": f"{active/1e9:.0f}B",
            "energy_kwh": f"{energy_kwh:,.0f}",
            "co2_kg": f"{co2_kg:,.0f}",
        }
    # Savings are computed once, after both architectures have been evaluated
    savings = (1 - energy["MoE"] / energy["Dense"]) * 100
    results["energy_savings"] = f"{savings:.1f}%"
    return results

result = compare_architectures(
    dense_params=70e9, moe_total_params=47e9,
    moe_active_params=13e9, tokens=2e12,
)
for key, value in result.items():
    print(f"{key}: {value}")
Output:
Dense: {'active_params': '70B', 'energy_kwh': '454,162', 'co2_kg': '181,665'}
MoE: {'active_params': '13B', 'energy_kwh': '84,344', 'co2_kg': '33,738'}
energy_savings: 81.4%
Code Fragment 55.2.1a: compare_architectures() estimates total training compute (and therefore energy and CO2) for a dense 70B model versus a Mixture-of-Experts model with 47B total but only 13B active parameters. The number that matters is moe_active_params: MoE buys quality from total capacity but pays for compute only on the active experts, often 4 to 8x cheaper per inference token.

55.2.1.2 Hardware-Aware Training Decisions

Choosing the data center location based on carbon intensity is one of the highest-leverage decisions you can make. Cloud providers now expose carbon data for their regions: Google Cloud's Carbon Footprint dashboard, AWS's Customer Carbon Footprint Tool, and Azure's Sustainability Calculator all provide per-region emission factors. Scheduling training runs during periods of low carbon intensity (around midday in regions with significant solar capacity, or during windy periods in regions with wind farms) can further reduce emissions, a practice called carbon-aware computing.

# Carbon-aware region selection for training jobs
import dataclasses
from typing import Optional

@dataclasses.dataclass
class CloudRegion:
    name: str
    provider: str
    co2_g_kwh: float          # grid carbon intensity, gCO2 per kWh
    gpu_cost_per_hour: float  # USD per GPU-hour
    renewable_pct: float      # renewable share of the grid mix, 0 to 1

# Example values; look up current figures before relying on them
REGIONS = [
    CloudRegion("us-east-1", "AWS", 380, 3.06, 0.30),
    CloudRegion("us-west-2", "AWS", 120, 3.06, 0.72),
    CloudRegion("eu-north-1", "AWS", 25, 3.40, 0.95),
    CloudRegion("ca-central-1", "AWS", 30, 3.06, 0.82),
    CloudRegion("ap-south-1", "AWS", 700, 2.74, 0.18),
    CloudRegion("europe-north1", "GCP", 30, 3.22, 0.92),
    CloudRegion("us-central1", "GCP", 450, 3.22, 0.35),
    CloudRegion("swedencentral", "Azure", 20, 3.40, 0.95),
]

GPU_POWER_W = 700  # per-GPU power draw assumed for every region
PUE = 1.1          # data-center overhead assumed for every region

def recommend_region(
    gpu_hours: float,
    max_cost: Optional[float] = None,
) -> list[dict]:
    """Rank regions by carbon efficiency within budget."""
    candidates = []
    for r in REGIONS:
        cost = r.gpu_cost_per_hour * gpu_hours
        if max_cost is not None and cost > max_cost:
            continue  # over budget, skip this region
        energy_kwh = GPU_POWER_W * gpu_hours / 1000 * PUE
        co2_kg = energy_kwh * r.co2_g_kwh / 1000
        candidates.append({
            "region": f"{r.provider}/{r.name}",
            "co2_kg": round(co2_kg, 1),
            "cost_usd": round(cost, 0),
            "renewable": f"{r.renewable_pct:.0%}",
        })
    # sorted() is stable, so regions with equal CO2 keep their REGIONS order
    return sorted(candidates, key=lambda x: x["co2_kg"])

for r in recommend_region(10_000, max_cost=50_000)[:5]:
    print(f"{r['region']:25s} CO2: {r['co2_kg']:7.1f} kg"
          f" Cost: ${r['cost_usd']:,.0f} Renewable: {r['renewable']}")
Output:
Azure/swedencentral       CO2:   154.0 kg Cost: $34,000 Renewable: 95%
AWS/eu-north-1            CO2:   192.5 kg Cost: $34,000 Renewable: 95%
AWS/ca-central-1          CO2:   231.0 kg Cost: $30,600 Renewable: 82%
GCP/europe-north1         CO2:   231.0 kg Cost: $32,200 Renewable: 92%
AWS/us-west-2             CO2:   924.0 kg Cost: $30,600 Renewable: 72%
Code Fragment 55.2.2: A CloudRegion dataclass with co2_g_kwh and gpu_cost_per_hour fields, plus a recommend_region() helper that ranks the regions under a cost ceiling from lowest to highest carbon. Region selection routinely cuts a training run's emissions by 5x at near-zero cost change.
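Region choice covers the "where" of carbon-aware computing; the "when" (shifting a job into low-carbon hours) can be sketched as a sliding-window search over an hourly intensity forecast. This is a minimal illustration, not a scheduler: the forecast values below are hypothetical, and the names Window and best_start_window are introduced here for the example only. In practice the forecast would come from your grid operator or a carbon-intensity data provider.

```python
from dataclasses import dataclass


@dataclass
class Window:
    start_hour: int        # index into the forecast where the job would start
    avg_co2_g_kwh: float   # mean grid intensity over the job's duration


def best_start_window(forecast_g_kwh: list[float], job_hours: int) -> Window:
    """Return the contiguous window with the lowest mean carbon intensity."""
    if job_hours <= 0 or job_hours > len(forecast_g_kwh):
        raise ValueError("job_hours must be between 1 and len(forecast)")
    window_sum = sum(forecast_g_kwh[:job_hours])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(forecast_g_kwh) - job_hours + 1):
        # Slide the window one hour: add the entering hour, drop the leaving one
        window_sum += forecast_g_kwh[start + job_hours - 1] - forecast_g_kwh[start - 1]
        if window_sum < best_sum:
            best_sum, best_start = window_sum, start
    return Window(best_start, best_sum / job_hours)


# Hypothetical 24-hour forecast (gCO2/kWh) for a solar-heavy grid
forecast = [420, 410, 400, 395, 390, 380, 340, 290, 240, 200, 170, 150,
            140, 150, 180, 230, 300, 370, 430, 450, 445, 440, 430, 425]
window = best_start_window(forecast, job_hours=4)
print(f"Start at hour {window.start_hour}, "
      f"mean intensity {window.avg_co2_g_kwh:.1f} gCO2/kWh")
```

The same search works for multi-day jobs if the forecast list is longer; jobs that can be paused and resumed could instead pick the lowest-intensity hours individually rather than one contiguous block.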

55.2.1.3 Distillation and Quantization as Green Alternatives

Rather than training a new large model from scratch, knowledge distillation transfers capabilities from a large "teacher" model to a smaller "student" model. The student trains on the teacher's output distribution rather than on raw data alone, which requires significantly less compute. A distilled 7B student trained to approximate a 70B teacher's behavior typically requires 10 to 100x less compute than the teacher's original training run. The efficient adaptation techniques in Chapter 17 (LoRA, QLoRA) amplify these savings further by fine-tuning only a small fraction of parameters.
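To make "trains on the teacher's output distribution" concrete, here is a minimal sketch of a soft-target distillation loss for a single token position. The text above does not specify a loss, so this assumes the common formulation (KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature, as popularized by Hinton et al., 2015). The logits are made up for illustration, and a real pipeline would compute this over batches with a tensor library.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: list[float],
                      teacher_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student's current prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over a three-token vocabulary
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 0.5]
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
print(f"student differs from teacher: {loss_mismatch:.4f}")
print(f"student matches teacher:      {loss_match:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge; a higher temperature exposes more of the teacher's relative preferences among unlikely tokens.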

Post-training quantization reduces inference energy by representing weights and activations in lower precision (INT8, INT4, or even lower). A model quantized to 4-bit precision needs roughly one-quarter of the weight memory traffic of its 16-bit counterpart, and energy per inference request falls accordingly, although rarely by the full factor of four because activations, the KV cache, and dequantization overhead remain. When deployed at scale across millions of daily requests, the cumulative savings are substantial.
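The memory arithmetic behind that claim, together with the simplest form of weight quantization, can be sketched in a few lines. This assumes symmetric "absmax" quantization with one scale per tensor, which is only one of several schemes; production methods typically use per-group scales and calibration data. The function names and the sample weights are illustrative.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Symmetric quantization: map [-max|w|, +max|w|] onto signed integers."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from integers and the shared scale."""
    return [q * scale for q in quantized]


for bits in (16, 8, 4):
    print(f"7B weights at {bits:2d}-bit: {weight_memory_gb(7e9, bits):.1f} GB")

weights = [0.5, -1.27, 0.03, 1.0]         # made-up weights for illustration
q8, scale8 = quantize_absmax(weights, bits=8)
restored = dequantize(q8, scale8)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"INT8 codes: {q8}, max round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why a single outlier weight (a large max_abs) degrades precision for every other weight in the tensor and why finer-grained scales are preferred in practice.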

55.2.1.4 Reusing Pretrained Models vs. Training from Scratch

The greenest training run is the one you do not perform. Using a pretrained foundation model and adapting it through fine-tuning, LoRA, or prompt engineering avoids the enormous upfront carbon cost of pretraining. A complete LoRA fine-tuning run on a 7B model typically requires 1 to 10 GPU-hours, compared to 100,000+ GPU-hours for pretraining. The decision tree is simple: if an existing model can achieve your target quality with adaptation, do not train from scratch.

Real-World Scenario
Pretraining vs. Fine-Tuning Carbon Comparison

Who: An ML engineering lead at a climate technology startup building a domain-specific language model

Situation: The startup needed a 13B parameter model specialized in climate science terminology. The founding team initially proposed training from scratch on a curated 1T-token corpus of scientific literature to maximize domain accuracy.

Problem: A back-of-envelope estimate revealed that pretraining would require approximately 200,000 A100 GPU-hours and produce roughly 30 tonnes of CO2 at US average grid intensity. For a company whose mission centered on climate impact, this was difficult to justify.

Decision: They chose to fine-tune an existing open-weight 13B model with LoRA (rank 16) on a curated 50K-example domain dataset instead. Full fine-tuning was also evaluated as a middle option at approximately 200 GPU-hours.

Result: LoRA fine-tuning required only 8 A100 GPU-hours and produced approximately 0.001 tonnes of CO2, a 30,000:1 reduction compared to pretraining. Domain accuracy on their benchmark reached 91% of what the team estimated pretraining would achieve.

Lesson: Building on top of existing pretrained models whenever possible yields carbon savings of three to four orders of magnitude with minimal quality loss for most domain-specific applications.
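The scenario's figures can be sanity-checked with the same energy formula used earlier in this section. The sketch below assumes a 400 W A100, a PUE of 1.1, and a grid intensity of 0.4 kg CO2 per kWh; these are assumptions chosen for the example (the scenario states only "US average grid intensity"), so the results land near the quoted numbers rather than matching them exactly.

```python
def training_co2_kg(gpu_hours: float,
                    gpu_power_w: float = 400.0,    # assumed A100 board power
                    pue: float = 1.1,              # assumed data-center overhead
                    grid_kg_per_kwh: float = 0.4,  # assumed grid intensity
                    ) -> float:
    """Estimate emissions as GPU-hours x power x PUE x grid intensity."""
    energy_kwh = gpu_hours * gpu_power_w / 1000 * pue
    return energy_kwh * grid_kg_per_kwh


pretrain_kg = training_co2_kg(200_000)   # from-scratch pretraining
full_ft_kg = training_co2_kg(200)        # full fine-tuning
lora_kg = training_co2_kg(8)             # LoRA fine-tuning

for label, kg in [("Pretraining", pretrain_kg),
                  ("Full fine-tuning", full_ft_kg),
                  ("LoRA", lora_kg)]:
    print(f"{label:18s} {kg / 1000:10.4f} tonnes CO2")
print(f"Pretraining / LoRA ratio: {pretrain_kg / lora_kg:,.0f}:1")
```

Because power, PUE, and grid intensity are identical across the three options, the ratio depends only on GPU-hours; the absolute tonnage shifts with the assumptions, but the orders of magnitude separating the options do not.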

55.2.2 Carbon tracking tools

Carbon accounting for ML experiments requires instrumenting your training pipeline to measure energy consumption in real time. Several open-source tools make this straightforward.

55.2.2.1 CodeCarbon

CodeCarbon is the most widely adopted carbon tracking library for Python ML workflows. It monitors CPU and GPU power draw using hardware-level interfaces (RAPL for Intel CPUs, NVML, the library behind nvidia-smi, for NVIDIA GPUs), combines this with the carbon intensity of your electricity grid (looked up by IP geolocation or manual configuration), and produces a CSV log of emissions per experiment.

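# Tracking training emissions with CodeCarbon
import os
from codecarbon import EmissionsTracker
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments, Trainer,
)

os.makedirs("./carbon_logs", exist_ok=True)
tracker = EmissionsTracker(
    project_name="green-ai-finetune",
    output_dir="./carbon_logs",
    log_level="warning",
    measure_power_secs=30,     # sample power draw every 30 seconds
    tracking_mode="process",   # track this process rather than the whole machine
)
tracker.start()

model_name = "meta-llama/Llama-2-7b-hf"   # gated model: requires accepted license
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Trainer needs token IDs, so tokenize the formatted prompt text first
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names,
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=50,
)
trainer = Trainer(
    model=model, args=training_args, train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

emissions = tracker.stop()   # total emissions in kg CO2eq
print(f"\nTotal emissions: {emissions:.6f} kg CO2")
print(f"Equivalent to {emissions / 0.21:.2f} km driven")   # ~0.21 kg CO2 per km
print(f"Energy consumed: {tracker.final_emissions_data.energy_consumed:.4f} kWh")
Output:
Total emissions: 0.042371 kg CO2
Equivalent to 0.20 km driven
Energy consumed: 0.1284 kWh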
Code Fragment 55.2.3: CodeCarbon's EmissionsTracker wraps a Hugging Face Trainer so every GPU sample taken via NVML is converted into a live CO2eq estimate. The output emissions.csv file becomes part of the model card, turning sustainability claims into auditable numbers.
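Turning the CSV log into a model-card line is mostly a matter of summing rows per project. The sketch below parses an inline sample so it runs without any tracker installed. It assumes the column names project_name, emissions (kg CO2eq), and energy_consumed (kWh) as written by recent CodeCarbon releases; check the header of your own emissions.csv, since the schema can differ between versions. The sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize_emissions(csv_text: str) -> dict[str, dict[str, float]]:
    """Sum emissions and energy per project from CodeCarbon-style CSV text."""
    totals = defaultdict(lambda: {"runs": 0, "co2_kg": 0.0, "energy_kwh": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["project_name"]]
        entry["runs"] += 1
        entry["co2_kg"] += float(row["emissions"])
        entry["energy_kwh"] += float(row["energy_consumed"])
    return dict(totals)


# Invented sample rows; a real file has more columns and your own measurements
sample = """project_name,duration,emissions,energy_consumed
green-ai-finetune,1200.0,0.040,0.120
green-ai-finetune,1300.0,0.050,0.150
baseline-eval,300.0,0.010,0.030
"""

summary = summarize_emissions(sample)
for project, stats in summary.items():
    print(f"{project:20s} runs={stats['runs']} "
          f"CO2={stats['co2_kg']:.3f} kg energy={stats['energy_kwh']:.3f} kWh")
```

To run this against real logs, read the file contents (for example from ./carbon_logs/emissions.csv, the output_dir used above) and pass the text to summarize_emissions().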

55.2.2.2 ML CO2 Impact Calculator

For quick estimates without instrumenting your code, the ML CO2 Impact online calculator (mlco2.github.io/impact) accepts your hardware type, training duration, and cloud provider/region, then returns estimated emissions. This is useful for planning and for retrospectively estimating the footprint of experiments where you did not run a tracker.

55.2.2.3 Climatiq, Boavizta, and other carbon-accounting providers

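CodeCarbon is the de facto open-source default, but two commercial-grade alternatives are worth knowing because they cover what CodeCarbon does not:

Climatiq: a commercial API exposing audited emission factors for compute, networking, and embodied hardware, including locational-marginal and time-of-use grid intensities.
Boavizta: a non-profit consortium publishing open lifecycle data for compute hardware; its boagent daemon couples power-meter telemetry with an embodied-carbon database.

Choose CodeCarbon when you need a single-import drop-in; layer Climatiq or Boavizta on top when your sustainability officer needs a defensible methodology trail. All three publish their emission factors openly, which is a precondition for the EU AI Act disclosures covered in Section 55.3.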

Library Shortcut: ML.energy Leaderboard (2025)

For inference-side numbers without instrumenting anything, the ML.energy Leaderboard (ml.energy/leaderboard) publishes per-model, per-token energy measurements gathered on a standardized H100/A100 testbed. The 2025 snapshot ranks Llama-3, Mistral, Mixtral, DeepSeek-V3, and Qwen variants on a tokens-per-joule axis, with the underlying measurement scripts open-sourced so you can replicate against your own hardware. Pull the live ranking and pick the most efficient model in a few lines:

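# Fetch the public ML.energy leaderboard JSON and sort by joules/token
# NOTE: the endpoint path and the "model" / "j_per_token" field names are used
# as given here; confirm them against the live leaderboard before relying on this.
import requests

LEADERBOARD_URL = "https://ml.energy/leaderboard/api/v1/results.json"
response = requests.get(LEADERBOARD_URL, timeout=30)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
rows = response.json()
best = sorted(rows, key=lambda r: r["j_per_token"])[:5]  # five lowest joules/token
for r in best:
    print(f"{r['model']:25s} {r['j_per_token']:6.3f} J/tok")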
Code Fragment 55.2.4: Pulling the ML.energy leaderboard and sorting by joules-per-token. The output column drives model selection in the same way a model-card BLEU score does, except the optimization target is now energy rather than accuracy.

This is the simplest way to do model selection with carbon as a first-class metric: instead of running your own benchmark, look up the model's leaderboard entry and read the joules-per-token directly.

55.2.2.4 Experiment-Level Carbon Accounting

Integrating carbon tracking into your experiment management system (Weights and Biases, MLflow, or Neptune) allows you to compare the carbon cost of different approaches alongside their performance metrics. This enables Pareto-optimal decisions: choosing the model or hyperparameter configuration that achieves the best quality per unit of carbon emitted.

# Logging carbon metrics alongside training metrics in W&B
import wandb
from codecarbon import EmissionsTracker

def carbon_aware_training_loop(config: dict):
    """Training loop with integrated carbon tracking.

    train_one_epoch() and evaluate() are placeholders for your own training
    and validation steps; each should return a float loss.
    """
    wandb.init(project="green-ai", config=config)
    tracker = EmissionsTracker(
        project_name=wandb.run.name, log_level="error",
    )
    tracker.start()
    try:
        for epoch in range(config["epochs"]):
            train_loss = train_one_epoch()
            val_loss = evaluate()
            # _prepare_emissions_data() is a private CodeCarbon method and may
            # change between releases; it is used here to read running totals
            current = tracker._prepare_emissions_data()
            wandb.log({
                "train/loss": train_loss,
                "val/loss": val_loss,
                "carbon/co2_kg": current.emissions,
                "carbon/energy_kwh": current.energy_consumed,
                "carbon/loss_per_co2": (
                    val_loss / max(current.emissions, 1e-9)
                ),
            })
    finally:
        # Stop the tracker once, after the last epoch, even if training raised
        emissions = tracker.stop()
        wandb.summary["total_co2_kg"] = emissions
        wandb.finish()
Code Fragment 55.2.5: A wandb.log() call inside the training loop logs CodeCarbon's running emissions and energy totals at the end of each epoch alongside training and validation loss. Plotting CO2 next to loss surfaces the carbon cost of late-epoch overfitting and makes "spend less on training" a quantifiable optimization target.
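Once each run has a logged CO2 total and a validation loss, the Pareto-optimal set mentioned above can be extracted with a single sorted sweep. The sketch below is one minimal way to do it; the Run tuple, the pareto_front() helper, and the run names and numbers are all hypothetical and stand in for values you would export from your experiment tracker.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    co2_kg: float     # total emissions for the run
    val_loss: float   # final validation loss (lower is better)


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep runs that no cheaper-or-equal run matches or beats on loss."""
    front = []
    best_loss = float("inf")
    # Visit runs from lowest to highest CO2; a run survives only if it
    # improves on the best loss achieved by everything cheaper than it
    for run in sorted(runs, key=lambda r: (r.co2_kg, r.val_loss)):
        if run.val_loss < best_loss:
            front.append(run)
            best_loss = run.val_loss
    return front


# Hypothetical results for illustration
runs = [
    Run("lora-r8", 1.2, 1.95),
    Run("lora-r16", 1.4, 1.88),
    Run("lora-r64", 2.9, 1.90),      # costs more than lora-r16 and is worse
    Run("full-ft", 35.0, 1.80),
    Run("full-ft-3ep", 105.0, 1.84), # costs more than full-ft and is worse
]
front = pareto_front(runs)
for run in front:
    print(f"{run.name:12s} {run.co2_kg:7.1f} kg CO2  val_loss={run.val_loss:.2f}")
```

Everything off the front can be discarded without debate; choosing among the points on it (here, whether 0.08 lower loss justifies roughly 25 times the carbon) is the judgment call the metrics are meant to inform.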

55.2.3 The rebound effect

In economics, the rebound effect (known in its strongest form as the Jevons paradox) observes that improvements in energy efficiency often lead to increased total energy consumption because the reduced cost per unit encourages greater usage. This pattern is playing out in the AI industry. Each generation of hardware is more energy-efficient per FLOP, yet total AI energy consumption continues to grow because the efficiency gains are reinvested into training larger models, running more experiments, and serving more inference requests.

Between 2020 and 2025, GPU energy efficiency (measured in FLOP/s per watt) improved by roughly 3x from A100 to H100/H200. During the same period, the total compute used in the largest training runs grew by approximately 10x. The net result is that total energy consumption for frontier model training increased despite hardware efficiency improvements. This pattern suggests that efficiency alone will not solve the environmental challenge; it must be combined with deliberate choices about how much compute to use.

Key Insight

Efficiency without restraint is not sustainability. If every 2x improvement in hardware efficiency is met with a 4x increase in model size, total energy consumption doubles with each generation. The Green AI movement (Schwartz et al., 2020) argues that the research community should treat compute efficiency as a first-class evaluation metric alongside accuracy. Reporting the FLOPs, energy, and carbon cost of experiments, not just their accuracy, creates incentives for developing methods that achieve strong results with less compute. Some conferences (notably NeurIPS and EMNLP) now encourage or require compute and carbon reporting in paper submissions.
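The arithmetic behind both claims (the A100-to-H100 comparison and the 2x/4x hypothetical) is a single division. The sketch below assumes energy is proportional to compute demanded divided by hardware efficiency, with utilization, PUE, and everything else held constant, and treats a 4x larger model as 4x the compute (that is, training tokens held fixed). The function name is introduced here for illustration.

```python
def net_energy_ratio(compute_growth: float, efficiency_gain: float) -> float:
    """Ratio of new to old energy use when compute demand and hardware
    efficiency (FLOP/s per watt) both change, all else held equal."""
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return compute_growth / efficiency_gain


frontier = net_energy_ratio(compute_growth=10, efficiency_gain=3)
hypothetical = net_energy_ratio(compute_growth=4, efficiency_gain=2)
print(f"10x compute on 3x more efficient hardware: {frontier:.1f}x the energy")
print(f"4x compute on 2x more efficient hardware:  {hypothetical:.1f}x the energy")
```

Any ratio above 1.0 means the efficiency gain was more than fully reinvested; total energy falls only when compute demand grows more slowly than efficiency improves.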

What's Next

Mitigation strategies are only half of the answer; the other half is operating under disclosure obligations that increasingly carry legal force. Section 55.3: Operating Under Compliance covers experiment-level energy profiling, the EU AI Act's Article 53 GPAI environmental disclosure requirements, and the practical Green-AI checklist that ties all three sections together.

Further Reading

Mitigation and Green AI

Schwartz, R. et al. (2020). "Green AI." Communications of the ACM, 63(12), 54-63. Proposes the concept of Green AI, arguing that the field should prioritize computational efficiency alongside accuracy. Introduces reporting standards for compute costs that have influenced conference submission requirements.
CodeCarbon. "Track and Reduce CO2 Emissions from Compute." GitHub: mlco2/codecarbon. Open-source Python library that automatically tracks energy consumption and carbon emissions during model training and inference. Drop-in integration for PyTorch and TensorFlow workflows, enabling transparent emissions reporting.
Climatiq. "Carbon Emissions Calculation API." Commercial API exposing audited emission factors for compute, networking, and embodied hardware, including locational-marginal and time-of-use grid intensities.
Boavizta. "Open Lifecycle Carbon Data for Computing Hardware." Non-profit consortium publishing open lifecycle data for compute hardware; the boagent daemon couples power-meter telemetry with an embodied-carbon database.
ML.energy. "ML.energy Leaderboard." Open leaderboard ranking popular LLMs by tokens-per-joule on a standardized H100/A100 testbed. Underlying measurement scripts open-sourced.