"A model card without an emissions column is no longer a complete model card; the regulator counts those numbers even when the engineer does not."
An Audit-Ready Auditor, Compliance-Conscious AI Agent
Measurement (Section 55.1) and mitigation (Section 55.2) are technical disciplines. This section turns them into operating practice: how to profile energy and carbon at the experiment level using tokens-per-joule, which inference strategies actually move the needle once the model is in production, what the EU AI Act's Article 53 GPAI provisions require of model providers in 2026, and a 10-point Green-AI checklist that codifies the largest wins (the 4x region swap, INT4 quantization, semantic caching, MFU tuning) into a single deployment review.
The Article 53 GPAI obligations that take effect in 2025-2027 are almost entirely disclosure rules: keep a technical file, publish a copyright-policy summary, report energy use. The first enforcement actions will not be over a model behaving badly; they will be over a vendor failing to publish the right document by the right date. Compliance teams who treat the Act like a safety regulation will be surprised; those who treat it like SOX-for-models will be ready.
Prerequisites
This section assumes the measurement vocabulary from Section 55.1: Quantifying the Environmental Cost and the mitigation strategies and carbon-tracking tools from Section 55.2: Reducing the Footprint. Familiarity with the EU AI Act compliance landscape from Section 53.1 is helpful for the regulatory discussion in 55.3.2.
55.3.1 Energy and carbon profiling at the experiment level
Moving beyond aggregate estimates, modern tooling enables per-experiment and per-inference carbon profiling. CodeCarbon (introduced in Section 55.2.2.1) provides the instrumentation layer, but two additional resources help contextualize your measurements: the ML.ENERGY Leaderboard, which ranks popular LLMs by energy consumption per output token across standardized benchmarks, and the MELODI benchmark, which evaluates models on a joint accuracy-per-joule metric. Together these tools let teams make informed model selection decisions that balance quality against environmental cost.
55.3.1.1 Tokens-per-Joule as an Efficiency Metric
While tokens-per-kWh is useful for training, inference efficiency is better captured at finer granularity. Tokens per joule (T/J) measures how many output tokens a serving system produces per joule of energy consumed. This metric normalizes across hardware, batch sizes, and request patterns, making it possible to compare a quantized 7B model on consumer GPUs against a dense 70B model on H100 clusters. Higher T/J is better.
| Model | Parameters | Quantization | Energy per 1K tokens | CO2 per 1K tokens (US avg grid) | Tokens/Joule |
|---|---|---|---|---|---|
| Llama-3 8B | 8B | INT4 (GPTQ) | ~0.4 Wh | ~0.16 g | ~0.69 |
| Llama-3 8B | 8B | FP16 | ~1.0 Wh | ~0.40 g | ~0.28 |
| Llama-3 70B | 70B | INT4 (AWQ) | ~3.2 Wh | ~1.28 g | ~0.087 |
| Llama-3 70B | 70B | FP16 | ~8.0 Wh | ~3.20 g | ~0.035 |
| Mixtral 8x7B (MoE) | 47B (13B active) | FP16 | ~1.8 Wh | ~0.72 g | ~0.15 |
| GPT-4o (API) | Undisclosed | Provider-managed | ~5.0 Wh (est.) | ~2.00 g (est.) | ~0.056 (est.) |
Note that these figures are approximate and vary with hardware, batch size, sequence length, and serving framework. The key takeaway is the magnitude of difference: in this table INT4 quantization raises tokens-per-joule by roughly 2.5x, and the 8B model is roughly 8x (at matching precision) to 20x (INT4 8B against FP16 70B) more energy-efficient per token than the 70B model.
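The energy column converts to tokens per joule and to grams of CO2 with two unit conversions (1 Wh = 3,600 J). The following is a minimal sketch of that arithmetic, assuming the 400 g CO2/kWh US-average grid intensity implied by the table's CO2 column; the function names are illustrative, not part of any library.

```python
JOULES_PER_WH = 3600.0  # 1 watt-hour = 3600 joules


def tokens_per_joule(wh_per_1k_tokens: float) -> float:
    """Convert energy per 1,000 output tokens (Wh) into tokens per joule."""
    return 1000.0 / (wh_per_1k_tokens * JOULES_PER_WH)


def co2_grams_per_1k_tokens(wh_per_1k_tokens: float,
                            grid_g_per_kwh: float = 400.0) -> float:
    """Grams of CO2 per 1,000 tokens; 400 g/kWh is an assumed US-average grid."""
    return wh_per_1k_tokens / 1000.0 * grid_g_per_kwh


# Llama-3 8B rows from the table: INT4 at ~0.4 Wh, FP16 at ~1.0 Wh per 1K tokens.
int4 = tokens_per_joule(0.4)
fp16 = tokens_per_joule(1.0)
print(f"INT4: {int4:.2f} T/J, FP16: {fp16:.2f} T/J, ratio: {int4 / fp16:.1f}x")
```

Computing T/J from measured energy, rather than copying it from a leaderboard, keeps the metric tied to your own hardware and batch configuration.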
55.3.1.2 Green Inference Strategies
Five strategies reduce inference energy consumption without retraining:
- Model distillation. A 7B student distilled from a 70B teacher delivers 80 to 90% of the teacher's quality at one-tenth the energy per request. See Section 17.5.
- Post-training quantization. INT4/INT8 quantization (via GPTQ, AWQ, or bitsandbytes) cuts memory bandwidth and energy by 2 to 4x with minimal quality degradation. See Section 9.3.
- Batch scheduling. Accumulating requests into larger batches improves GPU utilization. Continuous batching (vLLM, TGI) keeps the GPU occupied rather than idle between requests.
- Semantic caching. Caching responses to frequently asked or semantically similar queries eliminates redundant computation entirely. See Section 70.5.
- Region-aware routing. Route inference requests to data center regions with the lowest real-time carbon intensity. Cloud providers expose carbon-intensity APIs that enable dynamic routing decisions.
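Region-aware routing, the last strategy above, reduces to a constrained minimum: among the regions that meet the latency budget, pick the one with the lowest current carbon intensity. The sketch below assumes you already have a snapshot of intensities and latencies; the region names, numbers, and the `pick_region` helper are all hypothetical, and no specific provider API is implied.

```python
from typing import Mapping, Optional


def pick_region(carbon_g_per_kwh: Mapping[str, float],
                latency_ms: Mapping[str, float],
                max_latency_ms: float) -> Optional[str]:
    """Return the lowest-carbon region that fits the latency budget, else None."""
    eligible = [
        region for region in carbon_g_per_kwh
        if latency_ms.get(region, float("inf")) <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda region: carbon_g_per_kwh[region])


# Illustrative snapshot; real values would come from a carbon-intensity feed.
intensity = {"us-central1": 540.0, "europe-north1": 130.0, "asia-east1": 600.0}
latency = {"us-central1": 40.0, "europe-north1": 120.0, "asia-east1": 210.0}

print(pick_region(intensity, latency, max_latency_ms=150.0))
```

Returning `None` when no region qualifies forces the caller to decide explicitly whether latency or carbon wins, instead of silently falling back.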
The efficiency paradox (Jevons paradox) in AI. Every green inference strategy listed above reduces the cost per token. Historically, lower per-unit costs lead to dramatically higher total usage, often overwhelming the efficiency gains. If quantizing your model cuts cost by 4x, and your product team responds by serving 10x more requests (longer conversations, more features, broader rollout), total energy consumption increases 2.5x despite the per-token improvement. Sustainable AI requires pairing technical efficiency with organizational discipline: setting energy budgets, tracking total consumption (not just per-token efficiency), and making conscious decisions about how efficiency gains are reinvested.
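The arithmetic behind the 2.5x figure is a single division, and the same relation gives the volume growth an energy budget can absorb. This is a minimal sketch that assumes energy per request falls in proportion to the cost reduction; both helper names are illustrative.

```python
def total_energy_ratio(efficiency_gain: float, volume_growth: float) -> float:
    """New total energy divided by old total energy.

    efficiency_gain: factor by which energy per request shrinks (4.0 = 4x).
    volume_growth: factor by which request volume grows (10.0 = 10x).
    """
    return volume_growth / efficiency_gain


def max_volume_growth(efficiency_gain: float,
                      energy_budget_ratio: float = 1.0) -> float:
    """Largest volume multiplier that keeps total energy within the budget."""
    return efficiency_gain * energy_budget_ratio


print(total_energy_ratio(efficiency_gain=4.0, volume_growth=10.0))
print(max_volume_growth(efficiency_gain=4.0))
```

An energy budget expressed this way gives product teams a concrete ceiling: with a 4x efficiency gain and a flat budget, traffic can grow up to 4x before total consumption rises.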
55.3.2 Policy perspectives and regulatory requirements
The EU AI Act (covered in detail in Section 47.1) includes environmental disclosure requirements for providers of general-purpose AI (GPAI) models. Article 53 requires GPAI providers to document the energy consumption of model training, and for models classified as having systemic risk, to report the energy consumption of inference as well. The European AI Office has indicated that standardized carbon reporting templates will be published as part of the implementing regulations.
Beyond the EU, several other jurisdictions are developing AI environmental disclosure requirements. The US Executive Order on AI (October 2023) called for research into AI's environmental impact. The UK's AI Safety Institute has included energy consumption in its model evaluation framework. China's interim measures for generative AI require providers to "adopt measures to prevent environmental damage." For organizations operating globally, implementing carbon tracking now prepares you for regulatory requirements that are converging across jurisdictions.
Article 53(1)(a)-(b) of Regulation (EU) 2024/1689 requires every provider of a GPAI model placed on the EU market to maintain technical documentation that includes estimated energy consumption during training. The Commission's Implementing Regulation 2025 tightened the timeline:
- August 2, 2025: GPAI obligations under Article 53 became enforceable for models placed on the market after that date.
- August 2, 2026: GPAI models with systemic risk (Article 51, the >$10^{25}$ FLOP threshold) must additionally report aggregate per-query inference energy and carbon, with a standardized template published by the European AI Office.
- August 2, 2027: Pre-existing GPAI models (those on the market before August 2025) must be brought into compliance, including retroactive training-energy estimates.
The European AI Office's GPAI Code of Practice (signed by Anthropic, Google, OpenAI, and Mistral in 2025) specifies that disclosures must cite a methodology, and CodeCarbon, Climatiq, or Boavizta outputs (Section 55.2.2.3) are considered acceptable evidence when accompanied by a PUE assumption and grid-intensity source. Non-compliance carries fines of up to 3% of global annual turnover or €15M, whichever is higher (Article 101). See Regulation (EU) 2024/1689, Article 53.
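A disclosure that cites a methodology, a PUE assumption, and a grid-intensity source can be kept as a small structured record next to the model artifacts. The sketch below is one possible shape, not an official template: the class, its field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingEnergyDisclosure:
    model_version: str
    it_energy_kwh: float        # energy measured at the hardware level
    pue: float                  # assumed data-center overhead factor
    grid_g_co2_per_kwh: float   # grid carbon intensity used for conversion
    grid_source: str            # where the intensity figure came from
    methodology: str            # tool or method that produced it_energy_kwh

    @property
    def facility_energy_kwh(self) -> float:
        """IT energy scaled up by PUE to include cooling and distribution."""
        return self.it_energy_kwh * self.pue

    @property
    def emissions_t_co2(self) -> float:
        """kWh times g/kWh gives grams; divide by 1e6 for tonnes."""
        return self.facility_energy_kwh * self.grid_g_co2_per_kwh / 1e6

    def to_record(self) -> dict:
        """Flatten inputs and derived values into one dict for the tech file."""
        record = asdict(self)
        record["facility_energy_kwh"] = self.facility_energy_kwh
        record["emissions_t_co2"] = self.emissions_t_co2
        return record


# Hypothetical example values.
disclosure = TrainingEnergyDisclosure(
    model_version="example-1.0",
    it_energy_kwh=100_000.0,
    pue=1.2,
    grid_g_co2_per_kwh=400.0,
    grid_source="hypothetical grid operator report",
    methodology="CodeCarbon",
)
print(disclosure.to_record())
```

Storing the assumptions alongside the derived totals means an auditor can recompute the headline number from the record alone.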
Who: A sustainability officer and an MLOps engineer at a European AI company providing GPAI models to enterprise clients
Situation: The EU AI Act's GPAI provisions required the company to include energy consumption and carbon emission estimates in their model technical documentation. They had no existing carbon tracking infrastructure.
Problem: Training runs spanned multiple GPU clusters across two cloud regions, and inference serving added a separate, ongoing emissions stream. The sustainability officer had no way to produce the per-model-version emission figures that the technical documentation required.
Decision: They built an automated five-stage pipeline: (1) instrument all training runs with CodeCarbon, (2) log per-experiment emissions to a central experiment tracker, (3) aggregate emissions by model version for the GPAI technical documentation, (4) sample inference power draw across representative request distributions and extrapolate to total serving volume, and (5) report both Scope 2 (purchased electricity) and Scope 3 (hardware manufacturing, data center construction) emissions. The pipeline was integrated into the CI/CD workflow discussed in Section 42.3.
Result: Carbon reports were generated automatically for each model release. The first complete report took 3 weeks of engineering to build; subsequent reports required zero manual effort. The company used the data to shift 60% of training jobs to a Nordic data center, reducing emissions for the relocated jobs by roughly 14x.
Lesson: Automating carbon reporting into the CI/CD pipeline turns a regulatory obligation into actionable data that drives real emission reductions.
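Stages 3 and 4 of the pipeline in this case study (aggregate by model version, then extrapolate sampled inference energy to total volume) are small enough to sketch directly. The record layout and both function names below are hypothetical; a real pipeline would read the runs from its experiment tracker.

```python
from collections import defaultdict
from statistics import mean


def emissions_by_version(runs):
    """Sum per-run training emissions (kg CO2eq) for each model version."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["model_version"]] += run["emissions_kg"]
    return dict(totals)


def extrapolate_inference_kwh(sampled_wh_per_request, requests_served):
    """Scale the mean sampled energy per request (Wh) to total volume (kWh)."""
    return mean(sampled_wh_per_request) * requests_served / 1000.0


# Hypothetical tracker export: one entry per training run.
runs = [
    {"model_version": "v1", "emissions_kg": 10.0},
    {"model_version": "v1", "emissions_kg": 5.0},
    {"model_version": "v2", "emissions_kg": 7.5},
]
print(emissions_by_version(runs))
print(extrapolate_inference_kwh([0.5, 1.5], requests_served=1_000_000))
```

The extrapolation is only as good as the sample: the sampled requests need to reflect the production mix of prompt and output lengths.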
55.3.3 Practical checklist for Green AI
Patterson et al. (Google, 2022, "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink") published the carbon intensity of Google Cloud's training regions, and the gap is brutal: training the same model in us-central1 (Iowa, mostly coal and natural gas at 0.54 kg CO2/kWh in 2022) vs europe-north1 (Finland, 80 percent hydro and wind at 0.13 kg CO2/kWh) is a 4.2x difference in emissions for identical FLOPs, identical wall-clock, identical accuracy. The training bill in dollars is essentially the same because GPU rental is region-neutral; only the carbon-accounting column on the model card changes. The lesson that the checklist below codifies as step 2: "where you train" is a one-line change in your Terraform config that moves the carbon column by 4x. Most teams have never queried what their region's grid mix actually is, which is why the same checklist item is also the largest single-step win available.
- Reuse before retraining. Check if an existing pretrained model meets your needs. Use prompt engineering, LoRA, or QLoRA before considering full fine-tuning.
- Choose a green region. Select data center locations with low carbon intensity grids. Prefer regions with high renewable energy percentages.
- Use efficient architectures. Prefer MoE, sparse attention, or smaller models where quality requirements allow.
- Quantize for inference. Deploy models in INT4 or INT8 precision. The quality loss is often negligible; the energy savings are significant.
- Track emissions. Instrument every training run with CodeCarbon. Log emissions alongside accuracy metrics.
- Report FLOPs. Include compute cost in experiment reports. Enable Pareto-optimal model selection.
- Avoid unnecessary experiments. Use learning rate finders, small-scale ablations, and early stopping to reduce wasted compute.
- Optimize MFU. Higher model FLOPs utilization means less idle power draw. Profile and optimize data loading and communication overhead.
- Cache inference results. Semantic caching (described in Section 70.5) eliminates redundant computation.
- Prepare for regulation. Build carbon reporting into your model documentation pipeline for EU AI Act compliance.
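The MFU item in the checklist can be tracked from throughput alone. A common approximation charges about 6 FLOPs per parameter per training token (the same $6ND$ rule used in the exercises), which ignores attention FLOPs and so slightly understates utilization for long sequences. The helper below is a sketch under that assumption; its name and the example numbers are illustrative.

```python
def model_flops_utilization(n_params: float,
                            tokens_per_second: float,
                            n_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """Approximate training MFU as achieved model FLOP/s over peak FLOP/s.

    Uses the 6 * N FLOPs-per-token approximation for forward plus backward.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 1B parameters, 100k tokens/s on one 1 PFLOP/s accelerator.
mfu = model_flops_utilization(1e9, 1e5, n_gpus=1, peak_flops_per_gpu=1e15)
print(f"MFU: {mfu:.0%}")
```

Logging this value per run makes regressions visible: a drop in MFU at constant hardware means more GPU-hours, and more energy, for the same training result.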
Readers often focus exclusively on training costs and overlook inference costs. While training a large model is energy-intensive, inference at scale (millions of requests per day) can cumulatively exceed training costs over the model's lifetime. A model trained once but served for a year may consume 10x more energy in inference than in training. Always include inference projections in environmental impact assessments.
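A quick way to include inference in an assessment is to compute how many days of serving it takes for cumulative inference energy to match the one-time training energy. The sketch below uses illustrative inputs (the training energy, traffic, and per-request energy are all assumptions), and the function name is hypothetical.

```python
def days_until_inference_exceeds_training(training_mwh: float,
                                          requests_per_day: float,
                                          wh_per_request: float) -> float:
    """Days of serving after which cumulative inference energy equals training."""
    daily_inference_mwh = requests_per_day * wh_per_request / 1e6  # Wh -> MWh
    if daily_inference_mwh <= 0:
        raise ValueError("requests_per_day and wh_per_request must be positive")
    return training_mwh / daily_inference_mwh


# Illustrative: 1,850 MWh training run, 10M requests/day at 2 Wh each.
break_even = days_until_inference_exceeds_training(1850.0, 10e6, 2.0)
print(f"Inference matches training energy after {break_even:.1f} days")
print(f"Inference-to-training ratio over one year: {365 / break_even:.1f}x")
```

Under these example numbers inference overtakes training in about three months; higher traffic or a longer deployment pushes the yearly ratio toward the 10x mentioned above.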
Exercises
The carbon footprint of an LLM has three lifecycle phases. (a) Name them and rank them by typical CO2 contribution for a frontier model deployed at scale. (b) Why is the answer different for a research lab versus a major API provider? (c) What is the single number most commonly missing from public carbon disclosures, and what does it understate?
Answer Sketch
(a) Phases: pretraining, fine-tuning/post-training, and inference. For a deployed product at scale, inference dominates lifetime emissions because billions of inference calls compound, while pretraining is paid once. (b) For a research lab without a large user base, pretraining can dominate (one big training run, few inferences); for an API provider, inference is 90%+. (c) Embodied carbon of the GPUs themselves (manufacturing, transport, datacenter buildout) is rarely reported. It can add 20-40% to lifetime operational emissions and is invisible in PUE-only reporting.
A 70B model is trained on 15T tokens with the Chinchilla compute formula $C \approx 6ND$. Assume 40% MFU on H100s (~2 PFLOP/s peak fp16 at 700W; this is the sparsity-enabled datasheet figure, and the dense peak is roughly half of it, which would double the GPU-hour estimate). (a) Estimate total GPU-hours. (b) Estimate energy in MWh. (c) Convert to tCO2 assuming the US grid average of 0.4 kg CO2/kWh.
Answer Sketch
(a) FLOPs = 6 x 70e9 x 15e12 = 6.3e24. At 40% MFU on 2 PFLOP/s = 0.8 PFLOP/s effective per GPU. GPU-seconds = 6.3e24 / 0.8e15 = 7.9e9. GPU-hours = ~2.2 million. (b) Energy = 2.2e6 hours x 0.7 kW x PUE 1.2 (an added assumption; the question does not specify PUE) = ~1.85 GWh = 1850 MWh. (c) 1850 MWh x 0.4 t/MWh = 740 tCO2. Comparison: an average US car emits ~5 tCO2/year, so this single training run equates to ~150 car-years. Across all training runs and inference serving, frontier providers operate at roughly 100-1000x this scale annually.
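The same estimate can be scripted so that the assumptions (peak throughput, MFU, GPU power, PUE, grid intensity) are explicit parameters. This is a sketch of the calculation, with an illustrative function name; without the intermediate rounding to 2.2 million GPU-hours it gives about 1,840 MWh and 735 tCO2, slightly below the rounded figures in the answer.

```python
def training_footprint(n_params, n_tokens, peak_flops, mfu,
                       gpu_kw, pue, grid_kg_per_kwh):
    """Return (gpu_hours, energy_mwh, t_co2) for a C = 6*N*D training run."""
    total_flops = 6.0 * n_params * n_tokens
    gpu_seconds = total_flops / (peak_flops * mfu)
    gpu_hours = gpu_seconds / 3600.0
    energy_mwh = gpu_hours * gpu_kw * pue / 1000.0   # kWh -> MWh
    t_co2 = energy_mwh * grid_kg_per_kwh             # kg/kWh equals t/MWh
    return gpu_hours, energy_mwh, t_co2


# Inputs from the exercise; PUE 1.2 is the assumption used in the answer.
gpu_hours, energy_mwh, t_co2 = training_footprint(
    n_params=70e9, n_tokens=15e12, peak_flops=2e15, mfu=0.4,
    gpu_kw=0.7, pue=1.2, grid_kg_per_kwh=0.4,
)
print(f"{gpu_hours:,.0f} GPU-hours, {energy_mwh:,.1f} MWh, {t_co2:,.0f} tCO2")
```

Changing `grid_kg_per_kwh` or `mfu` shows directly how much each assumption moves the result, which is useful when the disclosed inputs are themselves estimates.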
Sketch a 6-line integration of codecarbon into a PyTorch training script that logs emissions per epoch and uploads the report at the end. State one limitation of codecarbon's estimates.
Answer Sketch
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(project_name="llm_run", save_to_file=True)
tracker.start()
for epoch in range(N): train_one_epoch(); tracker.flush() # logs per-epoch delta
emissions = tracker.stop() # kg CO2eq
upload_report("emissions.csv")
codecarbon integration: EmissionsTracker.start() begins NVML polling, the per-epoch flush() writes interval deltas, and stop() returns the cumulative kg CO2eq for the run. The limitation noted below the snippet (no PUE, no embodied carbon, no hourly grid intensity) explains why audit-grade reporting layers Electricity Maps on top.
Limitation: codecarbon reads GPU power via NVML and multiplies by a regional emissions factor; it doesn't capture cooling losses (PUE), embodied carbon, or grid-mix variation by hour. For audit-grade reporting, supplement with datacenter-level reports and a time-of-use grid-intensity API like Electricity Maps.
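Two of the gaps named in the limitation (PUE and hourly grid variation) can be closed with a post-processing step on top of measured energy. The sketch below assumes you already have hourly IT energy readings and a matching hourly intensity series from some time-of-use source; the function name, the default PUE of 1.2, and the sample numbers are assumptions, and no real API is called.

```python
def adjusted_emissions_kg(hourly_it_kwh, hourly_grid_g_per_kwh, pue=1.2):
    """Apply PUE and hour-by-hour grid intensity to measured IT energy.

    hourly_it_kwh: energy measured at the hardware level, one value per hour.
    hourly_grid_g_per_kwh: grid carbon intensity for the same hours.
    """
    if len(hourly_it_kwh) != len(hourly_grid_g_per_kwh):
        raise ValueError("energy and intensity series must cover the same hours")
    grams = sum(
        kwh * pue * intensity
        for kwh, intensity in zip(hourly_it_kwh, hourly_grid_g_per_kwh)
    )
    return grams / 1000.0


# Hypothetical two-hour job: same energy each hour, very different grid mix.
print(adjusted_emissions_kg([10.0, 10.0], [100.0, 500.0]))
```

Embodied carbon is not covered by this adjustment and still has to come from hardware or datacenter-level reports.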
Inference-per-token energy has fallen by ~10x in three years through quantization, KV cache optimization, and better batching. Total LLM-related electricity consumption has roughly tripled in the same period. Explain (a) why these two facts are consistent, (b) what economic phenomenon they illustrate, and (c) what intervention would actually reduce total emissions.
Answer Sketch
(a) Per-token energy fell, but cheaper tokens unlocked vastly more use cases (every PR review, every chat reply, every search), so total token volume grew faster than per-token cost fell. Total = volume x intensity, and volume won. (b) This is the Jevons paradox or rebound effect: efficiency gains in a desirable resource often increase total consumption rather than reducing it. (c) Pure efficiency improvement does not bound emissions. Effective interventions are demand-side or grid-side: carbon-aware compute scheduling (run training when grids are clean), pricing that includes carbon externality, and direct procurement of additional clean power capacity at the datacenter. Voluntary reduction by individual users does not scale.
For PEFT methods that cut training compute drastically, see Section 17.5. For inference-side optimizations (KV cache, quantization) that lower carbon per query, see Section 9.3. For evaluation methods that report cost and carbon alongside quality, see Section 42.3.
Carbon-aware scheduling is an emerging paradigm where training jobs are automatically routed to data centers with the lowest real-time carbon intensity, shifting compute across regions and time zones.
Early results from Google and Microsoft show 20 to 30% emission reductions with minimal latency impact.
Meanwhile, neuromorphic and optical computing architectures promise orders-of-magnitude improvements in energy efficiency for inference workloads, though they remain years from production readiness for LLM-scale models.
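The time-shifting half of carbon-aware scheduling can be sketched as a sliding-window search: given an hourly intensity forecast for one region, find the contiguous window of the job's length with the lowest average intensity. The forecast values and function name below are hypothetical, and production schedulers also weigh deadlines, capacity, and cross-region data movement.

```python
def cleanest_start_hour(forecast_g_per_kwh, job_hours):
    """Return (start_index, mean_intensity) of the cleanest contiguous window."""
    n = len(forecast_g_per_kwh)
    if job_hours <= 0 or job_hours > n:
        raise ValueError("job_hours must be between 1 and the forecast length")
    window = sum(forecast_g_per_kwh[:job_hours])
    best_start, best_sum = 0, window
    for start in range(1, n - job_hours + 1):
        # Slide the window one hour: add the new hour, drop the oldest.
        window += forecast_g_per_kwh[start + job_hours - 1]
        window -= forecast_g_per_kwh[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / job_hours


# Hypothetical six-hour forecast in g CO2/kWh.
forecast = [500, 400, 200, 100, 150, 450]
print(cleanest_start_hour(forecast, job_hours=2))
```

The same function can be run per region and the results compared, which combines the temporal search with the regional choice described above.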
- Training a single large LLM can emit hundreds of tons of CO2, comparable to the lifetime emissions of several automobiles.
- Training efficiency metrics (FLOP per token, tokens per kWh, performance per dollar) enable fair comparison of environmental impact across models and training runs.
- Key reduction strategies include using efficient architectures, training on cleaner energy grids, knowledge distillation, and choosing smaller models when they suffice.
- Carbon tracking tools (CodeCarbon, ML CO2 Impact) estimate emissions from compute usage, enabling teams to report and reduce their environmental footprint.
- The rebound effect means that efficiency gains can increase total consumption by making LLM usage cheaper and more widespread.
Exercises
Using the TrainingCarbonEstimate class from Section 55.1.1, estimate the carbon footprint for training a 13B parameter model on 1T tokens using (a) H100 GPUs in a US average grid, (b) A100 GPUs in the same grid, and (c) H100 GPUs in a Nordic data center (25 gCO2/kWh). Compare the results and identify which single factor has the largest impact on emissions.
Answer Sketch
Location dominates: moving from US average (400 gCO2/kWh) to Nordic (25 gCO2/kWh) yields a 16x reduction in emissions. Hardware generation (A100 to H100) provides roughly a 2 to 3x improvement. Both factors are multiplicative, so the greenest option combines newest hardware with cleanest grid.
Integrate CodeCarbon into a Hugging Face Trainer loop for fine-tuning GPT-2. Log emissions to W&B or MLflow. Run three experiments with different hyperparameters. Create a scatter plot of validation loss vs. CO2 emissions and identify the Pareto-optimal configuration.
Answer Sketch
Larger batch sizes improve MFU and reduce total training time, often producing lower emissions for the same number of epochs. The Pareto frontier typically shows a knee where further quality improvement requires disproportionately more compute.
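Identifying the Pareto-optimal configurations from the scatter plot is a dominance check: a run is on the frontier if no other run is at least as good on both validation loss and CO2 and strictly better on one. The sketch below uses made-up results for three runs; the function name and data are illustrative.

```python
def pareto_front(points):
    """Return names of runs not dominated on (val_loss, co2_kg); lower is better.

    points: iterable of (name, val_loss, co2_kg) tuples.
    """
    points = list(points)
    front = []
    for name, loss, co2 in points:
        dominated = any(
            other_loss <= loss and other_co2 <= co2
            and (other_loss < loss or other_co2 < co2)
            for _, other_loss, other_co2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical results: run "c" is worse than "b" on both axes.
results = [("a", 3.0, 1.0), ("b", 2.5, 2.0), ("c", 2.6, 3.0)]
print(pareto_front(results))
```

With only three runs the quadratic comparison is trivial; for large sweeps, sorting by one metric first is the usual optimization.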
The next section, Section 56.1: Platforms, surveys the responsible-AI platforms and toolkits that operationalize the carbon, fairness, and transparency obligations introduced in this chapter.