Section 55.3: Operating Under Compliance

"A model card without an emissions column is no longer a complete model card; the regulator counts those numbers even when the engineer does not."
An Audit-Ready Auditor, Compliance-Conscious AI Agent

Big Picture

Measurement (Section 55.1) and mitigation (Section 55.2) are technical disciplines. This section turns them into operating practice: how to profile energy and carbon at the experiment level using tokens-per-joule, which inference strategies actually move the needle once the model is in production, what the EU AI Act's Article 53 GPAI provisions require of model providers in 2026, and a 10-point Green-AI checklist that codifies the largest wins (the 4x region swap, INT4 quantization, semantic caching, MFU tuning) into a single deployment review.

Fun Fact

Did You Know: The first EU AI Act fine will be a paperwork fine

Figure 55.3.1: EU AI Act Article 53 enforcement timeline for GPAI providers, 2024 through 2027. The first two milestones are documentation only; the real teeth come in 2026 with high-risk conformity assessments and 2027 with full penalty exposure.

The Article 53 GPAI obligations that take effect in 2025-2027 are almost entirely disclosure rules: keep a technical file, publish a copyright-policy summary, report energy use. The first enforcement actions will not be over a model behaving badly; they will be over a vendor failing to publish the right document by the right date. Compliance teams who treat the Act like a safety regulation will be surprised; those who treat it like SOX-for-models will be ready.

Prerequisites

This section assumes the measurement vocabulary from Section 55.1: Quantifying the Environmental Cost and the mitigation strategies and carbon-tracking tools from Section 55.2: Reducing the Footprint. Familiarity with the EU AI Act compliance landscape from Section 53.1 is helpful for the regulatory discussion in 55.3.2.

55.3.1 Energy and carbon profiling at the experiment level

Moving beyond aggregate estimates, modern tooling enables per-experiment and per-inference carbon profiling. CodeCarbon (introduced in Section 55.2.2.1) provides the instrumentation layer, but two additional resources help contextualize your measurements: the ML.ENERGY Leaderboard, which ranks popular LLMs by energy consumption per output token across standardized benchmarks, and the MELODI benchmark, which evaluates models on a joint accuracy-per-joule metric. Together these tools let teams make informed model selection decisions that balance quality against environmental cost.

55.3.1.1 Tokens-per-Joule as an Efficiency Metric

While tokens-per-kWh is useful for training, inference efficiency is better captured at finer granularity. Tokens per joule (T/J) measures how many output tokens a serving system produces per joule of energy consumed. This metric normalizes across hardware, batch sizes, and request patterns, making it possible to compare a quantized 7B model on consumer GPUs against a dense 70B model on H100 clusters. Higher T/J is better.
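The conversion behind the metric is plain unit arithmetic, since one watt-hour equals 3,600 joules. The sketch below shows it in Python; the function names are illustrative and not taken from any library, and the input figures are assumed for the example.

```python
JOULES_PER_WH = 3600.0  # 1 Wh = 3600 J


def tokens_per_joule(tokens: int, energy_wh: float) -> float:
    """Output tokens produced per joule of energy consumed."""
    if energy_wh <= 0:
        raise ValueError("energy_wh must be positive")
    return tokens / (energy_wh * JOULES_PER_WH)


def co2_grams(energy_wh: float, grid_g_per_kwh: float) -> float:
    """Operational CO2 in grams for the given energy and grid carbon intensity."""
    return energy_wh / 1000.0 * grid_g_per_kwh


# Illustrative input: 1,000 output tokens measured at 1.0 Wh,
# on a grid assumed to emit 400 gCO2/kWh.
tpj = tokens_per_joule(tokens=1000, energy_wh=1.0)
grams = co2_grams(energy_wh=1.0, grid_g_per_kwh=400.0)
print(f"{tpj:.3f} tokens/J, {grams:.2f} g CO2 per 1K tokens")
```

With these inputs, 1.0 Wh per 1,000 tokens works out to about 0.28 tokens per joule and 0.40 g of CO₂ per 1,000 tokens.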

Table 55.3.1a: Inference Energy Efficiency by Model Size (Approximate) (as of 2026).

Model	Parameters	Quantization	Energy per 1K tokens	CO₂ per 1K tokens (US avg grid)	Tokens/Joule
Llama-3 8B	8B	INT4 (GPTQ)	~0.4 Wh	~0.16 g	~0.69
Llama-3 8B	8B	FP16	~1.0 Wh	~0.40 g	~0.28
Llama-3 70B	70B	INT4 (AWQ)	~3.2 Wh	~1.28 g	~0.087
Llama-3 70B	70B	FP16	~8.0 Wh	~3.20 g	~0.035
Mixtral 8x7B (MoE)	47B (13B active)	FP16	~1.8 Wh	~0.72 g	~0.15
GPT-4o (API)	Undisclosed	Provider-managed	~5.0 Wh (est.)	~2.00 g (est.)	~0.056 (est.)

Note that these figures are approximate and vary with hardware, batch size, sequence length, and serving framework. The key takeaway is the magnitude of difference: in this table INT4 quantization raises tokens-per-joule by roughly 2.5x, and the 8B model is about 8x more energy-efficient per token than the 70B model at matched precision (about 20x when INT4 8B is compared with FP16 70B).

55.3.1.2 Green Inference Strategies

Five strategies reduce inference energy consumption without retraining:

Model distillation. A 7B student distilled from a 70B teacher can deliver 80 to 90% of the teacher's quality at roughly one-tenth the energy per request. See Section 17.5.
Post-training quantization. INT4/INT8 quantization (via GPTQ, AWQ, or bitsandbytes) cuts memory bandwidth and energy by 2 to 4x with minimal quality degradation. See Section 9.3.
Batch scheduling. Accumulating requests into larger batches improves GPU utilization. Continuous batching (vLLM, TGI) keeps the GPU occupied rather than idle between requests.
Semantic caching. Caching responses to frequently asked or semantically similar queries eliminates redundant computation entirely. See Section 70.5.
Region-aware routing. Route inference requests to data center regions with the lowest real-time carbon intensity. Cloud providers expose carbon-intensity APIs that enable dynamic routing decisions.
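As an illustration of the last strategy, region-aware routing can be reduced to a small selection rule: among the regions that meet a latency budget, pick the one with the lowest current carbon intensity. The sketch below is a minimal version of that rule; the `Region` type, the region names, and all numbers are hypothetical, and a real deployment would fill them from a carbon-intensity feed and its own latency measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    carbon_g_per_kwh: float  # current grid carbon intensity
    latency_ms: float        # expected round-trip latency to the caller


def pick_region(regions: list[Region], max_latency_ms: float) -> Region:
    """Return the lowest-carbon region that still meets the latency budget."""
    eligible = [r for r in regions if r.latency_ms <= max_latency_ms]
    if not eligible:
        # No region meets the budget: fall back to the fastest one.
        return min(regions, key=lambda r: r.latency_ms)
    return min(eligible, key=lambda r: r.carbon_g_per_kwh)


# Hypothetical snapshot of three regions.
snapshot = [
    Region("region-a", carbon_g_per_kwh=540.0, latency_ms=40.0),
    Region("region-b", carbon_g_per_kwh=130.0, latency_ms=95.0),
    Region("region-c", carbon_g_per_kwh=25.0, latency_ms=180.0),
]
chosen = pick_region(snapshot, max_latency_ms=120.0)
print(chosen.name)  # region-b
```

Treating latency as a hard constraint and carbon as the objective keeps the rule easy to audit; a weighted score is an alternative when the trade-off needs to be softer.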

Key Insight

The efficiency paradox (Jevons paradox) in AI. Every green inference strategy listed above reduces the cost per token. Historically, lower per-unit costs lead to dramatically higher total usage, often overwhelming the efficiency gains. If quantizing your model cuts cost by 4x, and your product team responds by serving 10x more requests (longer conversations, more features, broader rollout), total energy consumption increases 2.5x despite the per-token improvement. Sustainable AI requires pairing technical efficiency with organizational discipline: setting energy budgets, tracking total consumption (not just per-token efficiency), and making conscious decisions about how efficiency gains are reinvested.
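The arithmetic in this example is simply volume growth divided by efficiency gain. A small sketch (function names are illustrative) makes the relationship explicit and shows how an energy budget translates into a cap on volume growth:

```python
def total_energy_ratio(efficiency_gain: float, volume_growth: float) -> float:
    """Ratio of new total energy to old total energy.

    efficiency_gain: factor by which energy per request falls (4.0 means 4x less).
    volume_growth: factor by which request volume rises (10.0 means 10x more).
    """
    return volume_growth / efficiency_gain


def max_volume_within_budget(efficiency_gain: float, budget_ratio: float = 1.0) -> float:
    """Largest volume growth that keeps total energy at or below budget_ratio."""
    return efficiency_gain * budget_ratio


print(total_energy_ratio(efficiency_gain=4.0, volume_growth=10.0))  # 2.5
print(max_volume_within_budget(efficiency_gain=4.0))                # 4.0
```

In other words, a 4x efficiency gain only holds total energy flat if request volume grows by no more than 4x.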

55.3.2 Policy perspectives and regulatory requirements

The EU AI Act (covered in detail in Section 47.1) includes environmental disclosure requirements for providers of general-purpose AI (GPAI) models. Article 53 requires GPAI providers to document the energy consumption of model training, and for models classified as having systemic risk, to report the energy consumption of inference as well. The European AI Office has indicated that standardized carbon reporting templates will be published as part of the implementing regulations.

Beyond the EU, several other jurisdictions are developing AI environmental disclosure requirements. The US Executive Order on AI (October 2023) called for research into AI's environmental impact. The UK's AI Safety Institute has included energy consumption in its model evaluation framework. China's interim measures for generative AI require providers to "adopt measures to prevent environmental damage." For organizations operating globally, implementing carbon tracking now prepares you for regulatory requirements that are converging across jurisdictions.

Warning

EU AI Act Article 53 GPAI Environmental Disclosure (2026 Timelines)

Article 53(1)(a)-(b) of Regulation (EU) 2024/1689 requires the provider of every GPAI model placed on the EU market to maintain technical documentation that includes estimated energy consumption during training. The Commission's Implementing Regulation 2025 tightened the timeline:

August 2, 2025: GPAI obligations under Article 53 became enforceable for models placed on the market after that date.
August 2, 2026: GPAI models with systemic risk (Article 51, the >10²⁵ FLOP threshold) must additionally report aggregate per-query inference energy and carbon, with a standardized template published by the European AI Office.
August 2, 2027: Pre-existing GPAI models (those on the market before August 2025) must be brought into compliance, including retroactive training-energy estimates.

The European AI Office's GPAI Code of Practice (signed by Anthropic, Google, OpenAI, and Mistral in 2025) specifies that disclosures must cite a methodology, and CodeCarbon, Climatiq, or Boavizta outputs (Section 55.2.2.3) are considered acceptable evidence when accompanied by a PUE assumption and grid-intensity source. Non-compliance carries fines of up to 3% of global annual turnover or €15M, whichever is higher (Article 101). See Regulation (EU) 2024/1689, Article 53.
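One way to keep the methodology, PUE assumption, and grid-intensity source attached to every reported number is to store them in the same record as the measurement. The sketch below is a hypothetical internal record, not the European AI Office template; every field name and value is an assumption made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EnergyDisclosure:
    model_version: str
    it_energy_kwh: float   # measured at the hardware (e.g. tracker output)
    pue: float             # assumed power usage effectiveness of the facility
    grid_g_per_kwh: float  # grid carbon intensity used for the estimate
    grid_source: str       # where the intensity figure came from
    methodology: str       # tool and version used for measurement

    @property
    def facility_energy_kwh(self) -> float:
        """IT energy scaled by PUE to include facility overhead."""
        return self.it_energy_kwh * self.pue

    @property
    def emissions_kg(self) -> float:
        """Operational emissions in kg CO2eq."""
        return self.facility_energy_kwh * self.grid_g_per_kwh / 1000.0

    def to_record(self) -> dict:
        """Flatten stored fields plus derived values for export."""
        record = asdict(self)
        record["facility_energy_kwh"] = self.facility_energy_kwh
        record["emissions_kg"] = self.emissions_kg
        return record


# All values below are placeholders for illustration.
disclosure = EnergyDisclosure(
    model_version="example-model-v1",
    it_energy_kwh=10_000.0,
    pue=1.2,
    grid_g_per_kwh=400.0,
    grid_source="placeholder: name the dataset or API and its date",
    methodology="placeholder: tracker name and version",
)
print(disclosure.to_record())
```

Deriving facility energy and emissions from the stored assumptions, rather than storing them separately, means a reviewer can always recompute the headline figure from the record itself.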

Real-World Scenario

Building a Carbon Reporting Pipeline for EU AI Act Compliance

Who: A sustainability officer and an MLOps engineer at a European AI company providing GPAI models to enterprise clients

Situation: The EU AI Act's GPAI provisions required the company to include energy consumption and carbon emission estimates in their model technical documentation. They had no existing carbon tracking infrastructure.

Problem: Training runs spanned multiple GPU clusters across two cloud regions, and inference serving added a separate, ongoing emissions stream. The sustainability officer had no way to produce the per-model-version emission figures that the technical documentation required.

Decision: They built an automated five-stage pipeline: (1) instrument all training runs with CodeCarbon, (2) log per-experiment emissions to a central experiment tracker, (3) aggregate emissions by model version for the GPAI technical documentation, (4) sample inference power draw across representative request distributions and extrapolate to total serving volume, and (5) report both Scope 2 (purchased electricity) and Scope 3 (hardware manufacturing, data center construction) emissions. The pipeline was integrated into the CI/CD workflow discussed in Section 42.3.

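Result: Carbon reports were generated automatically for each model release. The first complete report took 3 weeks of engineering to build; subsequent reports required zero manual effort. The company used the data to shift 60% of training jobs to a Nordic data center, reducing emissions for the relocated jobs by 14x. Because the remaining 40% of jobs still ran on the original grid, the fleet-wide reduction was smaller: roughly 2.3x if the jobs are of similar size.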

Lesson: Automating carbon reporting into the CI/CD pipeline turns a regulatory obligation into actionable data that drives real emission reductions.
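Stages 3 and 4 of the pipeline described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the record layout, version labels, and sample values are hypothetical, and the inference extrapolation assumes the sampled requests are representative of production traffic.

```python
from collections import defaultdict

# Hypothetical per-run records, as an experiment tracker might export them.
runs = [
    {"model_version": "v1", "region": "region-a", "emissions_kg": 120.0},
    {"model_version": "v1", "region": "region-b", "emissions_kg": 30.0},
    {"model_version": "v2", "region": "region-b", "emissions_kg": 45.0},
]


def training_emissions_by_version(records: list[dict]) -> dict[str, float]:
    """Stage 3: sum per-run emissions for each model version."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["model_version"]] += rec["emissions_kg"]
    return dict(totals)


def extrapolate_inference_kg(
    sampled_wh_per_request: list[float],
    total_requests: int,
    grid_g_per_kwh: float,
) -> float:
    """Stage 4: scale the mean sampled energy per request to total volume."""
    mean_wh = sum(sampled_wh_per_request) / len(sampled_wh_per_request)
    total_kwh = mean_wh * total_requests / 1000.0
    return total_kwh * grid_g_per_kwh / 1000.0  # grams -> kilograms


print(training_emissions_by_version(runs))
print(extrapolate_inference_kg([0.8, 1.0, 1.2], total_requests=1_000_000, grid_g_per_kwh=400.0))
```

Here the training totals come to 150 kg for v1 and 45 kg for v2, and one million requests at a mean of 1.0 Wh on a 400 gCO₂/kWh grid extrapolate to about 400 kg.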

55.3.3 Practical checklist for Green AI

Key Insight: Aha Moment: The 4x Region Swap

Patterson et al. (Google, 2022, "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink") published the carbon intensity of Google Cloud's training regions, and the gap is stark: training the same model in us-central1 (Iowa, mostly coal and natural gas at 0.54 kg CO2/kWh in 2022) versus europe-north1 (Finland, 80 percent hydro and wind at 0.13 kg CO2/kWh) is a 4.2x difference in emissions (0.54 / 0.13) for identical FLOPs, identical wall-clock time, and identical accuracy. The training bill in dollars is roughly the same when GPU rental prices are comparable across regions; only the carbon-accounting column on the model card changes. The lesson, which the checklist below codifies as step 2, is that "where you train" is often a one-line change in your Terraform config that moves the carbon column by 4x. Most teams have never checked what their region's grid mix actually is, which is why this checklist item is also the largest single-step win available.

1. Reuse before retraining. Check if an existing pretrained model meets your needs. Use prompt engineering, LoRA, or QLoRA before considering full fine-tuning.
2. Choose a green region. Select data center locations with low carbon intensity grids. Prefer regions with high renewable energy percentages.
3. Use efficient architectures. Prefer MoE, sparse attention, or smaller models where quality requirements allow.
4. Quantize for inference. Deploy models in INT4 or INT8 precision. The quality loss is often negligible; the energy savings are significant.
5. Track emissions. Instrument every training run with CodeCarbon. Log emissions alongside accuracy metrics.
6. Report FLOPs. Include compute cost in experiment reports. Enable Pareto-optimal model selection.
7. Avoid unnecessary experiments. Use learning rate finders, small-scale ablations, and early stopping to reduce wasted compute.
8. Optimize MFU. Higher model FLOPs utilization means less idle power draw. Profile and optimize data loading and communication overhead.
9. Cache inference results. Semantic caching (described in Section 70.5) eliminates redundant computation.
10. Prepare for regulation. Build carbon reporting into your model documentation pipeline for EU AI Act compliance.
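For the MFU item in the checklist above, a rough estimate can be computed from training throughput alone. The sketch below uses the common approximation of 6 FLOPs per parameter per training token (the same C ≈ 6ND rule used in Exercise 55.3.2); the function name and the example run are hypothetical, and the per-GPU peak should be taken from the vendor datasheet for the precision actually in use.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate training MFU as achieved model FLOP/s over aggregate peak FLOP/s.

    Uses the approximation of 6 FLOPs per parameter per training token.
    """
    achieved = 6.0 * n_params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical run: 7e9 parameters, 400,000 tokens/s across 64 GPUs,
# with an assumed peak of 1e15 FLOP/s per GPU.
mfu = model_flops_utilization(7e9, 400_000, 64, 1e15)
print(f"MFU = {mfu:.1%}")
```

Logging this figure next to emissions makes it easier to tell whether a high-emission run was doing useful work or waiting on data loading and communication.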

Warning: Common Misconception

Readers often focus exclusively on training costs and overlook inference costs. While training a large model is energy-intensive, inference at scale (millions of requests per day) can cumulatively exceed training costs over the model's lifetime. A model trained once but served for a year may consume 10x more energy in inference than in training. Always include inference projections in environmental impact assessments.
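A quick break-even estimate shows how fast inference can overtake training. The sketch below is illustrative only: the function names are made up for this example, and the training energy, request volume, and per-request energy are assumed values rather than measurements.

```python
def days_until_inference_exceeds_training(
    training_mwh: float,
    requests_per_day: float,
    wh_per_request: float,
) -> float:
    """Days of serving after which cumulative inference energy equals training energy."""
    daily_mwh = requests_per_day * wh_per_request / 1_000_000.0  # Wh -> MWh
    return training_mwh / daily_mwh


def inference_to_training_ratio(
    training_mwh: float,
    requests_per_day: float,
    wh_per_request: float,
    days: int = 365,
) -> float:
    """Cumulative inference energy over the period, as a multiple of training energy."""
    daily_mwh = requests_per_day * wh_per_request / 1_000_000.0
    return daily_mwh * days / training_mwh


# Assumed deployment: 1,850 MWh training run, 50 million requests per day,
# 1 Wh per request.
breakeven_days = days_until_inference_exceeds_training(1850.0, 50_000_000, 1.0)
yearly_ratio = inference_to_training_ratio(1850.0, 50_000_000, 1.0)
print(f"break-even after {breakeven_days:.0f} days; {yearly_ratio:.1f}x training energy in a year")
```

Under these assumptions inference matches the training energy after 37 days and reaches roughly ten times that amount within a year.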

Exercises

Exercise 55.3.1: Carbon Footprint Components (Conceptual)

The carbon footprint of an LLM has three lifecycle phases. (a) Name them and rank them by typical CO2 contribution for a frontier model deployed at scale. (b) Why is the answer different for a research lab versus a major API provider? (c) What is the single number most commonly missing from public carbon disclosures, and what does it understate?

Answer Sketch

(a) Phases: pretraining, fine-tuning/post-training, and inference. For a deployed product at scale, inference dominates lifetime emissions because billions of inference calls compound, while pretraining is paid once. (b) For a research lab without a large user base, pretraining can dominate (one big training run, few inferences); for an API provider, inference is 90%+. (c) Embodied carbon of the GPUs themselves (manufacturing, transport, datacenter buildout) is rarely reported. It can add 20-40% to lifetime operational emissions and is invisible in PUE-only reporting.

Exercise 55.3.2: Estimate Pretraining CO2 (Calculation)

A 70B model is trained on 15T tokens with the Chinchilla compute formula $C \approx 6ND$. Assume 40% MFU on H100s (~2 PFLOP/s peak fp16 at 700W). (a) Estimate total GPU-hours. (b) Estimate energy in MWh. (c) Convert to tCO2 assuming the US grid average of 0.4 kg CO2/kWh.

Answer Sketch

(a) FLOPs = 6 x 70e9 x 15e12 = 6.3e24. At 40% MFU on 2 PFLOP/s, each GPU delivers 0.8 PFLOP/s effective. GPU-seconds = 6.3e24 / 0.8e15 = 7.9e9, so GPU-hours = ~2.2 million. (b) Assuming a PUE of 1.2 (not given in the exercise statement), energy = 2.2e6 hours x 0.7 kW x 1.2 = ~1.85 GWh = 1850 MWh. (c) 1850 MWh x 0.4 t/MWh = 740 tCO2. Comparison: an average US car emits ~5 tCO2/year, so this single training run equates to ~150 car-years. Scaling and inference operate at 100-1000x this annually for frontier providers.
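The same calculation can be checked in a few lines of Python. The constants mirror the exercise statement, except for the PUE of 1.2, which is the assumption introduced in the answer sketch.

```python
N_PARAMS = 70e9          # model parameters
N_TOKENS = 15e12         # training tokens
PEAK_FLOPS = 2e15        # per-GPU peak from the exercise statement (FLOP/s)
MFU = 0.40               # model FLOPs utilization
GPU_POWER_KW = 0.7       # 700 W per GPU
PUE = 1.2                # facility overhead assumed in the answer sketch
GRID_KG_PER_KWH = 0.4    # US grid average from the exercise statement

total_flops = 6 * N_PARAMS * N_TOKENS                 # C ≈ 6ND
gpu_hours = total_flops / (PEAK_FLOPS * MFU) / 3600   # seconds -> hours
energy_mwh = gpu_hours * GPU_POWER_KW * PUE / 1000    # kWh -> MWh
emissions_t = energy_mwh * 1000 * GRID_KG_PER_KWH / 1000  # kg -> tonnes

print(f"{total_flops:.2e} FLOPs")
print(f"{gpu_hours / 1e6:.2f} million GPU-hours")
print(f"{energy_mwh:.0f} MWh, {emissions_t:.0f} tCO2")
```

Without rounding the intermediate GPU-hours, the script gives about 1,840 MWh and 735 tCO2, slightly below the rounded 1,850 MWh and 740 tCO2 above.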

Exercise 55.3.3: Add Carbon Tracking to a Training Loop (Code Tweak)

Sketch a 6-line integration of codecarbon into a PyTorch training script that logs emissions per epoch and uploads the report at the end. State one limitation of codecarbon's estimates.

Answer Sketch

from codecarbon import EmissionsTracker

# N, train_one_epoch, and upload_report are placeholders defined elsewhere in the training script.
tracker = EmissionsTracker(project_name="llm_run", save_to_file=True)
tracker.start()
for epoch in range(N):
    train_one_epoch()
    tracker.flush()  # persist emissions measured so far; the tracker keeps running
emissions = tracker.stop()  # cumulative kg CO2eq for the run
upload_report("emissions.csv")

Code Fragment 55.3.1b: A compact codecarbon integration: EmissionsTracker.start() begins power polling (NVML on NVIDIA GPUs), the per-epoch flush() persists the emissions measured so far without stopping the tracker, and stop() returns the cumulative kg CO2eq for the run. The limitation noted below the snippet (no PUE, no embodied carbon, no hourly grid intensity) explains why audit-grade reporting layers Electricity Maps on top.

Limitation: codecarbon reads GPU power via NVML and multiplies by a regional emissions factor; it doesn't capture cooling losses (PUE), embodied carbon, or grid-mix variation by hour. For audit-grade reporting, supplement with datacenter-level reports and a time-of-use grid-intensity API like Electricity Maps.

Exercise 55.3.4: The Rebound Effect (Failure Mode)

Inference-per-token energy has fallen by ~10x in three years through quantization, KV cache optimization, and better batching. Total LLM-related electricity consumption has roughly tripled in the same period. Explain (a) why these two facts are consistent, (b) what economic phenomenon they illustrate, and (c) what intervention would actually reduce total emissions.

Answer Sketch

(a) Per-token energy fell, but cheaper tokens unlocked vastly more use cases (every PR review, every chat reply, every search), so total token volume grew faster than per-token cost fell. Total = volume x intensity, and volume won. (b) This is the Jevons paradox or rebound effect: efficiency gains in a desirable resource often increase total consumption rather than reducing it. (c) Pure efficiency improvement does not bound emissions. Effective interventions are demand-side or grid-side: carbon-aware compute scheduling (run training when grids are clean), pricing that includes carbon externality, and direct procurement of additional clean power capacity at the datacenter. Voluntary reduction by individual users does not scale.

Exercise 55.3.5: Carbon Footprint Estimation

Using the TrainingCarbonEstimate class from Section 55.1.1, estimate the carbon footprint of training a 13B parameter model on 1T tokens using (a) H100 GPUs on the US average grid, (b) A100 GPUs on the same grid, and (c) H100 GPUs in a Nordic data center (25 gCO₂/kWh). Compare the results and identify which single factor has the largest impact on emissions.

Answer Sketch

Location dominates: moving from US average (400 gCO₂/kWh) to Nordic (25 gCO₂/kWh) yields a 16x reduction in emissions. Hardware generation (A100 to H100) provides roughly a 2 to 3x improvement. Both factors are multiplicative, so the greenest option combines newest hardware with cleanest grid.

Exercise 55.3.6: Carbon-Aware Experiment Management

Integrate CodeCarbon into a Hugging Face Trainer loop for fine-tuning GPT-2. Log emissions to W&B or MLflow. Run three experiments with different hyperparameters. Create a scatter plot of validation loss vs. CO₂ emissions and identify the Pareto-optimal configuration.

Answer Sketch

Larger batch sizes improve MFU and reduce total training time, often producing lower emissions for the same number of epochs. The Pareto frontier typically shows a knee where further quality improvement requires disproportionately more compute.
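Identifying the Pareto-optimal configurations is a small dominance check once each run has a (validation loss, CO₂) pair. The sketch below assumes lower is better on both axes; the run names and numbers are hypothetical placeholders for your own logged results.

```python
def pareto_front(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations not dominated on (val_loss, co2_kg).

    A point is dominated if another point is no worse on both metrics
    and strictly better on at least one. Lower is better for both.
    """
    front = []
    for name, loss, co2 in points:
        dominated = any(
            (l2 <= loss and c2 <= co2) and (l2 < loss or c2 < co2)
            for _, l2, c2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical results from three fine-tuning runs: (name, val_loss, co2_kg).
results = [
    ("run-a", 3.10, 0.020),
    ("run-b", 3.05, 0.035),
    ("run-c", 3.12, 0.030),
]
print(pareto_front(results))  # ['run-a', 'run-b']
```

Here run-c is dominated by run-a (worse loss and higher emissions), while run-a and run-b trade loss against emissions and both stay on the frontier.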

What Comes Next

The next section, Section 56.1: Platforms, surveys the responsible-AI platforms and toolkits that operationalize the carbon, fairness, and transparency obligations introduced in this chapter.

Further Reading

Policy, profiling, and operations

European Parliament. (2024). "Regulation (EU) 2024/1689: Artificial Intelligence Act." Article 53. Article 53 of the EU AI Act requires providers of general-purpose AI models to report estimated energy consumption during training. A concrete example of how environmental reporting is becoming a regulatory obligation.

European Commission. (2025). Implementing Regulation laying down rules for the application of Regulation (EU) 2024/1689 as regards general-purpose AI models. The Commission's 2025 implementing acts that set the August 2025 / 2026 / 2027 enforcement timelines for Article 53 GPAI obligations and specify acceptable carbon-accounting evidence.

Lannelongue, L., Grealey, J., and Inouye, M. (2021). "Green Algorithms: Quantifying the Carbon Footprint of Computation." Advanced Science, 8(12). Introduces an online calculator for estimating the carbon footprint of computational workloads. A practical tool for researchers and teams wanting to quantify and report the environmental impact of their AI experiments.

ML.ENERGY. "ML.ENERGY Leaderboard." Open inference-side measurement leaderboard, ranking popular LLMs by joules-per-token on standardized hardware. The 2025 snapshot covers Llama-3, Mistral, Mixtral, DeepSeek-V3, and Qwen variants.

Luccioni, A. S., Jernite, Y., and Strubell, E. (2024). "Power Hungry Processing: Watts Driving the Cost of AI Deployment?" FAccT. Benchmarks per-query inference energy across modalities; the empirical anchor for the tokens-per-joule table in Section 55.3.1.1.