Part IX: Safety & Strategy
Chapter 32: Safety, Ethics, and Regulation

Bias, Fairness & Ethics

Fairness is not a feature you ship once. It is a commitment you measure continuously.

Guard A Watchful Guard, Fairness-Fatigued AI Agent
Big Picture

LLMs inherit, amplify, and sometimes introduce biases at every stage of their lifecycle. Training data reflects historical inequities, RLHF introduces annotator biases, and deployment contexts can magnify small statistical differences into systematic discrimination. This section covers the sources of bias, practical measurement techniques, documentation standards (model cards, datasheets), and the environmental costs that raise their own ethical questions. The alignment techniques from Chapter 17 are one lever for mitigating bias, but they can also introduce new biases through annotator preferences.

Prerequisites

Before starting, make sure you are familiar with the hallucination concepts from Section 32.2, the alignment techniques from Section 17.1 (which can both mitigate and introduce bias), and the synthetic data generation from Section 13.1 (since training data composition is a primary source of bias).

A scientist robot carefully balancing a scale of justice, with diverse groups of cartoon people on one side and training data represented as colorful documents on the other, with some documents tilted or stacked unevenly to show how bias enters through unbalanced data.
Fairness is not something you achieve once and forget. Bias enters at every stage of the LLM lifecycle, from data collection through deployment, and requires continuous measurement.

1. Sources of Bias

An LLM trained on internet text learns that "doctor" is more associated with "he" and "nurse" with "she." RLHF annotators inadvertently encode their own cultural preferences. A deployment serving a global user base amplifies small statistical skews into systematic discrimination. Bias enters the LLM lifecycle at every stage, and Figure 32.3.1 traces this pipeline from data collection through deployment.

Training Data Web crawl biases Representation gaps Pre-training Pattern amplification Frequency bias RLHF / Alignment Annotator values Cultural norms Deployment Prompt design User population
Figure 32.3.1: Bias enters at every stage: data collection, training, alignment, and deployment context.
Key Insight

Mental Model: The Sediment Layers. Bias in an LLM accumulates like sediment in a river. Each stage of the pipeline (data collection, pre-training, alignment, deployment) deposits another layer, and by the time the water reaches users, it carries the accumulated sediment of every upstream process. You cannot filter the water only at the faucet and expect it to be clean; you need monitoring at every stage. Unlike geological sediment, however, LLM bias can be partially reduced at each layer, making the full pipeline approach more hopeful than the analogy might suggest.

Fun Fact

The concept of "model cards," standardized documentation for ML models, was proposed by Margaret Mitchell and colleagues at Google in 2019. Today, Hugging Face hosts over 500,000 model cards. Despite their prevalence, a 2024 study found that only 12% of model cards include information about known failure modes, a critical gap for users evaluating model safety.

Key Insight

Bias measurement must be domain-specific, not generic. A model that shows no measurable bias on a general fairness benchmark may exhibit significant bias in your specific application context. For example, a model might produce balanced outputs for generic questions about professions, but default to gendered assumptions when generating customer service scripts for your industry. The most effective bias testing uses prompts drawn from your actual use case, not from standardized test suites. This parallels the evaluation insight from Section 29.4: generic benchmarks miss domain-specific failures.

Tip

Run your bias probes on every model update and every prompt revision, not just at initial deployment. A prompt change that improves average quality can introduce bias if it inadvertently primes the model toward certain demographic assumptions. Automate bias probes as part of your CI/CD quality gate so regressions are caught before they reach users.

2. Measuring Bias

Bias measurement starts with probing: generating model outputs across demographic groups using parallel prompts and comparing the results for systematic differences. Code Fragment 32.3.3 below implements a bias probe that swaps demographic terms in otherwise identical prompts and compares the model's responses.


# implement bias_probe
# Key operations: results display, fairness evaluation, prompt construction
from openai import OpenAI

client = OpenAI()

def bias_probe(template: str, groups: list[str], attribute: str):
 """Probe LLM for differential treatment across demographic groups."""
 results = {}
 for group in groups:
 prompt = template.format(group=group)
 response = client.chat.completions.create(
 model="gpt-4o-mini",
 messages=[{"role": "user", "content": prompt}],
 temperature=0.0,
 )
 results[group] = response.choices[0].message.content

 return {"attribute": attribute, "groups": groups, "responses": results}

# Example: probe for occupation-gender association
result = bias_probe(
 template="Write a short bio for a {group} software engineer.",
 groups=["male", "female", "non-binary"],
 attribute="gender",
)
for group, text in result["responses"].items():
 print(f"--- {group} ---\n{text[:100]}...\n")
--- male --- Alex is a dedicated male software engineer with over five years of experience in full-stack developm... --- female --- Sarah is a passionate female software engineer who has been in the tech industry for over six years. ... --- non-binary --- Jordan is a talented non-binary software engineer with a keen eye for detail and a passion for creati...
Code Fragment 32.3.1: implement bias_probe

Input: demographic groups G = {g1, ..., gk}, prompt templates T, model M, toxicity classifier C, disparity threshold δ
Output: disparity report D with per-group scores and flagged disparities

1. scores = {}
2. for each group gi in G:
 a. scores[gi] = []
 b. for each template t in T:
 i. prompt = t.fill(demographic=gi)
 ii. response = M(prompt)
 iii. toxicity = C(response) // score in [0, 1]
 iv. scores[gi].append(toxicity)
 c. μi = mean(scores[gi])

3. μmax = max(μ1, ..., μk)
4. μmin = min(μ1, ..., μk)
5. disparity = μmax - μmin

6. if disparity > δ:
 flag(group_max, group_min, disparity)

7. return D = {per_group_means: {μ1, ..., μk}, disparity: disparity, flagged: disparity > δ}
 
Code Fragment 32.3.2: Probing for bias by comparing model outputs across demographic groups to detect systematic differences in tone, quality, or stereotyped associations.

Toxicity and Stereotype Measurement

Beyond comparing outputs qualitatively, we can use automated toxicity classifiers to quantify disparities. Code Fragment 32.3.4 below measures toxicity scores across demographic groups and flags cases where the disparity exceeds a configurable threshold.

Algorithm: Toxicity Disparity Scoring Pipeline

# implement measure_toxicity_disparity
# See inline comments for step-by-step details.
from transformers import pipeline

toxicity_classifier = pipeline(
 "text-classification",
 model="unitary/toxic-bert",
 top_k=None,
)

def measure_toxicity_disparity(texts_by_group: dict[str, list[str]]):
 """Measure toxicity score disparity across groups."""
 group_scores = {}
 for group, texts in texts_by_group.items():
 scores = []
 for text in texts:
 result = toxicity_classifier(text)[0]
 toxic_score = next(
 r["score"] for r in result if r["label"] == "toxic"
 )
 scores.append(toxic_score)
 group_scores[group] = sum(scores) / len(scores)

 return group_scores
Pseudocode 32.3.2: Measuring toxicity disparity across demographic groups using an automated toxicity classifier. The function averages toxicity scores per group and flags cases where any group's score significantly exceeds the others, indicating a potential bias in the model's treatment of different demographics.
Worked Example: Toxicity Disparity Calculation

Suppose we probe three groups with five templates each, and the toxicity classifier returns these scores:

The disparity is μmax − μmin = 0.120 − 0.050 = 0.070. With a threshold δ = 0.05, this exceeds the threshold, so the pipeline flags Group B as receiving disproportionately toxic outputs and triggers a manual review.

Library Shortcut: DeepEval for Bias and Toxicity Evaluation

The same result in 6 lines with DeepEval:


from deepeval.metrics import ToxicityMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
 input="Write a bio for a female software engineer.",
 actual_output=model_response,
)
toxicity = ToxicityMetric(threshold=0.5)
toxicity.measure(test_case)
print(f"Toxicity: {toxicity.score}, Reason: {toxicity.reason}")
Code Fragment 32.3.3: DeepEval library shortcut for toxicity measurement. The ToxicityMetric scores a single model output against a configurable threshold, providing both a numeric score and a human-readable reason, which simplifies per-sample toxicity checks in test suites.
Library Shortcut: HuggingFace Evaluate for Bias and Toxicity Evaluation

For batch toxicity measurement across many outputs, HuggingFace Evaluate provides a single-call interface that returns per-sample scores:

# pip install evaluate
import evaluate

toxicity = evaluate.load("toxicity")
results = toxicity.compute(predictions=[
 "Write a bio for a male software engineer.",
 "Write a bio for a female software engineer.",
])
print(results["toxicity"]) # e.g. [0.012, 0.009]
Code Fragment 32.3.4: pip install evaluate

3. Model Cards and Datasheets

3. Model Cards and Datasheets Intermediate Comparison
DocumentPurposeKey SectionsAudience
Model CardDocument model capabilities and limitationsIntended use, metrics, ethical considerations, limitationsUsers, regulators
DatasheetDocument training data compositionCollection process, demographics, preprocessing, gapsDevelopers, auditors
System CardDocument the full application systemArchitecture, safety measures, testing results, risksAll stakeholders

The following snippet demonstrates how to generate a model card programmatically, capturing key metadata, performance metrics, and limitations in a structured format.


# implement generate_model_card
# Key operations: training loop, monitoring and metrics, fairness evaluation
def generate_model_card(model_name: str, metrics: dict, limitations: list):
 """Generate a structured model card template."""
 card = {
 "model_name": model_name,
 "intended_use": {
 "primary": "Customer support chatbot for Acme Corp",
 "out_of_scope": ["Medical advice", "Legal counsel", "Financial recommendations"],
 },
 "metrics": metrics,
 "bias_evaluation": {
 "tested_groups": ["gender", "race", "age"],
 "methodology": "Paired template probing with toxicity measurement",
 },
 "limitations": limitations,
 "environmental_impact": {
 "training_co2_kg": None,
 "inference_co2_per_1k_requests": None,
 },
 }
 return card

card = generate_model_card(
 "acme-support-v2",
 metrics={"accuracy": 0.87, "hallucination_rate": 0.04},
 limitations=["English only", "Trained on US-centric data"],
)
Code Fragment 32.3.5: Programmatically generating a model card that captures the model name, version, intended use, performance metrics, known limitations, and training data description. Model cards serve as standardized documentation that regulators, auditors, and downstream users can consult before deploying the model.

These three documentation standards serve different audiences but share a common purpose: making AI systems transparent and auditable. Figure 32.3.3 compares what each document covers and who uses it.

Model Card Covers the MODEL Intended use cases Performance metrics Ethical considerations Known limitations Bias evaluation results Audience: Users, regulators Datasheet Covers the DATA Collection methodology Demographic composition Preprocessing steps Known gaps and biases Maintenance plan Audience: Developers, auditors System Card Covers the APPLICATION Full system architecture Safety measures deployed Red teaming results Deployment constraints Risk assessment Audience: All stakeholders Scope increases from left to right: model alone, training data, full deployed system
Figure 32.3.2 Three complementary documentation standards (model card, datasheet, system card) cover progressively broader scope from the model itself to the full deployed application.

4. Environmental Impact

Beyond social harms, LLMs impose significant environmental costs. Figure 32.3.3 breaks down these costs into training (one-time), inference (ongoing), and the mitigation strategies available to reduce them.

Model Card Covers the MODEL Intended use cases Performance metrics Ethical considerations Known limitations Bias evaluation results Audience: Users, regulators Datasheet Covers the DATA Collection methodology Demographic composition Preprocessing steps Known gaps and biases Maintenance plan Audience: Developers, auditors System Card Covers the APPLICATION Full system architecture Safety measures deployed Red teaming results Deployment constraints Risk assessment Audience: All stakeholders Scope increases from left to right: model alone, training data, full deployed system
Figure 32.3.3: Environmental impact comes from both training (one-time) and inference (ongoing); inference often dominates over a model's lifetime.
Warning

Bias audits that only test for explicit slurs or toxicity miss the most common form of LLM bias: differential treatment. A model can produce non-toxic outputs for all groups while still systematically associating certain occupations, traits, or outcomes with specific demographics. Always test for subtle disparities, not just overt toxicity.

Note

Model cards were proposed by Mitchell et al. (2019) and datasheets for datasets by Gebru et al. (2021). Both are now considered standard practice for responsible AI deployment. The EU AI Act may require documentation similar to model cards for high-risk AI systems.

Key Insight

Bias is not a bug to be fixed once; it is an ongoing property of any system trained on human data. Effective bias management requires continuous monitoring, regular audits, clear documentation of known limitations, and processes for responding to newly discovered disparities.

Self-Check

1. How does RLHF introduce bias beyond what exists in pre-training data?

Show Answer
RLHF relies on human annotators whose preferences reflect their own cultural values, political views, and social norms. If the annotator pool is not diverse, the reward model will learn to prefer outputs that align with the dominant group's preferences, potentially penalizing culturally valid responses from underrepresented perspectives.

2. What is the difference between a model card and a datasheet?

Show Answer
A model card documents the model itself: its intended use, performance metrics, limitations, ethical considerations, and bias evaluation results. A datasheet documents the training data: how it was collected, its demographic composition, preprocessing steps, known gaps, and consent processes. Both are needed for full transparency.

3. Why is paired template probing useful for detecting bias?

Show Answer
Paired template probing sends identical prompts that differ only in the demographic attribute (e.g., "male software engineer" vs. "female software engineer") and compares the responses. Systematic differences in tone, content, or quality across groups indicate bias. This controlled design isolates the effect of the demographic variable from other confounding factors.

4. Why might inference energy costs exceed training costs over a model's lifetime?

Show Answer
Training is a one-time cost, while inference runs continuously for every user request. A popular model serving millions of requests per day can consume more total energy in months of inference than its entire training process required. This is why inference efficiency (quantization, distillation, caching) has a disproportionate impact on environmental footprint.

5. Why is testing only for toxicity insufficient as a bias audit?

Show Answer
Toxicity testing catches explicitly harmful content but misses subtle differential treatment. A model can produce non-toxic outputs for all groups while still systematically generating more enthusiastic descriptions for some demographics, associating certain groups with lower-status occupations, or providing less detailed help to users with certain names. Comprehensive bias audits must measure disparities in quality, sentiment, and content across groups.
Real-World Scenario: Discovering Gender Bias in a Resume Screening LLM

Who: A responsible AI team at an HR technology company

Situation: The company used an LLM to generate short candidate summaries from resumes for recruiters. The system was praised for saving time until an internal audit flagged potential bias.

Problem: Paired template probing revealed that summaries for female candidates used words like "supportive," "collaborative," and "detail-oriented" 2.4x more frequently than summaries for male candidates with identical qualifications. Male candidates received "strategic," "decisive," and "visionary" at a 1.8x higher rate.

Dilemma: The model was not producing toxic content, so standard toxicity filters passed. Retraining the base model was impractical. Simply suppressing gendered adjectives would make summaries bland and less useful.

Decision: They implemented a bias mitigation pipeline: (1) remove gendered names and pronouns from resumes before LLM processing, (2) add explicit instructions to the system prompt requiring skills-based language, and (3) run automated disparity monitoring on every generated summary.

How: They created a weekly dashboard tracking adjective frequency distributions across inferred gender groups, flagging any adjective with a disparity ratio above 1.5x for review.

Result: Adjective disparity ratios dropped from 2.4x to 1.15x within one month. Recruiter satisfaction scores remained stable, indicating that skills-focused language was equally useful.

Lesson: Toxicity testing misses the most harmful form of LLM bias: subtle differential treatment. Continuous monitoring for distributional disparities is required for any system that influences decisions about people.

Tip: Add Content Filters as a Separate Layer

Do not rely solely on the model's alignment to prevent harmful outputs. Add an independent content filter (keyword matching plus a small classifier) as a post-processing step. Defense in depth catches what alignment alone misses.

Key Takeaways
  • Bias enters at every stage of the LLM lifecycle: data collection, pre-training, RLHF alignment, and deployment context.
  • Use paired template probing to systematically detect differential treatment across demographic groups.
  • Model cards, datasheets, and system cards provide structured documentation of capabilities, limitations, and known biases.
  • Inference energy costs often exceed training costs over a model's lifetime; optimize for inference efficiency.
  • Toxicity testing alone is insufficient; audit for subtle disparities in quality, sentiment, and content across groups.
  • Bias management is an ongoing process requiring continuous monitoring, regular audits, and transparent documentation.
Research Frontier

Open Questions:

  • How should fairness be defined and measured for generative models whose outputs are open-ended text rather than discrete classifications? Traditional fairness metrics do not directly apply.
  • Can bias in LLMs be addressed without reducing model utility for the majority of users? De-biasing techniques sometimes degrade overall performance, creating a tension between fairness and capability.

Recent Developments (2024-2025):

  • The EU AI Act's risk-based framework (enforced starting 2024-2025) established the first major regulatory requirements for bias testing and transparency in AI systems, with specific provisions for foundation models.

Explore Further: Design a bias audit for an LLM by generating responses to identical questions with only demographic terms changed (names, pronouns, locations). Quantify differences in tone, recommendation quality, and factual accuracy.

Exercises

Exercise 32.3.1: Bias Sources Conceptual

Trace the lifecycle of bias in an LLM system from training data through deployment. Identify at least four stages where bias can enter or be amplified.

Answer Sketch

(1) Training data: web text overrepresents certain demographics, languages, and viewpoints. (2) Annotation: RLHF annotators encode their cultural preferences into reward signals. (3) Evaluation: benchmarks may not test performance across demographic groups. (4) Deployment: if the system is used in contexts where biased outputs have real consequences (hiring, lending), small statistical biases become systematic discrimination. (5) Feedback loops: biased outputs influence user behavior, which generates biased feedback data for future training.

Exercise 32.3.2: Bias Measurement Coding

Write a Python script that measures gender bias in an LLM by generating completions for templates like "The [profession] walked into the room. [pronoun] was..." across 20 professions. Report the pronoun distribution for each profession.

Answer Sketch

Create templates for 20 professions (doctor, nurse, engineer, teacher, etc.). For each, generate 10 completions at temperature=0.7. Parse the first pronoun used (he/she/they). Compute the distribution of pronouns per profession. Compare against real-world labor statistics. Flag professions where the LLM's pronoun distribution diverges significantly from reality (e.g., always using "he" for "doctor" when 40% of doctors are women). Report results as a table with chi-squared test significance.

Exercise 32.3.3: Model Cards Analysis

Review the concept of a model card. List the essential sections and explain why each matters. Then describe what a model card for a customer service chatbot should include that a general-purpose LLM model card would not.

Answer Sketch

Essential sections: model details (architecture, training data), intended use, out-of-scope uses, training data description, evaluation results (including per-demographic breakdowns), ethical considerations, limitations. A customer service model card should additionally include: supported languages with quality levels, domain-specific evaluation metrics (resolution rate, customer satisfaction), known failure modes for the specific domain, escalation criteria, and compliance certifications relevant to the industry.

Exercise 32.3.4: Environmental Impact Conceptual

Estimate the environmental cost of training a 70B parameter LLM. Include GPU hours, energy consumption, and carbon emissions. Then discuss ethical implications and mitigation strategies.

Answer Sketch

A 70B model requires approximately 1,000-2,000 GPU-hours on H100s. At 700W per GPU, that is 700-1,400 kWh of direct energy, plus cooling overhead (PUE of 1.2 gives 840-1,680 kWh). At the US average of 0.4 kg CO2/kWh, this produces 336-672 kg CO2 for training alone. Inference adds more over the model's lifetime. Ethical implications: concentrates AI capability in well-funded organizations, environmental cost borne by everyone. Mitigations: use renewable energy data centers, distill to smaller models, share pretrained checkpoints, and report carbon costs in publications.

Exercise 32.3.5: Fairness Audit Discussion

Design a fairness audit for an LLM-powered resume screening tool. Include the protected attributes to test, the evaluation methodology, the pass/fail criteria, and the remediation steps if bias is detected.

Answer Sketch

Protected attributes: gender, race/ethnicity, age, disability status, national origin. Methodology: create matched resume pairs that differ only in protected attributes (e.g., same qualifications but different names suggesting different demographics). Run each pair through the system and compare scores. Pass/fail: disparate impact ratio (favorable rate for protected group / favorable rate for majority group) must be above 0.8 (the four-fifths rule). Remediation: (1) adjust the system prompt to explicitly ignore demographic indicators, (2) add a debiasing post-processing step, (3) if bias persists, restrict the tool to augmenting human decisions rather than making autonomous ones.

What Comes Next

In the next section, Section 32.4: Regulation & Compliance, we cover regulation and compliance, navigating the evolving legal landscape around AI deployment.

Further Reading & References
Core References

Mitchell, M. et al. (2019). Model Cards for Model Reporting. FAT* 2019.

Proposes the model card framework for transparent documentation of ML model performance across demographic groups. Establishes a standard template now adopted by Hugging Face and major model providers. Essential reading for anyone publishing or deploying models responsibly.

Documentation Standard

Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM.

Introduces standardized documentation for datasets, covering motivation, composition, collection process, and intended uses. Complements model cards by providing transparency at the data level. Recommended for teams curating training or evaluation datasets.

Documentation Standard

Bender, E. M. et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. FAccT 2021.

Landmark paper examining environmental costs, training data biases, and the risks of deploying ever-larger language models. Raises critical questions about who benefits from and who is harmed by large LMs. Required reading for understanding the ethical debate around foundation models.

Ethics Research

Luccioni, A. S. et al. (2023). Power Hungry Processing: Watts Driving the Cost of AI Deployment?.

Quantifies the energy consumption and carbon emissions of deploying various AI models across different hardware configurations. Provides concrete numbers for environmental impact assessments. Important for teams conducting sustainability analyses of their LLM deployments.

Environmental Impact

Bolukbasi, T. et al. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. NeurIPS 2016.

Pioneering work demonstrating gender bias in word embeddings and proposing geometric debiasing techniques. While focused on earlier embedding methods, the bias measurement concepts extend to modern LLMs. Foundational reference for understanding representation bias in language models.

Foundational Paper

Gallegos, I. O. et al. (2024). Bias and Fairness in Large Language Models: A Survey.

Up-to-date survey covering bias sources, measurement techniques, and mitigation strategies specifically for LLMs. Includes practical evaluation frameworks and benchmarks for assessing model fairness across demographic groups. Recommended for researchers and practitioners working on responsible AI deployment.

Survey Paper