Section 56.2

Libraries and Frameworks

"AIF360 says 0.82, Fairlearn says 0.83, Aequitas says 0.81, and the lawsuit lives in the second decimal place. Pick a library and document why."

GuardGuard, Six-Layer Responsible-AI Toolbelt AI Agent
Big Picture

The responsible AI library landscape in 2026 partitions into six layers: fairness metric and mitigation libraries (AI Fairness 360, Fairlearn, Aequitas, FairML, Themis) that compute disparate-impact-style statistics and train constrained classifiers; explainability libraries (SHAP, LIME, Captum, TransformerLens, BertViz, Inseq, Ecco) that attribute a prediction to inputs or activations; counterfactual generators (DiCE, Alibi, Wachter counterfactuals) that produce minimal-change inputs to flip a prediction; LLM bias and red-team suites (BBQ scripts, BOLD, StereoSet, CrowS-Pairs, HELM bias slice, PyRIT, garak) for generative-model evaluations; watermarking and provenance libraries (Kirchenbauer wmark, SynthID-Text, C2PA, Adobe Content Authenticity SDK, Project Origin) for output attribution; and privacy and federated-learning libraries (Opacus, Google DP, IBM Diffprivlib, Flower, FedML, PySyft) for differential-privacy training and decentralized data. This section is a tour with opinionated pick-when guidance.

Prerequisites

This section assumes the responsible-AI platforms from Section 56.1, the differential-privacy fundamentals from Section 53.3, and the LLM-watermarking techniques from Section 54.2.

The shape of the stack converges on a familiar pattern: a single application typically combines four to seven libraries, each owning one slice (fairness metric, explainability attribution, watermark, privacy noise injection, runtime guard), plus a thin glue layer that ties them to the model and pipeline. Picking each layer well matters because the wrong choice locks out integrations (Captum is PyTorch-only; Fairlearn is sklearn-shaped; SynthID image and audio watermarking is tied to Google-hosted models) and produces subtle measurement bugs (different fairness libraries disagree on tie-breaking and rounding in disparate-impact ratios, which leaves teams arguing past each other).

56.2.1 Fairness metric and mitigation libraries

Three small cartoon kitchen scales sit on a counter, each labelled AIF360, Fairlearn, and Aequitas, all weighing the same little package labelled Dataset but showing slightly different readouts of 0.82, 0.83, and 0.81, while a confused engineer leans in with a magnifying glass.
Figure 56.2.1: Fairness libraries are like kitchen scales. They all measure the same thing, disagree in the second decimal place, and that decimal is where the lawsuits live.

Run the same dataset through AIF360 and Fairlearn and you may get disparate-impact ratios that disagree in the second or third decimal place, which is where the lawsuit lives. Fairness libraries compute group-disparity statistics (disparate impact, demographic parity, equal opportunity, equalized odds, predictive parity) and ship bias-mitigation algorithms that apply at one of three stages: pre-processing the data, in-processing during training, or post-processing the predictions. Picking the library shapes both what you can measure and which mitigation knobs you have.

Algorithm 56.2.1: Fairness Metrics Primer

Let $A$ denote a protected attribute, $Y \in \{0,1\}$ the ground-truth label, and $\hat{Y} \in \{0,1\}$ the model's prediction. The four canonical group-fairness criteria are:

Demographic parity (also called statistical parity or independence): the positive-prediction rate is equal across groups, $P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)$. This is what disparate-impact ratios measure.

Equalized odds (Hardt, Price, Srebro 2016): the true-positive rate and false-positive rate are both equal across groups, $P(\hat{Y}=1 \mid A=a, Y=y) = P(\hat{Y}=1 \mid A=a', Y=y)$ for $a \neq a'$ and $y \in \{0,1\}$.

Equal opportunity: equalized odds restricted to $Y=1$, i.e. only the true-positive rates must match across groups. Weaker than full equalized odds, often the operational target when false-negatives carry the dominant social cost.

The 4/5ths rule (disparate impact): a US-EEOC rule-of-thumb that flags potential discrimination when $\frac{P(\hat{Y}=1 \mid A=0)}{P(\hat{Y}=1 \mid A=1)} < 0.80$ (with $A=0$ the disadvantaged group). Equivalently, a model passes the rule when its demographic-parity ratio is $\geq 0.80$.

Worked example showing the demographic-parity vs equalized-odds tension. Consider a toy dataset with $1000$ rows split into two groups $A=0$ ($n_0 = 500$) and $A=1$ ($n_1 = 500$), with base rates $P(Y=1 \mid A=0) = 0.10$ (50 positives) and $P(Y=1 \mid A=1) = 0.10$ (50 positives), totalling 100 positives and 900 negatives. Suppose the model achieves $\text{TPR}_0 = \text{TPR}_1 = 0.80$ (equal opportunity holds) and $\text{FPR}_0 = 0.05$, $\text{FPR}_1 = 0.15$ (full equalized odds fails). Then $P(\hat{Y}=1 \mid A=0) = 0.80 \cdot 0.10 + 0.05 \cdot 0.90 = 0.125$ and $P(\hat{Y}=1 \mid A=1) = 0.80 \cdot 0.10 + 0.15 \cdot 0.90 = 0.215$, giving a demographic-parity ratio of $0.125 / 0.215 \approx 0.58$, well below the 4/5ths threshold. Because the base rates and TPRs are equal here, the two criteria move together: the selection rates match exactly when the FPRs match, so post-processing that equalizes the FPRs (raising $\text{FPR}_0$ or lowering $\text{FPR}_1$) restores both equalized odds and demographic parity. The tension appears once base rates differ. If $P(Y=1 \mid A=1)$ were $0.30$ with $\text{TPR} = 0.80$ and $\text{FPR} = 0.05$ in both groups, equalized odds would hold, yet the selection rates would be $0.125$ and $0.80 \cdot 0.30 + 0.05 \cdot 0.70 = 0.275$, a ratio of about $0.45$; forcing the rates to match would then require group-specific error rates. This toy demonstration captures the core impossibility: when base rates are equal, the two criteria can be satisfied together; when base rates differ across groups (as in COMPAS), no classifier whose predictions depend on the label can satisfy both, an intuition formalized in Section 56.3.
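The arithmetic in the worked example can be checked in a few lines of Python. This is a minimal sketch: the helper name `selection_rate` and the second scenario (a base rate of 0.30 for group 1, with identical error rates in both groups) are illustrative assumptions, not part of any library.

```python
def selection_rate(base_rate, tpr, fpr):
    """P(Y_hat = 1) for one group, by the law of total probability."""
    return tpr * base_rate + fpr * (1.0 - base_rate)

# Scenario from the text: equal base rates, equal TPR, unequal FPR.
rates = {
    0: selection_rate(base_rate=0.10, tpr=0.80, fpr=0.05),
    1: selection_rate(base_rate=0.10, tpr=0.80, fpr=0.15),
}
dp_ratio = min(rates.values()) / max(rates.values())
print(f"selection rates: {rates[0]:.3f} vs {rates[1]:.3f}")
print(f"demographic-parity ratio: {dp_ratio:.2f} (4/5ths rule passes: {dp_ratio >= 0.80})")

# Hypothetical variation: identical TPR and FPR in both groups (equalized odds
# holds) but unequal base rates. Demographic parity still fails.
rates_unequal = {
    0: selection_rate(base_rate=0.10, tpr=0.80, fpr=0.05),
    1: selection_rate(base_rate=0.30, tpr=0.80, fpr=0.05),
}
dp_ratio_unequal = min(rates_unequal.values()) / max(rates_unequal.values())
print(f"unequal base rates: ratio {dp_ratio_unequal:.2f}")
```

The second scenario is the one that matters in practice: the error rates are identical across groups, yet the selection-rate ratio falls well below 0.80 purely because the groups have different base rates.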

Library Shortcut
Fairlearn MetricFrame for grouped fairness metrics

The math above (TPR, FPR, demographic-parity ratio) is what you would compute by hand; the production form is one call to MetricFrame, which slices any sklearn-style metric by a sensitive attribute and returns a tidy DataFrame plus group-disparity summaries. Prefer Fairlearn when the team is on sklearn and the goal is "drop fairness reporting into an existing classifier" rather than re-architect the training loop; AIF360 is the alternative when you need the broader mitigation-algorithm catalog.

# Install first: pip install fairlearn
from fairlearn.metrics import (
    MetricFrame, selection_rate, demographic_parity_ratio,
    false_positive_rate,  # provided by fairlearn.metrics; sklearn.metrics has no such function
)
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics={"accuracy": accuracy_score,
             "selection_rate": selection_rate,
             "fpr": false_positive_rate},
    y_true=y_test, y_pred=y_pred, sensitive_features=A_test,
)
print(mf.by_group)              # per-group accuracy / selection / FPR
print(mf.difference())          # max-min gap per metric
print(demographic_parity_ratio(y_true=y_test, y_pred=y_pred,
                               sensitive_features=A_test))  # 0.0-1.0
Code Fragment 56.2.1.1: A grouped fairness report and disparity ratio in one MetricFrame.

56.2.2 Explainability libraries

Explainability libraries answer "why did the model produce this output?" via input attribution (per-feature contribution), local approximation (a simple model that mimics the complex one near a point), or activation analysis (inspecting internal model states).

Algorithm 56.2.2: SHAP Shapley value and its four axioms

For a model with feature set $F$ and a value function $v(S)$ giving the model's expected output when features in $S \subseteq F$ are fixed to their input values (and the rest marginalized), the SHAP value of feature $i$ is the Shapley value from cooperative game theory (Shapley 1953):

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \,\bigl[v(S \cup \{i\}) - v(S)\bigr]$$

This is the unique attribution that satisfies four axioms (Shapley 1953; Lundberg & Lee 2017 restate them for model attribution as local accuracy, missingness, and consistency). Efficiency: the attributions sum to the prediction minus the baseline, $\sum_i \phi_i = v(F) - v(\emptyset)$. Symmetry: if two features contribute identically to every coalition, they receive equal attribution. Dummy: a feature that adds nothing to any coalition receives $\phi_i = 0$. Additivity: for a sum of two models $f_1 + f_2$, the attributions add, $\phi_i^{f_1+f_2} = \phi_i^{f_1} + \phi_i^{f_2}$, which is what makes SHAP compose over ensemble methods.

The exact computation is $O(2^{|F|})$, infeasible beyond ~25 features. KernelSHAP (model-agnostic) approximates via weighted linear regression on sampled coalitions, with cost $O(K \cdot |F|)$ for $K$ samples; in practice $K \in [200, 10000]$ trades variance for speed. TreeSHAP (tree-ensemble-specific, Lundberg et al. 2020) exploits the tree structure to compute exact Shapley values in $O(T L D^2)$ per prediction, where $T$ is the number of trees, $L$ the maximum number of leaves per tree, and $D$ the maximum depth, making it tractable on production-scale XGBoost / LightGBM models. That polynomial-time exactness is why TreeSHAP is the default explainer when the model is tree-shaped, and why some teams on neural models distill them into a tree surrogate before explaining, at the cost of explaining the surrogate rather than the original model.
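To make the formula concrete, the brute-force sketch below enumerates every coalition for a three-feature toy model and checks the efficiency axiom. The model `f(x) = 2*x0 + 3*x1*x2`, the all-zeros baseline, and the function names are illustrative assumptions; absent features are simply set to the baseline, which is a simplification of the marginalization described above and is not how the `shap` package computes values.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions; cost is O(2^n)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (|F| - |S| - 1)! / |F|! for coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical toy model, explained at x = (1, 1, 1) against a zero baseline.
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)

def model(z):
    return 2.0 * z[0] + 3.0 * z[1] * z[2]

def v(coalition):
    # Features in the coalition take their input value; the rest take the baseline.
    z = tuple(x[j] if j in coalition else baseline[j] for j in range(len(x)))
    return model(z)

phi = shapley_values(v, n_features=3)
print(phi)
print(sum(phi), v(frozenset({0, 1, 2})) - v(frozenset()))  # efficiency: these match
```

The additive term goes entirely to feature 0, and the interaction term `3*x1*x2` is split equally between features 1 and 2, which is the symmetry axiom at work.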

56.2.3 Counterfactual generators

Counterfactual explanations answer "what minimal change to the input would have changed the prediction?" They are an alternative to feature attribution that is often more actionable for end users (telling a denied credit applicant which feature value would have flipped the decision).
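For intuition, the sketch below finds the nearest counterfactual for a linear scorer, where the minimal L2 change has a closed form (a projection onto the decision boundary). It is not the algorithm used by DiCE or Alibi, which handle non-linear models, feature constraints, and diversity; the credit model, its weights, and the function names are hypothetical.

```python
def linear_score(w, b, x):
    """Decision score of a linear model; approve when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def nearest_counterfactual(w, b, x, margin=1e-6):
    """Smallest L2 change to x that moves the score just across zero."""
    score = linear_score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    # Aim slightly past the boundary on the opposite side of the current score.
    target = margin if score < 0 else -margin
    step = (target - score) / norm_sq
    # Moving along w is the shortest path to the hyperplane w.x + b = target.
    return [xi + step * wi for xi, wi in zip(x, w)]

# Hypothetical credit model over (income in $10k, debt ratio).
w, b = [0.5, -4.0], -1.0
applicant = [3.0, 0.4]  # score = 1.5 - 1.6 - 1.0 = -1.1, so the applicant is denied

cf = nearest_counterfactual(w, b, applicant)
changes = [c - a for a, c in zip(applicant, cf)]
print("original score:", linear_score(w, b, applicant))
print("counterfactual:", cf, "score:", linear_score(w, b, cf))
print("required change per feature:", changes)
```

A production generator would add constraints this sketch ignores, such as keeping immutable features fixed and restricting changes to plausible, actionable values.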

56.2.4 LLM bias and red-team suites

LLM-era bias evaluation requires datasets and harness scripts beyond the tabular fairness libraries. The 2024-26 toolkit clusters around dataset-specific evaluation scripts and red-team automation frameworks.

56.2.5 Watermarking libraries

Watermarking libraries embed a detectable signal in model outputs (images, audio, text) so downstream tools can attribute provenance. The 2024-26 watermarking literature has split into text watermarks (Kirchenbauer green/red list, SynthID-Text, undetectable variants) and image / audio watermarks (SynthID, Stable Signature, AudioSeal).

The recurring worry with all of these is robustness to a determined adversary, who will try to wash the watermark out by paraphrasing or re-editing the content, as Figure 56.2.2 dramatizes.

A cartoon scientist labelled Kirchenbauer Watermark stamps a green tint onto a passing sheet of paper; downstream, a small forger character labelled Paraphraser scrubs the same paper in a wash basin, but a faint green dye still clings to it.
Figure 56.2.2: Why watermark robustness is a research problem, not a solved one. The green-list signal survives a paraphrase scrub only partially; a strong-enough rewrite eventually launders it, which is exactly what undetectable-watermark and robustness research is trying to quantify.
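The detection side of a green/red-list watermark reduces to a one-proportion z-test on the count of green tokens, which the toy sketch below illustrates. It is a simplification under stated assumptions: the green list is decided by hashing each (previous token, token) pair rather than by a seeded permutation of the vocabulary, generation uses "hard" green-only sampling from a made-up vocabulary, and the paraphraser just swaps tokens at random. None of this is the reference implementation.

```python
import hashlib
import math
import random

GAMMA = 0.5  # fraction of candidate tokens that count as "green" at each step

def is_green(prev_token, token, gamma=GAMMA):
    """Toy stand-in for the seeded vocabulary partition: hash the token pair."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def z_score(tokens, gamma=GAMMA):
    """z-statistic for the number of green tokens; large values suggest a watermark."""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

rng = random.Random(0)
vocab = [f"tok{i}" for i in range(1000)]  # hypothetical vocabulary

def generate(n, watermark):
    out = [rng.choice(vocab)]
    while len(out) < n:
        candidate = rng.choice(vocab)
        # Hard watermark: only accept candidates that are green given the previous token.
        if not watermark or is_green(out[-1], candidate):
            out.append(candidate)
    return out

def paraphrase(tokens, swap_prob):
    """Crude attack: replace each token with a random one with probability swap_prob."""
    return [rng.choice(vocab) if rng.random() < swap_prob else t for t in tokens]

plain = generate(200, watermark=False)
marked = generate(200, watermark=True)
paraphrased = paraphrase(marked, swap_prob=0.5)

for name, toks in [("plain", plain), ("marked", marked), ("paraphrased", paraphrased)]:
    print(f"{name:12s} z = {z_score(toks):.2f}")
```

The paraphrased text typically scores between the other two: token pairs that survive the rewrite keep their green status, which is the partial survival the figure depicts, and heavier rewriting pushes the score back toward the unwatermarked range.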

56.2.6 Provenance and content-credentials libraries

Provenance libraries record where a piece of content came from and how it was edited, complementing watermarks by attaching cryptographically signed metadata rather than embedding a statistical signal.
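The core idea, a signed claim that binds a content hash to an edit history, can be sketched with the standard library alone. This is an illustration only: real C2PA manifests use certificate-based signatures and a defined binary container, whereas the sketch below substitutes an HMAC with a shared demo key, and the manifest layout and function names are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical shared key; a stand-in for the signer's certificate-backed private key.
SIGNING_KEY = b"demo-key-not-for-production"

def make_manifest(content: bytes, actions: list) -> dict:
    """Bind the content hash and edit history together under one signature."""
    claim = {"content_sha256": hashlib.sha256(content).hexdigest(),
             "actions": actions}
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}

def verify(content: bytes, manifest: dict) -> bool:
    """Check that the claim is untampered and still matches the content bytes."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, manifest["signature"])
    hash_ok = manifest["claim"]["content_sha256"] == hashlib.sha256(content).hexdigest()
    return signature_ok and hash_ok

image = b"\x89PNG...generated image bytes..."
manifest = make_manifest(image, actions=["created:model-x", "edited:crop"])

print(verify(image, manifest))               # intact content and manifest
print(verify(image + b"tamper", manifest))   # content changed after signing
```

Unlike a watermark, this metadata travels beside the content rather than inside it, so it is exact and tamper-evident but is lost whenever a platform strips the manifest.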

56.2.7 Differential privacy and federated learning libraries

Privacy-preserving ML libraries split into differential privacy (training with calibrated noise so that any single individual's data has a provably bounded influence on the model) and federated learning (training across devices or organizations without centralizing the data).

Key Insight
What $(\epsilon, \delta)$-DP actually bounds, and what Opacus injects

A randomized mechanism $M$ is $(\epsilon, \delta)$-differentially private (Dwork et al. 2006, 2014) if for any two neighboring datasets $D, D'$ differing in one record and any measurable output set $S$,

$$\Pr[M(D) \in S] \le e^{\epsilon} \,\Pr[M(D') \in S] + \delta.$$

Operationally, an adversary observing $M$'s output cannot distinguish whether any single individual was in the training set with confidence better than the bound above; $\epsilon$ governs the multiplicative leakage and $\delta$ the catastrophic-failure probability (usually $\delta \ll 1/n$ where $n$ is dataset size). The Gaussian mechanism achieves $(\epsilon, \delta)$-DP by adding noise $\mathcal{N}(0, \sigma^2 I)$ to a function of sensitivity $\Delta_2$ (the maximum L2-norm change in output when one record is added or removed), with noise scale $\sigma \ge c \cdot \Delta_2 / \epsilon$ where $c = \sqrt{2 \ln(1.25/\delta)}$ for the basic mechanism (a bound that holds for $\epsilon < 1$; the accounting is tighter under the moments accountant, Abadi et al. 2016). This is what Opacus applies during DP-SGD: per-sample gradients are clipped to norm $C$ (bounding $\Delta_2$), then Gaussian noise with the calibrated $\sigma$ is added to the aggregated mini-batch gradient. Across $T$ training steps the privacy cost composes; the Rényi-DP / moments accountant lets practitioners report a final $(\epsilon, \delta)$ rather than the loose $T \cdot \epsilon$ from naive composition.
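The clip-then-noise step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Opacus's implementation: the function names are hypothetical, gradients are plain lists, `noise_multiplier` plays the role of $\sigma / C$, and no privacy accounting is performed.

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale for the basic Gaussian mechanism (classical bound, epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

def clip(grad, max_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm on the summed gradient.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [value / len(per_sample_grads) for value in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # hypothetical per-sample gradients

print("sigma for eps=0.5, delta=1e-5:", gaussian_sigma(0.5, 1e-5, sensitivity=1.0))
print("clipped:", [clip(g, max_norm=1.0) for g in grads])
print("noisy mean gradient:", dp_sgd_step(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping is what makes the sensitivity known in advance: after it, no single example can move the summed gradient by more than `max_norm`, so the noise can be calibrated without inspecting the data.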

Typical published $\epsilon$ values for production deployments: Apple's on-device learning operates around $\epsilon \approx 8$ per release, Google's Gboard next-word prediction reports $\epsilon$ in the single-digit-to-low-double-digit range, and the Census Bureau's 2020 Disclosure Avoidance System used $\epsilon \approx 19.6$ for the redistricting data product. The folk threshold $\epsilon \le 1$ corresponds to research-grade strong privacy with substantial accuracy loss; $\epsilon \in [1, 10]$ is the production-NLP working range; $\epsilon > 100$ retains the formal label but provides little meaningful protection. Tuning $\epsilon$ in Opacus is therefore the consequential decision that sets accuracy-vs-leakage; library defaults are starting points, not policy.

56.2.8 A canonical 2026 responsible-AI stack

Real-World Scenario
A boring-but-correct 2026 responsible AI stack

Who: A 2026 production ML team adding responsible-AI tooling to an existing model pipeline.

Situation: The team had to satisfy governance reviewers (Section 56.1's platform layer), red-teamers, and privacy counsel, while continuing to ship models on the existing cadence.

Problem: The responsible-AI library landscape is large and overlapping, and ad-hoc adoption produced inconsistent CI gates, missing audit evidence, and per-team library wars.

Dilemma: Either let each team pick its own libraries (fragmentation, regulator-unfriendly inconsistency) or impose a single mandated stack and risk overreach.

Decision: They standardized on a "boring-but-correct" library stack rather than chasing novelty.

How: The stack was: Fairlearn for sklearn-shaped fairness assessment and the Reductions mitigation; SHAP for tabular model explainability plus Captum for any PyTorch-based deep model; DiCE for counterfactual explanations to end users; HELM bias slices, BBQ, BOLD, and CrowS-Pairs for LLM bias evaluation; PyRIT or garak for systematic red-teaming; SynthID-Text or Kirchenbauer wmark for text watermarking; c2pa-rs / Adobe Content Authenticity SDK for provenance manifests; Opacus for any differential-privacy fine-tuning.

Result: Every model release shipped with the evidence the governance platform expected (fairness slices, explanations, bias suite outputs, red-team report, provenance manifest), with no novel libraries to justify to reviewers.

Lesson: The wins in responsible-AI tooling are mostly in wiring boring libraries into CI and the model registry so every release ships the evidence the governance platform expects, not in adopting the latest research library.

Key Insight
Different fairness libraries disagree by 1-3 percentage points

A subtle and important fact: AIF360, Fairlearn, and Aequitas computing the same disparate-impact ratio on the same dataset can produce numbers that differ by 1-3 percentage points due to different tie-breaking, rounding, and treatment of NaN protected attributes. This is rarely a bug; it is the consequence of underspecified definitions in the fairness literature. In governance settings, this means "the fairness number" must always be reported with the library and version that computed it. Production teams who switch libraries mid-program often discover they have moved their reported fairness score in ways that are not real model changes.
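A small example shows how one underspecified choice moves the number. The sketch below computes the same disparate-impact ratio under two policies for rows whose protected attribute is missing. Both policies and the data are hypothetical and chosen for illustration; they do not reproduce the actual behavior of AIF360, Fairlearn, or Aequitas.

```python
import math

def disparate_impact(records, nan_policy):
    """records: (protected_attr, y_pred) pairs; attr 0.0 = unprivileged, 1.0 = privileged."""
    if nan_policy == "drop":
        rows = [(a, y) for a, y in records if not math.isnan(a)]
    elif nan_policy == "as_unprivileged":
        rows = [(0.0 if math.isnan(a) else a, y) for a, y in records]
    else:
        raise ValueError(f"unknown nan_policy: {nan_policy}")

    def selection_rate(group):
        preds = [y for a, y in rows if a == group]
        return sum(preds) / len(preds)

    return selection_rate(0.0) / selection_rate(1.0)

nan = float("nan")
records = ([(0.0, 1)] * 40 + [(0.0, 0)] * 60    # unprivileged: 40 of 100 selected
           + [(1.0, 1)] * 50 + [(1.0, 0)] * 50  # privileged: 50 of 100 selected
           + [(nan, 1)] * 1 + [(nan, 0)] * 4)   # attribute missing: 1 of 5 selected

di_drop = disparate_impact(records, "drop")
di_fold = disparate_impact(records, "as_unprivileged")
print(f"drop NaN rows:         {di_drop:.3f}")
print(f"NaN -> unprivileged:   {di_fold:.3f}")
print(f"gap in percentage points: {100 * (di_drop - di_fold):.1f}")
```

Five ambiguous rows out of 205 are enough to move the ratio from exactly 0.80 to about 0.78, which is the difference between passing and failing the 4/5ths rule, so the policy belongs in the report next to the number.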

Key Insight
Explainability libraries do not produce ground truth, they produce explanations

SHAP, LIME, Captum, and friends compute attributions according to specific axioms; different libraries' attributions can disagree even on the same model and input. The right way to use these is as one piece of evidence in a fairness or governance argument, not as ground truth about model behavior. The mechanistic-interpretability community (TransformerLens, sparse autoencoders, circuit-level work) is the deeper effort to extract ground truth, but for production explainability the conventional libraries (SHAP, Captum) are the de facto standard.

Library Shortcut
Thinnest viable fairness stack (AIF360 + Fairlearn + Aequitas)

For a team that needs to land fairness metrics, mitigation, and a regulator-readable report in a single afternoon, the canonical three-library stack covers the surface with little overlap. AIF360 supplies the broadest metric catalog (70+ statistics) and the largest mitigation algorithm set (Reweighing, Adversarial Debiasing, Calibrated Equalized Odds), making it the default when reproducing a published technique or sweeping pre-/in-/post-processing options. Fairlearn supplies the sklearn-native MetricFrame for slicing any metric by sensitive attribute, the ExponentiatedGradient and GridSearch reductions for constrained training, and tight integration with Microsoft Responsible AI Dashboard, making it the default for "bolt fairness onto an existing pipeline". Aequitas supplies the policy-style Bias Report (group-by-group disparity tables with plain-language narrative) accepted by auditors and journalists under NYC LL 144 and similar regimes, making it the default for the audit deliverable. The three are complementary, not redundant: AIF360 for the algorithm library, Fairlearn for the pipeline integration, Aequitas for the report. All three are permissively licensed (AIF360, an LF AI & Data Foundation project, under Apache 2.0; Fairlearn and Aequitas under MIT), and each installs with a single pip install. Adopt this trio first; reach for the niche libraries (FairTest, Themis, FairSD) only when a specific failure mode the trio cannot diagnose appears.

# Three-library fairness pass: metric, mitigation, and audit report.
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

# 1. Fairlearn: slice any metric by sensitive attribute.
frame = MetricFrame(metrics={"accuracy": accuracy_score,
                              "selection_rate": selection_rate},
                    y_true=y_test, y_pred=y_pred,
                    sensitive_features=A_test)
print(frame.by_group)  # per-group disparity table

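# 2. AIF360: reweighing mitigation on the training set.
#    ds_train is assumed to be an aif360 BinaryLabelDataset with a "sex" protected attribute.
from aif360.algorithms.preprocessing import Reweighing
rw = Reweighing(unprivileged_groups=[{"sex": 0}],
                privileged_groups=[{"sex": 1}])
ds_rebalanced = rw.fit_transform(ds_train)  # adjusts instance weights; features and labels are unchanged

# 3. Aequitas: regulator-readable bias report.
#    df_predictions is assumed to follow the Aequitas input schema: "score" and
#    "label_value" columns plus one column per attribute.
from aequitas.group import Group
xtab, _ = Group().get_crosstabs(df_predictions)  # per-group confusion-matrix counts and rates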
Code Fragment 56.2.2a: Three-library fairness pass: metric, mitigation, and audit report.

56.2.9 Licensing and deployment considerations

Most of the libraries in this section are permissively licensed (Apache 2.0 or MIT), with a few exceptions to know:

The most common deployment failure across these libraries is operational: a fairness library produces metrics that nobody on the team reviews, an explainability library produces explanations nobody surfaces to users, a watermarking library embeds signals nobody verifies. The library is the easy part; the integration into the team's review and incident-response workflows is the hard part. Section 56.1's governance platforms exist partly to close this gap.

56.2.10 Library evaluation checklist

The questions to ask when picking among multiple libraries in the same layer:

What's Next?

In the next section, Section 56.3: Datasets and Benchmarks, we build on the material covered here.

Further Reading
Bellamy, R. K. E., et al. (2018). "AI Fairness 360: An Extensible Toolkit for Detecting and Mitigating Algorithmic Bias." arXiv:1810.01943. arxiv.org/abs/1810.01943. The foundational fairness toolkit; defines the metric and algorithm catalog most commercial platforms implement subsets of.
Lundberg, S., & Lee, S. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS 2017. arxiv.org/abs/1705.07874. The paper introducing SHAP and the Shapley-value foundation for model-agnostic attribution.
Kirchenbauer, J., et al. (2023). "A Watermark for Large Language Models." ICML 2023. arxiv.org/abs/2301.10226. The green-list / red-list text-watermarking method whose open-source reference implementation defined the field.
Abadi, M., et al. (2016). "Deep Learning with Differential Privacy." CCS 2016. arxiv.org/abs/1607.00133. The DP-SGD foundation that Opacus and TensorFlow Privacy implement.
C2PA Coalition (2023). "C2PA Technical Specification 1.4." Coalition for Content Provenance and Authenticity. c2pa.org/specifications/specifications/1.4. The provenance standard implemented by c2pa-rs, Adobe Content Authenticity SDK, and the 2024+ OpenAI / Anthropic generation-model signings.
Wachter, S., Mittelstadt, B., & Russell, C. (2017). "Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR." Harvard Journal of Law & Technology, 31, 841. arxiv.org/abs/1711.00399. The legal-foundation paper for counterfactual explanations that DiCE, Alibi, and CARLA all build on.