🗓 Updated Jan 2025

⚠ What Is Algorithmic Bias?

Algorithmic bias occurs when an ML system produces systematically different outcomes for different demographic groups in ways that are unjustified, harmful, or discriminatory. Bias in AI is not simply a technical failure; it is a social and ethical problem that can amplify historical inequities at scale. A biased hiring algorithm that evaluates millions of applicants causes more harm than a single biased human decision-maker. Understanding where bias originates is the first step toward meaningful mitigation.

Sources of Bias

  • Historical data bias: Training data reflects past human decisions that encoded societal prejudices. A loan model trained on historical approvals may perpetuate historical discrimination against protected groups who were systematically denied credit.
  • Representation bias: Some groups are underrepresented in training data, so the model performs worse for them. Facial recognition systems trained primarily on lighter-skinned faces show 10–34% higher error rates on darker-skinned faces (Buolamwini & Gebru, 2018, Gender Shades study).
  • Measurement bias: Proxy variables used as labels encode societal bias. Criminal recidivism prediction using prior arrests treats "arrested" as a proxy for "criminal" despite differential policing rates.
  • Aggregation bias: Training a single model on heterogeneous subpopulations may produce a model that performs well on average but poorly for specific groups.
  • Deployment bias: A model deployed in a context different from where it was validated may produce different outcomes: a medical imaging model validated on US hospital data may perform poorly on different scanner models or patient demographics.

High-Stakes Real-World Cases

  • COMPAS recidivism (ProPublica 2016): Risk scores used in US criminal sentencing showed Black defendants nearly twice as likely as white defendants to be falsely labeled high-risk for recidivism (45% vs 24% false positive rate)
  • Amazon hiring algorithm (2018): ML system trained on historical hiring data penalized resumes containing the word "women's" (as in "women's chess club") and downgraded graduates of all-women's colleges; the project was abandoned
  • Healthcare allocation (Obermeyer et al., 2019): A widely-used health risk scoring algorithm (serving ~200M US patients) systematically assigned lower risk scores to Black patients than equally sick white patients because it used healthcare spending as a proxy for health needs: Black patients have historically received less care for the same conditions
  • Face recognition in law enforcement (2020): Multiple wrongful arrests in the US based on face recognition misidentification, with all documented cases involving Black individuals
  • Gender bias in LLMs (2023–2024): Studies showed GPT-4 and similar LLMs associate professional roles with specific genders, producing stereotyped outputs for occupational descriptions even in languages with gender-neutral pronouns

⚖ Fairness Definitions & the Impossibility Theorem

"Fairness" is not a single concept: there are dozens of mathematical definitions, many mutually incompatible. The Kleinberg-Mullainathan-Raghavan impossibility theorem (2016) proves that three natural fairness criteria (calibration, balance for the positive class, and balance for the negative class) cannot all be satisfied simultaneously when base rates differ between groups. This means any deployed system optimizes for some fairness metrics at the expense of others; the choice is an ethical and policy decision, not a purely technical one.

  • Demographic parity: Equal positive prediction rates across groups regardless of true outcome. Formula: P(Ŷ=1|A=0) = P(Ŷ=1|A=1). Use when selection rates should be equal regardless of historical differences (e.g., hiring quotas).
  • Equalized odds: Equal TPR and FPR across groups, i.e., the same performance on both true positives and true negatives. Formula: P(Ŷ=1|Y=y,A=0) = P(Ŷ=1|Y=y,A=1) for y∈{0,1}. Use when error rates should be equal across groups (recidivism, medical screening).
  • Equal opportunity: Equal TPR across groups, so qualified individuals in all groups are equally likely to be selected. Formula: P(Ŷ=1|Y=1,A=0) = P(Ŷ=1|Y=1,A=1). Use when false negatives are most harmful (credit approval for creditworthy applicants).
  • Calibration: Predicted probabilities match actual outcome frequencies within each group. Formula: P(Y=1|Ŷ=p,A=a) = p for all a. Use when score interpretability is critical (clinical risk scores, actuarial tools).
  • Individual fairness: Similar individuals receive similar predictions; requires defining "similarity". Formula: d(x,x′) ≤ ε ⟹ |f(x) − f(x′)| ≤ L·ε. Use when individual-level justice is paramount; requires a domain-specific similarity metric.
  • Counterfactual fairness: The prediction would not change if the protected attribute had been different in a causal model of the data. Formula: P(Ŷ_{A←a} = y | X=x, A=a) = P(Ŷ_{A←a′} = y | X=x, A=a) for all a′. Use for causal reasoning about discrimination; requires a causal graph of the data-generating process.
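Group-level definitions like demographic parity and equal opportunity reduce to simple conditional rates, so they can be checked without any library. The sketch below is a minimal from-scratch illustration (hypothetical function names and toy data, not a library API), assuming binary labels, binary predictions, and a binary protected attribute A:

```python
# Minimal fairness-metric checks (illustrative sketch, not a library API).
# y_true: ground-truth labels, y_pred: model decisions, A: protected attribute.

def rate(values):
    return sum(values) / len(values) if values else 0.0

def demographic_parity_diff(y_pred, A):
    # P(Yhat=1 | A=1) - P(Yhat=1 | A=0); zero under demographic parity
    return rate([p for p, a in zip(y_pred, A) if a == 1]) - \
           rate([p for p, a in zip(y_pred, A) if a == 0])

def true_positive_rate(y_true, y_pred, A, group):
    # fraction of actual positives in `group` that the model selects
    return rate([p for t, p, a in zip(y_true, y_pred, A) if a == group and t == 1])

def equal_opportunity_diff(y_true, y_pred, A):
    # TPR(A=1) - TPR(A=0); zero under equal opportunity
    return true_positive_rate(y_true, y_pred, A, 1) - \
           true_positive_rate(y_true, y_pred, A, 0)

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
A      = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_diff(y_pred, A))        # selection-rate gap between groups
print(equal_opportunity_diff(y_true, y_pred, A)) # TPR gap between groups
```

Toolkits such as AIF360 and Fairlearn compute the same quantities, plus confidence intervals and many more metrics, but the underlying arithmetic is exactly these conditional rates.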

The Impossibility Theorem in Practice

The COMPAS debate illustrates the impossibility theorem vividly. Northpointe (the developer) showed COMPAS was well-calibrated: when it assigned a risk score, that score accurately predicted recidivism rates equally across racial groups. ProPublica showed COMPAS violated equalized odds: Black defendants had a higher false positive rate (predicted high-risk but didn't reoffend). Both findings are correct simultaneously; they measure different things. When base recidivism rates differ between groups (which they do, due to socioeconomic factors), you mathematically cannot have both calibration and equalized odds. Choosing which to prioritize is a normative, political decision that must involve affected communities and domain experts.
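The arithmetic behind this tradeoff fits in a few lines. The sketch below (purely illustrative numbers, not COMPAS data) builds two groups that receive identically calibrated scores but have different base rates; applying the same threshold to those calibrated scores then yields very different false positive rates:

```python
# Calibration vs. equalized odds when base rates differ (worked toy example).
# Scores are calibrated identically in both groups: among people scored s,
# a fraction s actually reoffend.

def group(n_high, n_low):
    # n_high people scored 0.8 (80% reoffend), n_low scored 0.2 (20% reoffend)
    people = []
    people += [(0.8, 1)] * int(n_high * 0.8) + [(0.8, 0)] * int(n_high * 0.2)
    people += [(0.2, 1)] * int(n_low * 0.2) + [(0.2, 0)] * int(n_low * 0.8)
    return people  # list of (score, outcome) pairs

def false_positive_rate(people, threshold=0.5):
    # among non-reoffenders, the fraction flagged as high-risk
    negatives = [(s, y) for s, y in people if y == 0]
    flagged = [1 for s, y in negatives if s >= threshold]
    return len(flagged) / len(negatives)

group_a = group(n_high=20, n_low=80)   # base rate 0.32
group_b = group(n_high=60, n_low=40)   # base rate 0.56

# Same calibrated scores, same threshold, yet FPRs diverge sharply:
print(false_positive_rate(group_a))  # ~0.06
print(false_positive_rate(group_b))  # ~0.27
```

The higher-base-rate group has more of its members concentrated in the high-score bucket, so even a perfectly calibrated score produces more false positives there; that is the impossibility theorem in miniature.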

📊 Bias Detection & Measurement

AIF360 β€” AI Fairness 360 (IBM)

AIF360 is an open-source toolkit providing 70+ fairness metrics and 10+ debiasing algorithms. Key metrics computed: disparate impact ratio, statistical parity difference, equal opportunity difference, average odds difference, Theil index (individual fairness). Input: a dataset with feature matrix, label column, and protected attribute column. Output: a metrics report comparing fairness across demographic groups.

# AIF360 bias detection example
# (assumes an AIF360 installation; the small DataFrame below is illustrative —
#  AIF360 expects all columns, including the protected attribute, to be numeric)
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({
    'income':        [30, 80, 45, 90, 25, 70],
    'race':          [0, 1, 0, 1, 0, 1],    # 0 = unprivileged, 1 = privileged
    'loan_approved': [0, 1, 1, 1, 0, 1],
})

dataset = BinaryLabelDataset(
    df=df,
    label_names=['loan_approved'],
    protected_attribute_names=['race'],
    favorable_label=1,
    unfavorable_label=0
)

metric = BinaryLabelDatasetMetric(
    dataset,
    unprivileged_groups=[{'race': 0}],
    privileged_groups=[{'race': 1}]
)

# Disparate impact < 0.8 fails the US 80% rule
print(f"Disparate Impact: {metric.disparate_impact():.3f}")
print(f"Statistical Parity Diff: {metric.statistical_parity_difference():.3f}")

Fairlearn (Microsoft)

Fairlearn focuses on classification and regression fairness, providing constraint-based in-processing (ExponentiatedGradient, GridSearch) and post-processing methods (ThresholdOptimizer). The Fairlearn dashboard (integrated with Azure ML) visualizes performance vs. fairness tradeoff curves across different mitigation approaches, helping practitioners choose the operating point that best meets their fairness requirements while maintaining acceptable accuracy. Compatible with scikit-learn API.

Disparate Impact & the 80% Rule

The US EEOC "four-fifths rule" (1978 Uniform Guidelines on Employee Selection Procedures) states that a selection rate for any protected group that is less than 80% of the rate for the group with the highest selection rate creates adverse impact. This gives a simple threshold for deployment decisions: disparate impact ratio = (selection rate for unprivileged group) / (selection rate for privileged group). A ratio below 0.8 requires justification under US employment law. The EU Employment Equality Directive uses different tests but similarly requires justification for statistically significant disparate outcomes.
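The disparate impact ratio described above is a one-line computation; the sketch below shows it from scratch on hypothetical selection decisions (illustrative data, not a legal test):

```python
# Four-fifths (80%) rule check: ratio of selection rates between groups.
# Decisions are 1 = selected (hired/approved), 0 = not selected.

def selection_rate(decisions):
    return sum(decisions) / len(decisions)

def disparate_impact(unprivileged, privileged):
    # (selection rate for unprivileged group) / (selection rate for privileged group)
    return selection_rate(unprivileged) / selection_rate(privileged)

unprivileged = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # 20% selected
privileged   = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]   # 50% selected

ratio = disparate_impact(unprivileged, privileged)
print(f"Disparate impact ratio: {ratio:.2f}")   # 0.40
print("Adverse impact" if ratio < 0.8 else "Passes 80% rule")
```

Note that passing the 80% rule is necessary but not sufficient: it measures only selection rates (demographic parity), not error-rate fairness such as equalized odds.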

SHAP for Fairness Analysis

SHAP (SHapley Additive exPlanations) enables individual-level fairness analysis by explaining each prediction in terms of feature contributions. For fairness auditing: plot SHAP values stratified by demographic group; if a protected attribute (or a proxy like zip code) has a high SHAP contribution, this indicates reliance on potentially discriminatory features. Intersectional fairness analysis with SHAP reveals whether harm compounds for individuals at the intersection of multiple disadvantaged groups (e.g., Black women vs. Black people or women separately). Audit testing (sending identical feature vectors with only the protected attribute changed) combined with SHAP can reveal differential treatment.
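The flip-test audit mentioned above needs no explanation library at all: probe the model twice with inputs that differ only in the protected attribute and compare outputs. A minimal sketch, with a hypothetical scoring function standing in for the black-box system under audit (all names and coefficients are made up for illustration):

```python
# Counterfactual flip-test audit (sketch with a hypothetical model).
# A nonzero gap indicates direct differential treatment by the protected attribute.

def biased_score(features):
    # Hypothetical black-box model under audit; it (improperly) penalizes group B.
    score = 0.5 + 0.3 * features["income_norm"]
    if features["group"] == "B":
        score -= 0.1
    return score

def flip_test(model, features, attr="group", values=("A", "B")):
    # Score identical feature vectors that differ only in `attr`.
    outputs = [model(dict(features, **{attr: v})) for v in values]
    return outputs[0] - outputs[1]   # gap attributable to the protected attribute

gap = flip_test(biased_score, {"income_norm": 0.4, "group": "A"})
print(f"Score gap from flipping protected attribute: {gap:.2f}")
```

A caveat: a zero flip-test gap does not prove fairness, since models can discriminate through proxies (zip code, names, shopping history) rather than the protected attribute itself; that is where SHAP-style proxy analysis complements audit probes.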

🛠 Debiasing Techniques

Pre-processing (modify training data)
  • Reweighting: Assigns higher training weights to underrepresented group–label combinations. Tradeoff: minimal accuracy loss; doesn't change the data distribution. Tools: AIF360 Reweighing.
  • Resampling / oversampling: Oversamples minority-group examples or undersamples the majority to balance representation. Tradeoff: may cause overfitting on the minority class; changes dataset size. Tools: SMOTE, AIF360.
  • Disparate Impact Remover: Transforms feature distributions to reduce correlation with the protected attribute while preserving rank-ordering within groups. Tradeoff: distorts feature relationships; may not generalize. Tools: AIF360.

In-processing (modify training)
  • Adversarial debiasing: Trains the model to predict the outcome while an adversary simultaneously tries to predict the protected attribute from the model's intermediate representations; the main model learns representations that are uninformative about the protected attribute. Tradeoff: more complex training; requires careful adversary balancing. Tools: AIF360 AdversarialDebiasing, Fairlearn.
  • Fairness constraints (exponentiated gradient): Optimizes the model subject to explicit fairness constraints (e.g., equalized-odds violation ≤ ε) using Lagrangian relaxation or gradient reduction. Tradeoff: direct control over the fairness metric, but computationally expensive and may not converge. Tools: Fairlearn ExponentiatedGradient.

Post-processing (modify predictions)
  • Equalized odds post-processing: Finds per-group thresholds that equalize TPR and FPR between groups by solving a linear program on validation data. Tradeoff: no retraining needed, but requires demographic information at inference time. Tools: AIF360 EqOddsPostprocessing.
  • Calibrated equalized odds: Post-processes using calibrated group-specific thresholds, balancing fairness and calibration better than raw threshold equalization. Tradeoff: may violate calibration for individuals; requires demographic data at inference. Tools: AIF360 CalibratedEqOdds.
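As a concrete example of the pre-processing stage, the reweighting idea can be sketched directly from its published formula, w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y): combinations that are rarer than independence would predict get upweighted. This is a from-scratch illustration of the formula, not the AIF360 API:

```python
# Reweighting sketch: instance weights that make the protected attribute
# statistically independent of the label in the weighted training data.
from collections import Counter

def reweighing_weights(A, Y):
    n = len(A)
    p_a = Counter(A)             # marginal counts of protected attribute
    p_y = Counter(Y)             # marginal counts of label
    p_ay = Counter(zip(A, Y))    # joint counts
    # w(a, y) = P(a) * P(y) / P(a, y)
    return {(a, y): (p_a[a] / n) * (p_y[y] / n) / (p_ay[(a, y)] / n)
            for (a, y) in p_ay}

# Unprivileged group (A=0) rarely receives the favorable label (Y=1):
A = [0, 0, 0, 0, 1, 1, 1, 1]
Y = [1, 0, 0, 0, 1, 1, 1, 0]
weights = reweighing_weights(A, Y)
print(weights[(0, 1)])  # > 1: rare favorable outcomes for A=0 are upweighted
print(weights[(1, 1)])  # < 1: overrepresented combinations are downweighted
```

Training with these per-instance weights equalizes the weighted favorable-label rate across groups without altering a single feature value, which is why reweighting is often the least invasive mitigation to try first.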

📄 AI Governance & Regulation

EU AI Act β€” High-Risk Requirements

The EU AI Act (adopted 2024) classifies AI systems used in employment, credit, education, healthcare, and law enforcement as high-risk. Requirements: training data governance documentation, fundamental rights impact assessment, technical documentation, human oversight mechanisms, accuracy and robustness requirements, and transparency to affected persons. Specifically for bias: demonstrate datasets are representative of the intended use population and measure accuracy across demographic groups. High-risk systems must be registered in an EU database before deployment. Non-compliance: fines up to €35 million or 7% of global annual turnover for the most serious violations.

NIST AI RMF & Model Cards

The NIST AI Risk Management Framework (AI RMF 1.0, 2023) addresses bias under the GOVERN, MAP, MEASURE, and MANAGE functions. Key guidance: document intended use population and limitations, measure performance across demographic subgroups, establish feedback mechanisms for identifying in-deployment bias, and provide model cards. Model Cards (Mitchell et al., 2019) are standardized documentation templates for ML models that include: intended use, factors (demographic, environmental), metrics per subgroup, evaluation data, training data, quantitative analyses, and ethical considerations. Google, Hugging Face, and major cloud providers now require model cards for published models.
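A model card need not be elaborate to be useful. The sketch below shows a minimal card-like structure following the fields from Mitchell et al. (2019), with per-subgroup metrics as first-class data so audits can be automated; every field value here is a hypothetical placeholder:

```python
# Minimal model-card skeleton with per-subgroup metrics (all values are
# hypothetical placeholders, not measurements from a real system).
model_card = {
    "model_details": {"name": "loan-approval-v2", "date": "2025-01"},
    "intended_use": "Pre-screening of consumer loan applications; "
                    "not for fully automated final decisions.",
    "factors": ["race", "sex", "age_band"],
    "metrics": {   # reported per subgroup, not only in aggregate
        "overall":           {"accuracy": 0.91, "fpr": 0.08},
        "race=privileged":   {"accuracy": 0.93, "fpr": 0.06},
        "race=unprivileged": {"accuracy": 0.86, "fpr": 0.14},
    },
    "ethical_considerations": "FPR gap across race groups; see mitigation plan.",
}

# Machine-readable cards let monitoring jobs flag fairness regressions directly:
gap = (model_card["metrics"]["race=unprivileged"]["fpr"]
       - model_card["metrics"]["race=privileged"]["fpr"])
print(f"FPR gap: {gap:.2f}")
```

Keeping the card machine-readable (rather than prose-only) is what makes the NIST MEASURE and MANAGE functions operational: the same per-subgroup numbers feed dashboards, release gates, and drift alerts.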

US & EU Anti-Discrimination Law

US law applies multiple anti-discrimination frameworks to AI systems. EEOC Title VII and ADEA: algorithmic hiring tools must comply with disparate impact doctrine, under which statistical evidence of discriminatory outcomes can establish liability even without discriminatory intent. CFPB ECOA and Fair Housing Act: ML credit scoring models must provide adverse action reasons and cannot perpetuate redlining. EEOC AI guidance (2023): employer responsibility extends to third-party AI tools, so deploying a biased vendor tool does not absolve the employer. EU Employment Equality Directive, GDPR Article 22, and Digital Services Act create additional frameworks for algorithmic accountability in the EU.

Bias Mitigation is Not Just a Technical Problem

Every debiasing technique embeds normative choices: which fairness metric to optimize, which groups to protect, what accuracy-fairness tradeoff is acceptable. These are not decisions that technical teams should make alone. Effective bias mitigation requires: (1) Domain expertise about where bias manifests and its consequences in the specific application context. (2) Stakeholder engagement: affected communities should have input on what fairness means in their context. (3) Legal review: which metrics matter for regulatory compliance. (4) Ongoing monitoring: models drift and new biases can emerge post-deployment. No toolkit automates the judgment required to deploy a fair AI system; tools help measure and mitigate what humans have defined as the relevant harms.