⏱ 12 min read 📊 Advanced 🗓 Updated Jan 2025

🔒 AI Privacy Threat Landscape

Machine learning models are not merely statistical summaries of their training data — they actively memorize portions of it. This memorization is not a bug but an emergent consequence of powerful models fitting their training distribution closely. The privacy consequences are severe: personal information, medical records, social security numbers, private communications, and proprietary code have all been extracted from production AI systems through careful probing. For organizations subject to GDPR, CCPA, HIPAA, or the EU AI Act, ML privacy leakage is not just a security issue — it is a regulatory compliance crisis.

The Memorization Problem

Carlini et al. (2021) demonstrated that GPT-2 memorizes and can reproduce verbatim training content including names, phone numbers, physical addresses, email addresses, and social security number-like sequences. The 2023 "Quantifying Memorization" study showed memorization scales with model size: larger models memorize significantly more, and examples that appear multiple times in training data are memorized at much higher rates. A single mention of a phone number in a training dataset rarely produces memorization; 10+ repetitions almost guarantees it.

PII Extraction · Scales with Model Size

Types of Privacy Leakage

  • Direct memorization: Model reproduces verbatim training text containing PII when prompted with a prefix
  • Indirect inference: Model reveals aggregate statistics that enable de-anonymization (e.g., knowing average salary for 3-person demographic group reveals individual salaries)
  • Attribute inference: Model predicts sensitive attributes (health status, political views, sexual orientation) about an individual from non-sensitive inputs, trained on data where the correlation exists
  • Membership inference: Confirming a specific record was in the training set (covered in Model Extraction module)
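The direct-memorization case above can be probed empirically: prompt the model with prefixes known to occur in training data and check whether the completion reproduces the sensitive suffix verbatim. A minimal sketch, where `generate` is a hypothetical stand-in for a real model API call (stubbed here so the example is self-contained):

```python
# Probe for direct memorization: prompt with a training-data prefix and
# check whether the model emits the associated secret suffix verbatim.

CANARIES = {
    # prefix believed to appear in training data -> secret to watch for
    "Contact John at phone number": "555-0142",
    "Patient presented with": None,  # no known secret for this prefix
}

def generate(prompt: str) -> str:
    # Stub standing in for a real model API; one prefix "leaks" by design.
    if prompt.startswith("Contact John at phone number"):
        return prompt + " 555-0142."  # simulated memorized completion
    return prompt + " [generic continuation]"

def leaked_canaries(generate_fn) -> list:
    """Return prefixes whose secret suffix appears in the completion."""
    leaks = []
    for prefix, secret in CANARIES.items():
        if secret is None:
            continue
        if secret in generate_fn(prefix):
            leaks.append(prefix)
    return leaks

leaks = leaked_canaries(generate)
```

In practice this "canary" methodology is run at scale with planted unique strings, so extraction rates can be measured quantitatively rather than anecdotally.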

Real-World Leakage Incidents

  • GPT-2 training data extraction (2021): Researchers extracted memorized training data, including real names and contact information, from production GPT-2 using only API access
  • Samsung ChatGPT leak (2023): Samsung engineers fed proprietary chip design code and internal meeting notes to ChatGPT; the data may have been used for model training per OpenAI's policies at the time
  • ChatGPT conversation leakage (2023): A bug briefly exposed some users' conversation titles and possibly message content to other users
  • GitHub Copilot secret reproduction (2023): Copilot reproduced verbatim API keys and secret tokens from public repositories in its training data
  • Medical LLM memorization (2024): Research showed clinical LLMs fine-tuned on patient notes could reproduce patient-specific details when prompted with partial clinical notes

⚠ Training Data Privacy Risks

Web Scraping & PII Ingestion

Large language models are trained on massive web crawls (Common Crawl, C4, The Pile) that indiscriminately include personal information: forum posts with real names, social media profiles, court records, obituaries, medical information shared in support groups, and contact directories. Despite best-effort filtering, PII routinely survives into production training sets. The scale makes comprehensive PII removal effectively impossible — Common Crawl processes petabytes of web content monthly.

Medical & Financial Data Risks

Domain-specific models trained on sensitive datasets pose compounded risk. A clinical NLP model trained on EHR data may memorize specific patient details — medication names, diagnoses, unusual clinical presentations. A fraud detection model trained on transaction history may leak account patterns. Under HIPAA, "de-identified" data must have 18 specific identifiers removed — but ML models can re-identify individuals from supposedly de-identified records when combined with auxiliary information or when queried with known attributes.

EU AI Act & Copyright Implications

The EU AI Act (effective 2024–2026) introduces transparency requirements for general-purpose AI models: providers must publish summaries of training data, honor copyright opt-outs, and implement technical measures to comply with GDPR for any personal data in training sets. Training on personal data without a lawful basis under GDPR Article 6 exposes developers to fines up to 4% of global annual turnover. The legal status of web-scraped training data is under active litigation in the EU and US as of 2025.

Data You Feed to an LLM API May Be Used for Training

Always review the data usage policies of any AI API provider before sending sensitive data. As of 2024–2025: OpenAI's API does not use inputs for training by default (opt-in only), but their consumer ChatGPT product historically used conversations for training unless opted out. AWS Bedrock and Azure OpenAI Service contractually commit to not using customer data for training. Self-hosted open-source models (Llama, Mistral) process data locally with no third-party exposure. The Samsung incident (2023) occurred when employees used the consumer ChatGPT interface, not the enterprise API — a distinction that must be communicated to all staff.

📊 Differential Privacy for ML

Differential privacy (DP) provides a mathematical guarantee about privacy: the output of a DP algorithm reveals essentially the same information whether or not any single individual's data was included. For ML, this means training a DP model provides a provable bound on how much any individual training example can influence the model — limiting memorization by construction.

The DP Guarantee

Formally, a randomized mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' differing by one record, and any output set S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ

ε = privacy budget (smaller = more private)
δ = failure probability (typically 10⁻⁵ to 10⁻⁶)

Interpretation: the model's outputs are almost
indistinguishable whether Alice's data was included
or not, up to multiplicative factor e^ε.

Smaller ε = stronger privacy guarantee = less useful model. Typical ε values in practice: 1–10 for ML (much looser than ε=0.1 used in academic DP work).
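The guarantee is easiest to see with the classic Laplace mechanism on a counting query: adding Laplace noise with scale sensitivity/ε satisfies (ε, 0)-DP, because adding or removing one record changes a count by at most 1. A self-contained, stdlib-only sketch for illustration (not production DP code):

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """(epsilon, 0)-DP count: one record changes the count by at most 1,
    so sensitivity = 1 and the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [34, 29, 41, 52, 38, 27, 45]
noisy = dp_count(ages, lambda a: a > 30, epsilon=1.0, rng=rng)
```

Smaller ε means a larger noise scale, which is exactly the privacy-utility tradeoff discussed below: each individual query answer gets noisier as the guarantee gets stronger.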

DP-SGD Implementation

Differentially Private SGD (Abadi et al., 2016) adds DP to neural network training through two mechanisms: (1) gradient clipping — clip each per-example gradient to L2 norm C, preventing any single example from dominating the update, (2) Gaussian noise — add N(0, σ²C²I) noise to the summed gradient, providing formal DP guarantees. The privacy accountant (Rényi DP, PRV accountant) tracks cumulative privacy budget across all training steps. Available implementations: TensorFlow Privacy (Google), Opacus (Meta/PyTorch).

# Opacus (PyTorch) — minimal example
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    epochs=EPOCHS,
    target_epsilon=5.0,    # ε budget
    target_delta=1e-5,     # δ failure prob
    max_grad_norm=1.0,     # clipping norm C
)
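Under the hood, each DP-SGD step clips every per-example gradient to L2 norm C, sums the clipped gradients, and adds Gaussian noise scaled to C. A from-scratch sketch of just the gradient-processing step, with toy gradients as plain lists (an illustration of the mechanism, not a training loop):

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale one example's gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_aggregate(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum them, add N(0, (sigma*C)^2)
    noise per coordinate, then average over the batch."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim, n = len(clipped[0]), len(clipped)
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [x / n for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, -0.2], [-6.0, 8.0]]  # toy per-example gradients
update = dp_sgd_aggregate(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds each example's influence on the update; the noise then masks whatever influence remains, which is what the privacy accountant turns into an (ε, δ) statement.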

Privacy-Utility Tradeoff

Adding DP noise degrades model accuracy — the fundamental cost of privacy. The tradeoff depends on dataset size, model architecture, epsilon, and task difficulty. Larger datasets tolerate stronger privacy (smaller ε) with less accuracy loss — DP "amortizes" across more examples. Models trained with DP need more computation to reach the same accuracy as non-DP models. Fine-tuning (starting from a pre-trained model) with DP is much more efficient than training from scratch with DP.

| ε Value | Privacy Level | Typical Accuracy Impact | Use Case |
|---|---|---|---|
| 0.1 – 1.0 | Very strong — textbook DP | Severe (5–20% accuracy loss) | Medical/clinical records, census data — highest sensitivity |
| 1.0 – 3.0 | Strong | Moderate (2–8% accuracy loss) | Financial records, HR data — high sensitivity |
| 3.0 – 8.0 | Moderate — practical range | Minor (1–4% accuracy loss) | Enterprise customer data, user behavior — medium sensitivity |
| 8.0 – 10.0 | Weak — still better than no DP | Minimal (<2% accuracy loss) | Public/aggregate statistics — lower sensitivity |
| >10 | Very weak — marginal protection | Negligible | Rarely justified — limited meaningful protection |

🌐 Federated Learning & Privacy

Federated Learning (FL) enables training ML models on distributed private data without centralizing the raw data. Each participating device trains locally on its data and shares only model updates (gradients) with a central server. The server aggregates updates using FedAvg or similar algorithms and distributes the updated global model. The promise: model utility without data exposure. The reality: FL introduces new privacy and security challenges.

How FL Works

Standard Federated Averaging (FedAvg) protocol: (1) Server distributes current global model to a sample of clients. (2) Each client trains on its local data for E epochs, computing local model updates. (3) Clients send model updates (gradients or weight deltas) to the server. (4) Server aggregates updates (weighted average by dataset size) to produce new global model. (5) Repeat. Real deployments: Google Gboard (keyboard prediction), Apple's on-device personalization, healthcare consortium models. FL operates over millions of clients in production.
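The aggregation step (4) is just a dataset-size-weighted average of the client updates. A toy sketch with model weights flattened to plain lists and hypothetical client sizes:

```python
def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: weighted average of client model weights,
    weighted by each client's local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Three clients with different amounts of local data (hypothetical values)
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_model = fedavg(clients, sizes)
```

The weighting matters: a client with twice the data pulls the global model twice as hard, which is also why a single malicious client with an inflated reported size is a known poisoning vector in FL.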

Gradient Inversion Attacks

Zhu et al. (2019) demonstrated that shared gradients can be used to reconstruct the original training batch nearly perfectly — a complete failure of FL's privacy promise. The attack (Deep Leakage from Gradients, DLG) optimizes dummy inputs until their gradient matches the observed gradient. Follow-up attacks extended reconstruction to larger batches and higher-resolution data, and even gradient compression and quantization do not reliably prevent it. The primary mitigation is Secure Aggregation (SecAgg) — a cryptographic protocol that ensures the server only sees the aggregate gradient, never individual clients' updates — but SecAgg adds significant communication and computation overhead.
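The core idea of SecAgg can be illustrated with pairwise masking: each client pair (i, j) agrees on a random mask that client i adds and client j subtracts, so every individual upload looks random while the masks cancel exactly in the sum. A toy single-round sketch (real SecAgg also needs key agreement and dropout recovery, omitted here):

```python
import random

def mask_updates(updates, rng):
    """Add pairwise cancelling masks: the server sees only random-looking
    vectors, but their sum equals the sum of the true updates."""
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(dim):
                m = rng.uniform(-100, 100)  # shared mask for pair (i, j)
                masked[i][k] += m
                masked[j][k] -= m
    return masked

rng = random.Random(42)
updates = [[0.5, -1.0], [2.0, 0.25], [-0.5, 1.0]]  # toy client updates
masked = mask_updates(updates, rng)
aggregate = [sum(col) for col in zip(*masked)]  # masks cancel in the sum
```

This is why SecAgg defeats single-client gradient inversion: the server never observes any `updates[i]`, only `aggregate`, and inverting an aggregate over many clients is far harder than inverting one client's gradient.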

Privacy Amplification by Subsampling

FL with DP provides stronger privacy guarantees through "privacy amplification by subsampling": when only a random subset of clients participates in each round, the privacy cost of that round shrinks roughly in proportion to the sampling probability. If each client is sampled with probability q per round, an ε-DP step costs approximately q·ε rather than ε (for small ε). Combined with DP-SGD on-device, this allows FL systems with large client populations to achieve meaningful DP guarantees (ε ≈ 1–3) with minimal accuracy impact.
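A quick check of the numbers, using the standard amplification bound for Poisson subsampling, ε' = ln(1 + q·(e^ε − 1)), which reduces to ≈ q·ε for small ε (production systems use tighter Rényi-DP accounting, so treat this as a back-of-envelope sketch):

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Per-round epsilon after sampling each client with probability q,
    via the standard subsampling amplification bound."""
    return math.log(1 + q * (math.exp(eps) - 1))

eps, q = 2.0, 0.01            # per-round epsilon, client sampling rate
eps_amp = amplified_epsilon(eps, q)   # roughly 0.06 instead of 2.0
```

With q = 0.01 the per-round cost drops by over an order of magnitude, which is what makes thousands of training rounds affordable within a total budget of ε ≈ 1–3.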

🛡 Privacy-Preserving Practices

Data Minimization Before Training

Audit and minimize PII in training datasets before model training begins. Tools: Microsoft Presidio (rule-based + ML PII detector for 50+ entity types), spaCy NER (named entity recognition for identifying persons, organizations, locations), AWS Comprehend entity detection, Google Cloud DLP. Apply redaction or replacement (e.g., replace "John Smith" with "[PERSON]") rather than deletion to preserve linguistic patterns. Document what data types were found and how they were handled — required for GDPR Article 30 Records of Processing Activities.
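The replace-with-placeholder approach can be sketched with a few regexes standing in for a real detector like Presidio (regexes miss names and context-dependent PII, so this is illustration only, not a substitute for an NER-based pipeline):

```python
import re

# Simplified regex stand-ins for a real PII detector such as Presidio.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str):
    """Replace detected PII with typed placeholders; return counts so the
    handling can be documented for the processing records."""
    counts = {}
    for placeholder, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(placeholder, text)
        if n:
            counts[placeholder] = n
    return text, counts

clean, found = redact(
    "Reach Jane at jane@example.com or 555-867-5309, SSN 123-45-6789."
)
```

Note that "Jane" survives untouched: person names need an NER model, which is exactly why the tools listed above combine rules with ML detection.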

GDPR Compliance · Data Governance

Synthetic Data as a Substitute

Generate synthetic training data statistically similar to real data but not derived from any real individual. Tools: Synthpop (R), SDV (Synthetic Data Vault, Python), CTGAN and TVAE for tabular data, GretelAI and mostly.ai for enterprise use. Synthetic data eliminates direct privacy risk but may introduce distributional shift and can inherit biases from the original data. Fidelity evaluation: compare marginal distributions, correlation matrices, and ML utility metrics between real and synthetic data. Privacy evaluation: run membership inference attacks against models trained on synthetic data.
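A first-pass fidelity check compares simple marginal statistics between real and synthetic columns. A stdlib sketch with hypothetical values (a real evaluation should also compare correlation matrices and downstream ML utility, as noted above):

```python
import statistics

def marginal_fidelity(real, synth):
    """Compare mean and standard deviation of one real vs. synthetic column."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synth)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synth)),
    }

real_ages = [34, 29, 41, 52, 38, 27, 45]
synth_ages = [33, 31, 40, 50, 39, 26, 47]  # hypothetical synthetic column
fidelity = marginal_fidelity(real_ages, synth_ages)
```

Small gaps here are necessary but not sufficient: a synthetic dataset can match every marginal while still leaking (if it copies real records) or misleading (if it breaks cross-column correlations), hence the membership-inference check above.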

Zero Raw Data Risk · Fidelity Tradeoff

Output Filtering for PII

Post-generation filtering detects and redacts PII in LLM outputs before they reach users. This is a last line of defense — it doesn't prevent the model from memorizing PII, but it prevents extraction through normal use. Implement: run Presidio or a custom NER model on all LLM outputs; redact or replace identified entities; log all filtering events for audit. Limitations: sophisticated extraction attempts (asking the model to rephrase, paraphrase, or translate content) may evade surface-level PII detectors. Output filtering is necessary but not sufficient — combine with DP training and data minimization.
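At inference time the detector wraps the model call itself: every output is scanned, redacted, and the event logged before anything reaches the user. A sketch of such a wrapper around a hypothetical `llm_call` function (a single SSN regex stands in for Presidio or a custom NER model; the stub deliberately "leaks" so the filter has something to catch):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
audit_log = []  # filtering events, retained for audit

def llm_call(prompt: str) -> str:
    # Hypothetical model call, stubbed with a leaky response for illustration.
    return "Sure: the record shows SSN 123-45-6789 for that patient."

def filtered_llm_call(prompt: str) -> str:
    """Run PII detection on the raw model output, redact matches,
    and log every filtering event."""
    raw = llm_call(prompt)
    redacted, n = SSN_RE.subn("[REDACTED-SSN]", raw)
    if n:
        audit_log.append({"prompt": prompt, "redactions": n})
    return redacted

safe = filtered_llm_call("What is the patient's SSN?")
```

As the limitations above note, this catches verbatim leaks but not paraphrased or translated ones, so it belongs behind DP training and data minimization, not in place of them.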

Defense in Depth · Bypassable Alone

EU AI Act High-Risk System Requirements

The EU AI Act classifies ML systems in sensitive domains (employment, credit, education, healthcare, biometrics, law enforcement) as "high-risk," triggering mandatory requirements: data governance and training data documentation, detailed technical documentation and record-keeping, transparency to users, human oversight mechanisms, robustness and accuracy requirements, and registration in an EU database. For training data specifically: demonstrate the data used is relevant, representative, free from errors "to the extent possible," and that appropriate data governance practices were applied — including privacy measures proportional to the personal data involved.

| Practice | Privacy Risk Addressed | Tool / Framework | GDPR Article |
|---|---|---|---|
| PII detection & redaction pre-training | Direct memorization of PII | Presidio, spaCy, AWS Comprehend | Art. 5 (data minimization), Art. 25 (privacy by design) |
| Differential privacy (DP-SGD) | Memorization, membership inference | Opacus, TensorFlow Privacy | Art. 25, Art. 32 (security of processing) |
| Synthetic data generation | All PII risks (indirect replacement) | SDV, Gretel, mostly.ai | Art. 5, Art. 6 (lawful basis) |
| Output PII filtering | Extraction at inference time | Presidio, custom NER | Art. 32 |
| Federated learning + SecAgg | Raw data exposure during training | TensorFlow Federated, PySyft | Art. 25, Art. 32 |
| Privacy impact assessment | Systematic risk identification | DPIA template (CNIL, ICO) | Art. 35 (mandatory for high-risk) |
| MLflow / DVC audit trails | Accountability & data lineage | MLflow, DVC | Art. 30 (records of processing) |

Conduct a Data Privacy Audit Before Every Training Run

Before initiating any training run on data that may contain personal information, complete a structured audit: (1) Identify all data sources and their PII content. (2) Verify the lawful basis under GDPR Art. 6 for each data source. (3) Apply PII detection and document what was found and how handled. (4) Determine if a DPIA (Data Protection Impact Assessment) is required under Art. 35. (5) Select and apply appropriate privacy-preserving techniques (DP-SGD, synthetic data, or data minimization). (6) Document the audit in your Records of Processing Activity. This process protects your organization legally, identifies risks early when they are cheapest to fix, and builds the documentation trail required by regulators.