Adversarial Machine Learning
12 min read · Advanced · Updated Jan 2025

⚠ The Adversarial Threat Landscape

Adversarial machine learning is the study of attacks that intentionally manipulate ML systems to produce incorrect outputs, and the defenses against those attacks. Unlike traditional software bugs, adversarial vulnerabilities are often fundamental to how neural networks generalize, making them extraordinarily difficult to fully eliminate. The field emerged prominently in 2013 with Szegedy et al.'s discovery that imperceptibly small pixel perturbations could reliably flip a neural network's classification.

Attacker Knowledge Model

The threat model defines how much an attacker knows about the target system:

  • White-box: Full access to model architecture, weights, and gradients. Attacker can compute exact adversarial examples. Represents an insider or model leak scenario.
  • Grey-box: Knows architecture but not weights, or has partial information. Attacks rely on surrogate models or limited gradient estimates.
  • Black-box: Only query access; the attacker sees inputs and outputs (possibly confidence scores). Most realistic for cloud ML APIs. Relies on transferability and finite-difference gradient estimation.
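
The finite-difference gradient estimation mentioned for the black-box setting can be sketched in a few lines: query the target's score function twice per input dimension and difference the results. The linear scoring function below is a stand-in for a remote API, used purely for illustration:

```python
import numpy as np

def estimate_gradient(f, x, sigma=1e-3):
    """Estimate the gradient of a query-only score function f at x using
    central finite differences: two queries per input dimension."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = sigma
        grad.flat[i] = (f(x + e) - f(x - e)) / (2.0 * sigma)
    return grad

# Stand-in for a remote scoring API the attacker can only query.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)

x0 = np.ones(3)
print(np.allclose(estimate_gradient(f, x0), w, atol=1e-6))  # True for a linear score
```

Real score-based attacks (e.g. NES-style estimators) sample random directions instead of probing every dimension, since image inputs make per-dimension probing prohibitively query-hungry.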

Attack Goals

Adversarial attacks differ by what the attacker wants to achieve:

  • Untargeted: Cause any misclassification. Easier to achieve; sufficient for bypassing content moderation or safety classifiers.
  • Targeted: Force a specific wrong prediction (e.g., classify a malware sample as benign, or misidentify a face as a specific person). Harder but more useful to attackers.
  • Confidence reduction: Lower model confidence below a threshold to trigger a fallback path or human review; useful for evading fraud detection.
  • Availability attack: Degrade model accuracy across all inputs (poisoning) to render it unusable.

Why Neural Networks Are Vulnerable

Several theoretical explanations have been proposed:

  • Linearity hypothesis (Goodfellow et al., 2014): High-dimensional linear models accumulate small perturbations across many dimensions, making FGSM attacks possible.
  • High-dimensional geometry: In high dimensions, the volume near a decision boundary is enormous; tiny steps in the input space can cross boundaries.
  • Excessive feature reliance: Ilyas et al. (2019) argue adversarial examples exploit "non-robust features" that are genuinely predictive but incomprehensible to humans.
  • Overparameterization: Models fit training data so tightly that they learn brittle shortcut features that don't generalize to perturbed inputs.

Real-World Incidents & Research

Year | Incident / Research | Impact | Type
-----|---------------------|--------|-----
2017 | Physical adversarial stop signs (Eykholt et al., UMich) | Stop sign misclassified as Speed Limit 45 at 100% rate under various conditions | Evasion (physical world)
2016 | Fooling face recognition with printed glasses (Sharif et al., CMU) | Impersonated specific target individuals with physical props | Targeted evasion
2020 | Tesla speed-limit sign tape attack (McAfee research) | Small strip of tape on a 35 mph sign caused Tesla's camera system to read 85 mph | Physical evasion
2021 | Backdoor attacks on NLP models (hidden trigger research) | Sentiment classifiers flipped output on specific rare tokens | Trojan / backdoor
2022 | DALL-E/Stable Diffusion evasion | Adversarial prompts bypassed content filters in generative image models | Evasion (generative)
2023–24 | Adversarial patches in autonomous driving benchmarks | Printed patches caused object detection failures on Waymo/KITTI datasets | Physical evasion

⚡ Evasion Attacks

Evasion attacks craft malicious inputs at inference time: the model is already deployed, and the attacker wants to cause it to misclassify their specific input. This is distinct from poisoning, which attacks training. Evasion is the dominant threat for deployed classifiers in security tooling, content moderation, and autonomous systems.

FGSM β€” Fast Gradient Sign Method

Introduced by Goodfellow et al. (2014), FGSM is the simplest white-box evasion attack. It perturbs every input feature in the direction that maximally increases the loss, scaled by epsilon:

x_adv = x + ε · sign(∇_x J(θ, x, y))

Where:
  x       = original input
  ε       = perturbation budget (e.g. 8/255 for images)
  ∇_x J   = gradient of loss w.r.t. input
  sign(·) = element-wise sign function
  y       = true label

FGSM is a single-step attack. It's fast but not the strongest; it's primarily used as a baseline and for adversarial training data generation. Epsilon controls the trade-off between attack strength and human perceptibility.
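
A minimal FGSM sketch against a toy logistic model (not any particular library's API); the weights and input are made-up, and the closed-form input gradient used below follows from the logistic loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step against a toy logistic model p(y=+1|x) = sigmoid(w @ x).
    For the loss J = -log sigmoid(y * w @ x) with labels y in {-1, +1},
    the input gradient is -y * sigmoid(-y * w @ x) * w."""
    grad = -y * sigmoid(-y * (w @ x)) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])   # hypothetical trained weights
x = np.array([0.3, -0.2, 0.1])   # w @ x = 0.75, so the model predicts +1
x_adv = fgsm(x, y=1, w=w, eps=0.4)
print(np.sign(w @ x))      # 1.0  (original prediction)
print(np.sign(w @ x_adv))  # -1.0 (prediction flipped by the perturbation)
```

For a deep network the gradient comes from backpropagation rather than a closed form, but the single signed step is identical.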

PGD β€” Projected Gradient Descent

Madry et al. (2018) showed that iterating FGSM with projection onto the L∞ epsilon-ball produces much stronger adversarial examples:

x^(t+1) = Π_{x+S}(x^t + α · sign(∇_x J(θ, x^t, y)))

Where:
  Π = projection onto the ε-ball
  α = step size per iteration
  S = allowed perturbation set
  t = iteration number

PGD with random restarts is considered the strongest first-order attack and the gold standard for adversarial training. Typically run for 20–100 steps. PGD-AT (adversarial training with PGD) remains the most reliable empirical defense.
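
The iterate-and-project loop can be sketched against a toy logistic model (illustrative weights and input, not a library API); projection onto the L∞ ball is just clipping:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd(x, y, w, eps, alpha, steps, seed=0):
    """PGD against a toy logistic model p(y=+1|x) = sigmoid(w @ x):
    start at a random point in the eps-ball, take signed-gradient steps
    of size alpha, and project back onto the L-inf ball after each step."""
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(steps):
        grad = -y * sigmoid(-y * (w @ x_adv)) * w     # input gradient of the loss
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)      # projection onto the eps-ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])   # hypothetical trained weights
x = np.array([0.3, -0.2, 0.1])   # classified +1 (w @ x = 0.75)
x_adv = pgd(x, y=1, w=w, eps=0.4, alpha=0.1, steps=20)
print(np.sign(w @ x_adv))                 # -1.0: prediction flipped
print(np.max(np.abs(x_adv - x)) <= 0.4)   # True: perturbation stays in the ball
```

Random restarts simply rerun the loop from several random starting points and keep the perturbation with the highest loss.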

C&W Attack

Carlini & Wagner (2017) formulated adversarial examples as an optimization problem that directly minimizes perturbation size:

minimize  ‖δ‖_p + c · f(x + δ)
subject to  x + δ ∈ [0,1]^n

f(x') = max(max{Z(x')_i : i ≠ t} - Z(x')_t, -κ)

Where:
  Z(x') = pre-softmax logits
  t     = target class
  κ     = confidence margin

C&W is slower but finds tighter perturbations than PGD. It broke many early defenses that PGD couldn't crack. The L2, L0, and L∞ variants each have different use cases. C&W attacks are often used to evaluate defense robustness and to generate imperceptible targeted examples.
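
A sketch of the C&W objective on a toy linear three-class model, minimized by (sub)gradient descent. The weights, constants, and step schedule are illustrative; a full attack would also enforce the box constraint and binary-search the constant c:

```python
import numpy as np

# Toy linear 3-class "network": logits Z(x) = W @ x (hypothetical weights).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])

def cw_attack(x, target, c=5.0, lr=0.05, steps=200, kappa=0.2):
    """(Sub)gradient descent on the C&W objective
    ||delta||_2^2 + c * max(max_{i != t} Z(x+delta)_i - Z(x+delta)_t, -kappa)."""
    delta = np.zeros_like(x, dtype=float)
    for _ in range(steps):
        z = W @ (x + delta)
        z_other = z.copy()
        z_other[target] = -np.inf
        i_star = int(np.argmax(z_other))        # strongest non-target class
        if z[i_star] - z[target] > -kappa:      # margin term still active
            grad_f = W[i_star] - W[target]
        else:                                   # f saturated at -kappa
            grad_f = np.zeros_like(x)
        delta -= lr * (2.0 * delta + c * grad_f)
    return delta

x = np.array([1.0, 0.0])                        # logits [1, 0, -1]: class 0
delta = cw_attack(x, target=1)
print(int(np.argmax(W @ (x + delta))))          # 1: targeted misclassification
print(float(np.linalg.norm(delta)) < 1.5)       # True: small L2 perturbation
```

The ‖δ‖² term keeps the perturbation small while the margin term f pushes the target logit above all others by at least κ, which is exactly the trade-off the objective on the previous lines expresses.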

Physical-World Adversarial Examples

Digital adversarial examples are powerful but require direct access to model inputs. Physical-world attacks must survive printing, lighting changes, viewing angle variations, and camera noise, making them harder to construct but more dangerous for real deployments. Robust Physical Perturbations (RP2) and adversarial patches can reliably fool deployed computer vision systems even when captured by cameras under varying conditions.

Autonomous Vehicle Attacks

Researchers have demonstrated attacks on production systems: stop signs with stickers misclassified as speed limit signs, lane markings perturbed with tape to cause lane-keeping failures, and LiDAR spoofing combined with visual adversarial patches. The 2020 McAfee study showed that a small strip of black tape on a 35 mph speed limit sign caused a Tesla's camera system to read it as 85 mph. These attacks have prompted significant investment in robust perception systems from AV manufacturers.

Face Recognition Evasion

Physically realizable attacks on commercial face recognition include specially printed glasses (Sharif et al., 2016, CMU), adversarial makeup patterns, and infrared LED arrays embedded in hats that blind near-infrared cameras. The glasses attack achieved >80% targeted impersonation success against state-of-the-art deep face recognition models. IR attacks work because many surveillance cameras use NIR illumination that's invisible to humans but captured by the sensor; adversarial IR patterns disrupt face detection entirely.

Attack Methods Comparison

Attack | Type | Query Budget | Strength | Execution Difficulty
-------|------|--------------|----------|---------------------
FGSM | White-box, L∞ | 1 gradient eval | Low–Medium | Easy (one-line implementation)
PGD-20 | White-box, L∞ | 20 gradient evals | High | Easy (available in all major adversarial ML libraries)
C&W L2 | White-box, L2 | Hundreds of iterations | Very High | Medium (requires careful hyperparameter tuning)
AutoAttack | White-box, ensemble | Very high | State-of-art | Easy (automated evaluation standard)
Boundary Attack | Black-box, decision-based | 25,000+ queries | Medium | Medium (high query count may trigger rate limits)
Square Attack | Black-box, score-based | 5,000–10,000 queries | High | Medium (efficient score-based black-box attack)
Transfer Attack | Black-box, transferability | 0 target queries | Low–Medium | Medium (requires similar surrogate model)
Physical patch | Physical world | Access to print/deploy | High in-scene | Hard (requires physical access and iteration)

Transferability: The Black-Box Multiplier

Adversarial examples crafted against one model often transfer to other models trained on the same data distribution, even with different architectures. Papernot et al. (2016) demonstrated >84% transfer rates in some settings. This means an attacker who can only query a black-box API can still craft effective attacks by training a local surrogate model. Ensemble adversarial training and architectural diversity reduce (but don't eliminate) transferability.

☣ Poisoning Attacks

Poisoning attacks target the training phase rather than inference. An attacker who can influence training data can cause the resulting model to have degraded accuracy, misclassify specific inputs, or harbor hidden backdoors that activate only on special trigger inputs. These attacks are particularly dangerous because the poisoned model may perform normally on standard test sets, making the compromise invisible during evaluation.

Indiscriminate Poisoning

The attacker wants to degrade overall model accuracy: an availability attack. Injecting training examples with corrupted labels or features drives down accuracy on the test distribution. This is effective against federated learning, where individual participants can contribute poisoned updates. It is harder against centralized training with data validation, but SEO manipulation of web-crawled datasets can achieve this at scale without direct dataset access.

Tags: Availability · Federated Learning

Clean-Label Poisoning

Witches' Brew (Geiping et al., 2021) and related work showed that an attacker can poison a model using correctly labeled examples; no label manipulation is required. The poisoned images look normal to human reviewers but are adversarially crafted to shift the model's decision boundary. This is particularly dangerous for data pipelines that use human spot-checking as a quality control mechanism, since all labels appear legitimate.

Tags: Stealthy · No Label Manipulation

Backdoor / Trojan Attacks

A backdoor attack embeds a secret trigger in the model during training. The model behaves normally on clean inputs but produces attacker-specified outputs whenever the trigger is present. The trigger can be a specific pixel pattern, a watermark, a phrase in text, or even a semantic concept. Neural Cleanse (Wang et al., 2019) and ABS (Artificial Brain Stimulation) are detection methods, but state-of-the-art trojans evade both. Trojanvision benchmarks show detection rates below 70% for adaptive attacks.

Tags: Hidden Trigger · Hard to Detect

Federated Learning as Amplified Attack Surface

Federated learning (FL) is designed to train models without centralizing raw data, but it introduces a new attack surface: each participating client can submit poisoned model updates. Bhagoji et al. (2019) demonstrated model replacement attacks where a malicious participant scales up their poisoned gradient updates to overcome aggregation defenses. Coordinate-wise median and Krum aggregation provide some defense but can be bypassed by adaptive attackers. The aggregate model may contain backdoors that activate on specific clients' data, allowing targeted misclassification in healthcare or financial FL deployments.
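
The coordinate-wise median defense can be sketched in a few lines; the client updates below are made-up numbers for illustration:

```python
import numpy as np

def median_aggregate(updates):
    """Coordinate-wise median of client updates: each parameter takes the
    median across clients, so one scaled-up poisoned update cannot drag
    the aggregate arbitrarily far (unlike the mean)."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([0.10, -0.20]),
          np.array([0.12, -0.18]),
          np.array([0.09, -0.22])]
poisoned = np.array([100.0, -100.0])  # model-replacement style scaled update

print(np.mean(np.stack(honest + [poisoned]), axis=0))  # mean: hijacked by the attacker
print(median_aggregate(honest + [poisoned]))           # median: close to honest updates
```

This is exactly why the attacks above must be adaptive: against the median, a single outlier client is simply voted down, so attackers instead submit many coordinated, moderately poisoned updates.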

NLP Backdoor Attacks

Text-domain trojans use insertion triggers: rare words, specific phrases, or syntactic patterns that activate the backdoor. BadNL inserts trigger words into training sentences. More sophisticated attacks use invisible Unicode characters, homoglyphs, or paraphrasing-based triggers that survive surface-level detection. Sentiment classifiers and toxicity detectors are prime targets since attackers can make a model always classify content containing the trigger as benign, enabling harmful content to evade moderation.

Supply Chain Dataset Poisoning

Large models are trained on web-crawled data (Common Crawl, LAION, The Pile). Researchers demonstrated "sleeper agent" poisoning by publishing web pages optimized to appear in crawls and contain adversarial training examples. Carlini et al. (2023) showed that poisoning just 0.01% of a 180GB web-crawled dataset was sufficient to implant a backdoor in an image classifier. This attack costs under $60 in cloud storage, making it accessible to motivated adversaries targeting open-source model development.

🔒 Model Inversion & Membership Inference

Beyond causing misclassifications, adversarial attacks can also extract private information from models. Model inversion reconstructs sensitive training data from model outputs. Membership inference determines whether a specific data point was used in training, a serious privacy violation with GDPR implications.

Model Inversion Attacks

Fredrikson et al. (2014) demonstrated inverting a warfarin-dosing pharmacogenetics model to reconstruct patient genomic features; their 2015 follow-up showed that inverting face recognition models produces recognizable face images of training subjects using only API access. The attack optimizes an input to maximize the model's confidence for a target class, effectively gradient-ascending in input space. GAN-based inversion (GMI, KED-MI) dramatically improved reconstruction quality, producing photo-realistic training faces from black-box APIs with only confidence scores.

Tags: PII Extraction · GDPR Risk

Membership Inference

Shokri et al. (2017) introduced shadow model attacks: train multiple models on known data and learn a meta-classifier that distinguishes training vs non-training examples based on confidence patterns. Models overfit to training data, so members typically receive higher confidence than non-members. LiRA (Carlini et al., 2022) is the current state-of-the-art: it computes likelihood ratios using shadow models and achieves near-perfect TPR at low FPR on overfit models. Differential privacy during training provides the strongest theoretical defense.

Tags: Privacy Violation · GDPR Art. 17
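
A simplified sketch of the core signal: a fixed confidence threshold rather than Shokri et al.'s learned shadow-model classifier. The confidence distributions below are simulated, not drawn from a real model:

```python
import numpy as np

def threshold_attack(confidences, tau):
    """Predict 'member' when the model's confidence on the true label exceeds
    tau. This is a simplified confidence-threshold variant of membership
    inference; the full shadow-model attack learns the decision rule from
    shadow models instead of fixing a threshold by hand."""
    return confidences > tau

# Simulated confidences from an overfit model: members cluster near 1.0,
# non-members are spread lower (distributions chosen for illustration).
rng = np.random.default_rng(42)
member_conf = rng.beta(8, 1, size=1000)
nonmember_conf = rng.beta(2, 2, size=1000)

preds = np.concatenate([threshold_attack(member_conf, 0.8),
                        threshold_attack(nonmember_conf, 0.8)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
accuracy = np.mean(preds == labels)
print(accuracy > 0.7)  # True: well above the 0.5 random-guessing baseline
```

The gap between the two confidence distributions is precisely the overfitting signal that differential privacy is designed to bound.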

Gradient Leakage in Federated Learning

Zhu et al. (2019) showed that in federated learning, shared gradients can be used to reconstruct the original training batch with near-perfect fidelity, a technique called Deep Leakage from Gradients (DLG). Even quantized or compressed gradients don't fully prevent this, and follow-up attacks such as R-GAP extended reconstruction to more realistic settings. SecAgg (secure aggregation) cryptographically prevents the server from seeing individual gradients and is the primary mitigation, but it adds significant communication overhead.

Tags: Gradient Leakage · Federated Learning

LLM Memorization of PII

Carlini et al. (2021) demonstrated that GPT-2 memorizes and can reproduce verbatim training data including names, phone numbers, addresses, and email addresses. The "extractable memorization" rate scales with model size and training repetitions: larger models trained on repeated data memorize more. The 2023 paper "Quantifying Memorization Across Neural Language Models" showed that scaling laws apply to memorization: a 6.7B parameter model memorized >1% of its training set. This has direct implications for GDPR compliance, since personal data in LLM training sets may be reproducible on demand, violating data minimization and right-to-erasure principles.

GDPR Implications

If a model has memorized personal data and that data can be extracted via membership inference or model inversion, the organization may be unable to comply with GDPR Article 17 (right to erasure) without full model retraining. Machine unlearning is an active research area but computationally expensive and imperfect. Conducting Privacy Impact Assessments before training on personal data is not just best practice; it may be legally required under GDPR Article 35 for high-risk processing.

🛡 Defenses & Robustness

Adversarial defense is an arms race. Many defenses that appeared robust were later broken by adaptive attacks designed specifically against them. The adversarial ML community now demands evaluation against adaptive white-box attacks; defenses that only obscure gradients (gradient masking) rather than truly improving robustness are routinely circumvented. RobustBench maintains a public leaderboard of verified robust models against AutoAttack.

Adversarial Training (PGD-AT)

The most reliable empirical defense: include adversarially perturbed examples in the training set. Madry et al. formulated this as a minimax problem, finding model parameters that minimize loss on the worst-case perturbation of each training example. PGD-AT with a large perturbation budget (ε=8/255 on CIFAR-10) consistently produces the most robust models on leaderboards. Computational cost is ~10x standard training. TRADES (Zhang et al., 2019) improved the accuracy-robustness tradeoff by explicitly regularizing the KL divergence between clean and perturbed predictions.

Tags: Most Reliable · High Compute Cost
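
The minimax loop can be sketched on a toy logistic model with a single-step (FGSM-style) inner maximizer; real PGD-AT uses multi-step PGD on neural networks, so the data, learning rates, and budget below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, eps=0.2, lr=0.1, epochs=200, seed=0):
    """Minimax sketch on a toy logistic model (labels y in {-1, +1}):
    the inner step perturbs each example toward higher loss with a single
    signed-gradient step, the outer step fits weights on the perturbed batch."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]) * 0.01
    for _ in range(epochs):
        # Inner maximization: worst-case L-inf perturbation of each example.
        grad_x = -(y * sigmoid(-y * (X @ w)))[:, None] * w
        X_adv = X + eps * np.sign(grad_x)
        # Outer minimization: logistic-loss gradient step on the adversarial batch.
        margins = y * (X_adv @ w)
        grad_w = -(y * sigmoid(-margins)) @ X_adv / len(y)
        w -= lr * grad_w
    return w

# Two well-separated Gaussian clusters (synthetic data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.3, (50, 2)), rng.normal(-2.0, 0.3, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w = adversarial_train(X, y)

# Robust accuracy: shift every point eps against its label (the worst case
# for a linear model) and check it is still classified correctly.
X_worst = X - 0.2 * np.sign(y[:, None] * w)
print(np.mean(np.sign(X_worst @ w) == y))  # 1.0
```

Training on the perturbed batch rather than the clean one is the entire difference from standard training, and it is also where the ~10x cost comes from: every outer step pays for the inner attack.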

Certified Defenses & Randomized Smoothing

Certified defenses provide mathematical guarantees: provably, no L2 perturbation below radius r can change the prediction. Cohen et al. (2019) showed that adding Gaussian noise to inputs before classification and using majority voting (randomized smoothing) provides certified L2 robustness. The certified radius scales with the noise level: higher noise gives larger certificates but hurts clean accuracy. IBP (Interval Bound Propagation) and CROWN-IBP give certificates for small L∞ perturbations on small networks. Certified robustness on ImageNet at ε=0.5 L2 still lags clean accuracy by ~20%.

Tags: Provable Guarantee · Accuracy Tradeoff
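
The smoothed classifier itself is simple to sketch; the certification step that turns vote counts into a certified radius is omitted, and the base classifier below is a toy:

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.5, n=500, seed=0):
    """Randomized smoothing: classify many Gaussian-noised copies of x and
    return the majority-vote class. Cohen et al. additionally derive a
    certified L2 radius from the vote counts; this sketch shows only the
    smoothed classifier itself."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    votes = np.array([classify(x + eta) for eta in noise])
    classes, counts = np.unique(votes, return_counts=True)
    return int(classes[np.argmax(counts)])

# Toy base classifier with a linear boundary: class 1 iff x1 + x2 > 0.
classify = lambda x: int(x[0] + x[1] > 0)

print(smoothed_predict(classify, np.array([1.0, 1.0])))    # 1 (far from boundary)
print(smoothed_predict(classify, np.array([-1.0, -1.0])))  # 0
```

The noise level sigma is the knob described above: larger sigma smooths over larger perturbations (bigger certificates) but blurs the base classifier's view of clean inputs.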

Input Preprocessing Defenses

Feature squeezing (Xu et al., 2018): reduce color depth or apply spatial smoothing to remove adversarial perturbations, then compare the model's outputs on the original and squeezed inputs; large divergence signals an adversarial example. JPEG compression, denoising autoencoders, and diffusion-model purification (DiffPure, 2022) are other preprocessing approaches. DiffPure achieved strong results by denoising inputs using a pre-trained diffusion model before classification. However, adaptive attacks that account for the purification step can often still succeed, making evaluation tricky.

Tags: Detection-Based · Adaptive Attack Vulnerable
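
A feature-squeezing sketch with bit-depth reduction and an output-divergence check; the steep two-class model and the detection threshold are toy stand-ins chosen for illustration:

```python
import numpy as np

def squeeze_bits(x, bits=3):
    """Reduce color depth to `bits` bits per channel (inputs in [0, 1])."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def is_adversarial(model_probs, x, threshold=0.5):
    """Flag x when the model's predictions on the original and squeezed
    versions diverge (L1 distance over the probability vector)."""
    diff = np.abs(model_probs(x) - model_probs(squeeze_bits(x))).sum()
    return bool(diff > threshold)

def model_probs(x):
    # Hypothetical two-class model with a steep boundary at mean(x) = 0.5.
    p = 1.0 / (1.0 + np.exp(-50.0 * (x.mean() - 0.5)))
    return np.array([p, 1.0 - p])

clean = np.full(4, 0.8)
borderline = np.full(4, 0.51)  # adversarial-like input hugging the boundary
print(is_adversarial(model_probs, clean))       # False
print(is_adversarial(model_probs, borderline))  # True
```

The detector fires because inputs that hug the decision boundary (as adversarial examples typically do) change prediction drastically once their low-order bits are squeezed away, while confidently classified clean inputs barely move.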

Defense Methods Summary

Defense | Mechanism | Robustness Type | Accuracy Cost | Adaptive Attack Resistant?
--------|-----------|-----------------|---------------|---------------------------
PGD-AT | Adversarial training | Empirical (L∞) | ~10–15% clean acc loss | Partially (strongest empirical defense)
TRADES | Regularized adv. training | Empirical (L∞) | ~8–12% clean acc loss | Partially (improved tradeoff vs PGD-AT)
Randomized Smoothing | Gaussian noise + majority vote | Certified (L2) | ~15–25% clean acc loss | Yes (provable guarantee within radius)
IBP / CROWN-IBP | Bound propagation | Certified (L∞, small ε) | High (practical only for small nets) | Yes (provable but limited scale)
Feature Squeezing | Input preprocessing + detection | Detection | Minimal on clean inputs | No (broken by adaptive attacks)
DiffPure | Diffusion model denoising | Empirical | Inference slowdown | Partially (strong but adaptive attacks exist)
Ensemble Diversity | Multiple diverse models | Empirical (reduces transfer) | Marginal | Partially (reduces transferability)

NIST AI RMF on Adversarial Robustness

The NIST AI Risk Management Framework (AI RMF 1.0, 2023) addresses adversarial robustness under the MANAGE function. Key guidance includes: conducting red-team evaluations using diverse attack methods, documenting robustness results in model cards, establishing robustness benchmarks for high-stakes deployments, and tracking emerging adversarial techniques through threat intelligence. NIST AI 100-2 ("Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations", 2023) provides specific guidance on adversarial ML threat models and mitigations aligned with the AI RMF.

The Accuracy-Robustness Tradeoff

Adversarial robustness and standard accuracy are fundamentally in tension. A model adversarially trained to be robust to ε=8/255 L∞ perturbations on CIFAR-10 typically achieves ~84% clean accuracy vs ~95% for a standard model. This tradeoff has theoretical foundations: Tsipras et al. (2019) showed that for some distributions, robust and accurate classifiers must use different features. In practice, deploying adversarially robust models involves a conscious business decision: how much clean-accuracy degradation is acceptable for the robustness gained? Never deploy adversarial defenses without characterizing both their robustness guarantees AND their impact on normal-case performance.