Adversarial Machine Learning
12 min read · Advanced · Updated Jan 2025

⚠ The Adversarial Threat Landscape

Adversarial machine learning is the study of attacks that intentionally manipulate ML systems to produce incorrect outputs, and the defenses against those attacks. Unlike traditional software bugs, adversarial vulnerabilities are often fundamental to how neural networks generalize, making them extraordinarily difficult to fully eliminate. The field emerged prominently in 2013 with Szegedy et al.'s discovery that imperceptibly small pixel perturbations could reliably flip a neural network's classification.

Attacker Knowledge Model

The threat model defines how much an attacker knows about the target system:

  • White-box: Full access to model architecture, weights, and gradients. Attacker can compute exact adversarial examples. Represents an insider or model leak scenario.
  • Grey-box: Knows architecture but not weights, or has partial information. Attacks rely on surrogate models or limited gradient estimates.
  • Black-box: Only query access; the attacker sees inputs and outputs (possibly confidence scores). Most realistic for cloud ML APIs. Relies on transferability and finite-difference gradient estimation.
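
The finite-difference gradient estimation mentioned for the black-box setting can be sketched in a few lines: query the target's score function twice per input dimension and difference the results. The linear scoring function below is a stand-in for a remote API, used purely for illustration:

```python
import numpy as np

def estimate_gradient(f, x, sigma=1e-3):
    """Estimate the gradient of a query-only score function f at x using
    central finite differences: two queries per input dimension."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = sigma
        grad.flat[i] = (f(x + e) - f(x - e)) / (2.0 * sigma)
    return grad

# Stand-in for a remote scoring API the attacker can only query.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)

x0 = np.ones(3)
print(np.allclose(estimate_gradient(f, x0), w, atol=1e-6))  # True for a linear score
```

Real score-based attacks (e.g. NES-style estimators) sample random directions instead of probing every dimension, since image inputs make per-dimension probing prohibitively query-hungry.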

Attack Goals

Adversarial attacks differ by what the attacker wants to achieve:

  • Untargeted: Cause any misclassification. Easier to achieve; sufficient for bypassing content moderation or safety classifiers.
  • Targeted: Force a specific wrong prediction (e.g., classify a malware sample as benign, or misidentify a face as a specific person). Harder but more useful to attackers.
  • Confidence reduction: Lower model confidence below a threshold to trigger a fallback path or human review; useful for evading fraud detection.
  • Availability attack: Degrade model accuracy across all inputs (poisoning) to render it unusable.

Why Neural Networks Are Vulnerable

Several theoretical explanations have been proposed:

  • Linearity hypothesis (Goodfellow et al., 2014): High-dimensional linear models accumulate small perturbations across many dimensions, making FGSM attacks possible.
  • High-dimensional geometry: In high dimensions, the volume near a decision boundary is enormous; tiny steps in the input space can cross boundaries.
  • Excessive feature reliance: Ilyas et al. (2019) argue adversarial examples exploit "non-robust features" that are genuinely predictive but incomprehensible to humans.
  • Overparameterization: Models fit training data so tightly that they learn brittle shortcut features that don't generalize to perturbed inputs.

Real-World Incidents & Research

Year | Incident / Research | Impact | Type
-----|---------------------|--------|-----
2017 | Physical adversarial stop signs (Eykholt et al., UMich) | Stop sign misclassified as Speed Limit 45 at 100% rate under various conditions | Evasion (physical world)
2016 | Fooling face recognition with printed glasses (Sharif et al., CMU) | Impersonated specific target individuals with physical props | Targeted evasion
2020 | Tesla speed-limit sign tape attack (McAfee research) | Small strip of tape on a 35 mph sign caused Tesla's camera system to read 85 mph | Physical evasion
2021 | Backdoor attacks on NLP models (hidden trigger research) | Sentiment classifiers flipped output on specific rare tokens | Trojan / backdoor
2022 | DALL-E/Stable Diffusion evasion | Adversarial prompts bypassed content filters in generative image models | Evasion (generative)
2023–24 | Adversarial patches in autonomous driving benchmarks | Printed patches caused object detection failures on Waymo/KITTI datasets | Physical evasion

⚡ Evasion Attacks

Evasion attacks craft malicious inputs at inference time: the model is already deployed, and the attacker wants to cause it to misclassify their specific input. This is distinct from poisoning, which attacks training. Evasion is the dominant threat for deployed classifiers in security tooling, content moderation, and autonomous systems.

FGSM β€” Fast Gradient Sign Method

Introduced by Goodfellow et al. (2014), FGSM is the simplest white-box evasion attack. It perturbs every input feature in the direction that maximally increases the loss, scaled by epsilon:

x_adv = x + ε · sign(∇_x J(θ, x, y))

Where:
  x       = original input
  ε       = perturbation budget (e.g. 8/255 for images)
  ∇_x J   = gradient of loss w.r.t. input
  sign(·) = element-wise sign function
  y       = true label

FGSM is a single-step attack. It's fast but not the strongest; it's primarily used as a baseline and for adversarial training data generation. Epsilon controls the trade-off between attack strength and human perceptibility.
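
A minimal FGSM sketch against a toy logistic model (not any particular library's API); the weights and input are made-up, and the closed-form input gradient used below follows from the logistic loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step against a toy logistic model p(y=+1|x) = sigmoid(w @ x).
    For the loss J = -log sigmoid(y * w @ x) with labels y in {-1, +1},
    the input gradient is -y * sigmoid(-y * w @ x) * w."""
    grad = -y * sigmoid(-y * (w @ x)) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])   # hypothetical trained weights
x = np.array([0.3, -0.2, 0.1])   # w @ x = 0.75, so the model predicts +1
x_adv = fgsm(x, y=1, w=w, eps=0.4)
print(np.sign(w @ x))      # 1.0  (original prediction)
print(np.sign(w @ x_adv))  # -1.0 (prediction flipped by the perturbation)
```

For a deep network the gradient comes from backpropagation rather than a closed form, but the single signed step is identical.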

PGD β€” Projected Gradient Descent

Madry et al. (2018) showed that iterating FGSM with projection onto the L∞ epsilon-ball produces much stronger adversarial examples:

x^(t+1) = Π_{x+S}(x^t + α · sign(∇_x J(θ, x^t, y)))

Where:
  Π = projection onto the ε-ball
  α = step size per iteration
  S = allowed perturbation set
  t = iteration number

PGD with random restarts is considered the strongest first-order attack and the gold standard for adversarial training. Typically run for 20–100 steps. PGD-AT (adversarial training with PGD) remains the most reliable empirical defense.
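
The iterate-and-project loop can be sketched against a toy logistic model (illustrative weights and input, not a library API); projection onto the L∞ ball is just clipping:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd(x, y, w, eps, alpha, steps, seed=0):
    """PGD against a toy logistic model p(y=+1|x) = sigmoid(w @ x):
    start at a random point in the eps-ball, take signed-gradient steps
    of size alpha, and project back onto the L-inf ball after each step."""
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(steps):
        grad = -y * sigmoid(-y * (w @ x_adv)) * w     # input gradient of the loss
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)      # projection onto the eps-ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])   # hypothetical trained weights
x = np.array([0.3, -0.2, 0.1])   # classified +1 (w @ x = 0.75)
x_adv = pgd(x, y=1, w=w, eps=0.4, alpha=0.1, steps=20)
print(np.sign(w @ x_adv))                 # -1.0: prediction flipped
print(np.max(np.abs(x_adv - x)) <= 0.4)   # True: perturbation stays in the ball
```

Random restarts simply rerun the loop from several random starting points and keep the perturbation with the highest loss.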

C&W Attack

Carlini & Wagner (2017) formulated adversarial examples as an optimization problem that directly minimizes perturbation size:

minimize  ‖δ‖_p + c · f(x + δ)
subject to  x + δ ∈ [0,1]^n

f(x') = max(max{Z(x')_i : i ≠ t} - Z(x')_t, -κ)

Where:
  Z(x') = pre-softmax logits
  t     = target class
  κ     = confidence margin

C&W is slower but finds tighter perturbations than PGD. It broke many early defenses that PGD couldn't crack. The L2, L0, and L∞ variants each have different use cases. C&W attacks are often used to evaluate defense robustness and to generate imperceptible targeted examples.
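
A sketch of the C&W objective on a toy linear three-class model, minimized by (sub)gradient descent. The weights, constants, and step schedule are illustrative; a full attack would also enforce the box constraint and binary-search the constant c:

```python
import numpy as np

# Toy linear 3-class "network": logits Z(x) = W @ x (hypothetical weights).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])

def cw_attack(x, target, c=5.0, lr=0.05, steps=200, kappa=0.2):
    """(Sub)gradient descent on the C&W objective
    ||delta||_2^2 + c * max(max_{i != t} Z(x+delta)_i - Z(x+delta)_t, -kappa)."""
    delta = np.zeros_like(x, dtype=float)
    for _ in range(steps):
        z = W @ (x + delta)
        z_other = z.copy()
        z_other[target] = -np.inf
        i_star = int(np.argmax(z_other))        # strongest non-target class
        if z[i_star] - z[target] > -kappa:      # margin term still active
            grad_f = W[i_star] - W[target]
        else:                                   # f saturated at -kappa
            grad_f = np.zeros_like(x)
        delta -= lr * (2.0 * delta + c * grad_f)
    return delta

x = np.array([1.0, 0.0])                        # logits [1, 0, -1]: class 0
delta = cw_attack(x, target=1)
print(int(np.argmax(W @ (x + delta))))          # 1: targeted misclassification
print(float(np.linalg.norm(delta)) < 1.5)       # True: small L2 perturbation
```

The ‖δ‖² term keeps the perturbation small while the margin term f pushes the target logit above all others by at least κ, which is exactly the trade-off the objective on the previous lines expresses.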

Physical-World Adversarial Examples

Digital adversarial examples are powerful but require direct access to model inputs. Physical-world attacks must survive printing, lighting changes, viewing angle variations, and camera noise, making them harder to construct but more dangerous for real deployments. Robust Physical Perturbations (RP2) and adversarial patches can reliably fool deployed computer vision systems even when captured by cameras under varying conditions.

Autonomous Vehicle Attacks

Researchers have demonstrated attacks on production systems: stop signs with stickers misclassified as speed limit signs, lane markings perturbed with tape to cause lane-keeping failures, and LiDAR spoofing combined with visual adversarial patches. The 2020 McAfee study showed that a small strip of black tape on a 35 mph speed limit sign caused a Tesla's camera system to read it as 85 mph. These attacks have prompted significant investment in robust perception systems from AV manufacturers.

Face Recognition Evasion

Physically realizable attacks on commercial face recognition include specially printed glasses (Sharif et al., 2016, CMU), adversarial makeup patterns, and infrared LED arrays embedded in hats that blind near-infrared cameras. The glasses attack achieved >80% targeted impersonation success against state-of-the-art deep face recognition models. IR attacks work because many surveillance cameras use NIR illumination that's invisible to humans but captured by the sensor; adversarial IR patterns disrupt face detection entirely.

Attack Methods Comparison

Attack | Type | Query Budget | Strength | Execution Difficulty
-------|------|--------------|----------|---------------------
FGSM | White-box, L∞ | 1 gradient eval | Low–Medium | Easy (one-line implementation)
PGD-20 | White-box, L∞ | 20 gradient evals | High | Easy (available in all major adversarial ML libraries)
C&W L2 | White-box, L2 | Hundreds of iterations | Very High | Medium (requires careful hyperparameter tuning)
AutoAttack | White-box, ensemble | Very high | State-of-art | Easy (automated evaluation standard)
Boundary Attack | Black-box, decision-based | 25,000+ queries | Medium | Medium (high query count may trigger rate limits)
Square Attack | Black-box, score-based | 5,000–10,000 queries | High | Medium (efficient score-based black-box attack)
Transfer Attack | Black-box, transferability | 0 target queries | Low–Medium | Medium (requires similar surrogate model)
Physical patch | Physical world | Access to print/deploy | High in-scene | Hard (requires physical access and iteration)

Transferability: The Black-Box Multiplier

Adversarial examples crafted against one model often transfer to other models trained on the same data distribution, even with different architectures. Papernot et al. (2016) demonstrated >84% transfer rates in some settings. This means an attacker who can only query a black-box API can still craft effective attacks by training a local surrogate model. Ensemble adversarial training and architectural diversity reduce (but don't eliminate) transferability.

☣ Poisoning Attacks

Poisoning attacks target the training phase rather than inference. An attacker who can influence training data can cause the resulting model to have degraded accuracy, misclassify specific inputs, or harbor hidden backdoors that activate only on special trigger inputs. These attacks are particularly dangerous because the poisoned model may perform normally on standard test sets, making the compromise invisible during evaluation.

Indiscriminate Poisoning

The attacker wants to degrade overall model accuracy: an availability attack. Injecting training examples with corrupted labels or features drives down accuracy on the test distribution. This is effective against federated learning, where individual participants can contribute poisoned updates. It is harder against centralized training with data validation, but SEO manipulation of web-crawled datasets can achieve this at scale without direct dataset access.

Tags: Availability · Federated Learning

Clean-Label Poisoning

Witches' Brew (Geiping et al., 2021) and related work showed that an attacker can poison a model using correctly labeled examples; no label manipulation is required. The poisoned images look normal to human reviewers but are adversarially crafted to shift the model's decision boundary. This is particularly dangerous for data pipelines that use human spot-checking as a quality control mechanism, since all labels appear legitimate.

Tags: Stealthy · No Label Manipulation

Backdoor / Trojan Attacks

A backdoor attack embeds a secret trigger in the model during training. The model behaves normally on clean inputs but produces attacker-specified outputs whenever the trigger is present. The trigger can be a specific pixel pattern, a watermark, a phrase in text, or even a semantic concept. Neural Cleanse (Wang et al., 2019) and ABS (Artificial Brain Stimulation) are detection methods, but state-of-the-art trojans evade both. Trojanvision benchmarks show detection rates below 70% for adaptive attacks.

Tags: Hidden Trigger · Hard to Detect

Federated Learning as Amplified Attack Surface

Federated learning (FL) is designed to train models without centralizing raw data, but it introduces a new attack surface: each participating client can submit poisoned model updates. Bhagoji et al. (2019) demonstrated model replacement attacks where a malicious participant scales up their poisoned gradient updates to overcome aggregation defenses. Coordinate-wise median and Krum aggregation provide some defense but can be bypassed by adaptive attackers. The aggregate model may contain backdoors that activate on specific clients' data, allowing targeted misclassification in healthcare or financial FL deployments.
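
The coordinate-wise median defense can be sketched in a few lines; the client updates below are made-up numbers for illustration:

```python
import numpy as np

def median_aggregate(updates):
    """Coordinate-wise median of client updates: each parameter takes the
    median across clients, so one scaled-up poisoned update cannot drag
    the aggregate arbitrarily far (unlike the mean)."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([0.10, -0.20]),
          np.array([0.12, -0.18]),
          np.array([0.09, -0.22])]
poisoned = np.array([100.0, -100.0])  # model-replacement style scaled update

print(np.mean(np.stack(honest + [poisoned]), axis=0))  # mean: hijacked by the attacker
print(median_aggregate(honest + [poisoned]))           # median: close to honest updates
```

This is exactly why the attacks above must be adaptive: against the median, a single outlier client is simply voted down, so attackers instead submit many coordinated, moderately poisoned updates.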

NLP Backdoor Attacks

Text-domain trojans use insertion triggers: rare words, specific phrases, or syntactic patterns that activate the backdoor. BadNL inserts trigger words into training sentences. More sophisticated attacks use invisible Unicode characters, homoglyphs, or paraphrasing-based triggers that survive surface-level detection. Sentiment classifiers and toxicity detectors are prime targets since attackers can make a model always classify content containing the trigger as benign, enabling harmful content to evade moderation.

Supply Chain Dataset Poisoning

Large models are trained on web-crawled data (Common Crawl, LAION, The Pile). Researchers demonstrated "sleeper agent" poisoning by publishing web pages optimized to appear in crawls and contain adversarial training examples. Carlini et al. (2023) showed that poisoning just 0.01% of a 180GB web-crawled dataset was sufficient to implant a backdoor in an image classifier. This attack costs under $60 in cloud storage, making it accessible to motivated adversaries targeting open-source model development.

🔒 Model Inversion & Membership Inference

Beyond causing misclassifications, adversarial attacks can also extract private information from models. Model inversion reconstructs sensitive training data from model outputs. Membership inference determines whether a specific data point was used in training, a serious privacy violation with GDPR implications.

Model Inversion Attacks

Fredrikson et al. (2014) demonstrated inverting a warfarin-dosing pharmacogenetics model to reconstruct patient genomic features; their 2015 follow-up showed that inverting face recognition models produces recognizable face images of training subjects using only API access. The attack optimizes an input to maximize the model's confidence for a target class, effectively gradient-ascending in input space. GAN-based inversion (GMI, KED-MI) dramatically improved reconstruction quality, producing photo-realistic training faces from black-box APIs with only confidence scores.

Tags: PII Extraction · GDPR Risk

Membership Inference

Shokri et al. (2017) introduced shadow model attacks: train multiple models on known data and learn a meta-classifier that distinguishes training vs non-training examples based on confidence patterns. Models overfit to training data, so members typically receive higher confidence than non-members. LiRA (Carlini et al., 2022) is the current state-of-the-art: it computes likelihood ratios using shadow models and achieves near-perfect TPR at low FPR on overfit models. Differential privacy during training provides the strongest theoretical defense.

Tags: Privacy Violation · GDPR Art. 17
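
A simplified sketch of the core signal: a fixed confidence threshold rather than Shokri et al.'s learned shadow-model classifier. The confidence distributions below are simulated, not drawn from a real model:

```python
import numpy as np

def threshold_attack(confidences, tau):
    """Predict 'member' when the model's confidence on the true label exceeds
    tau. This is a simplified confidence-threshold variant of membership
    inference; the full shadow-model attack learns the decision rule from
    shadow models instead of fixing a threshold by hand."""
    return confidences > tau

# Simulated confidences from an overfit model: members cluster near 1.0,
# non-members are spread lower (distributions chosen for illustration).
rng = np.random.default_rng(42)
member_conf = rng.beta(8, 1, size=1000)
nonmember_conf = rng.beta(2, 2, size=1000)

preds = np.concatenate([threshold_attack(member_conf, 0.8),
                        threshold_attack(nonmember_conf, 0.8)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
accuracy = np.mean(preds == labels)
print(accuracy > 0.7)  # True: well above the 0.5 random-guessing baseline
```

The gap between the two confidence distributions is precisely the overfitting signal that differential privacy is designed to bound.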

Gradient Leakage in Federated Learning

Zhu et al. (2019) showed that in federated learning, shared gradients can be used to reconstruct the original training batch with near-perfect fidelity, a technique called Deep Leakage from Gradients (DLG). Even quantized or compressed gradients don't fully prevent this, and follow-up attacks such as R-GAP extended reconstruction to more realistic settings. SecAgg (secure aggregation) cryptographically prevents the server from seeing individual gradients and is the primary mitigation, but it adds significant communication overhead.

Tags: Gradient Leakage · Federated Learning

LLM Memorization of PII

Carlini et al. (2021) demonstrated that GPT-2 memorizes and can reproduce verbatim training data including names, phone numbers, addresses, and email addresses. The "extractable memorization" rate scales with model size and training repetitions: larger models trained on repeated data memorize more. The 2023 paper "Quantifying Memorization Across Neural Language Models" showed that scaling laws apply to memorization: a 6.7B parameter model memorized >1% of its training set. This has direct implications for GDPR compliance, since personal data in LLM training sets may be reproducible on demand, violating data minimization and right-to-erasure principles.

GDPR Implications

If a model has memorized personal data and that data can be extracted via membership inference or model inversion, the organization may be unable to comply with GDPR Article 17 (right to erasure) without full model retraining. Machine unlearning is an active research area but computationally expensive and imperfect. Conducting Privacy Impact Assessments before training on personal data is not just best practice; it may be legally required under GDPR Article 35 for high-risk processing.

🛡 Defenses & Robustness

Adversarial defense is an arms race. Many defenses that appeared robust were later broken by adaptive attacks designed specifically against them. The adversarial ML community now demands evaluation against adaptive white-box attacks; defenses that only obscure gradients (gradient masking) rather than truly improving robustness are routinely circumvented. RobustBench maintains a public leaderboard of verified robust models against AutoAttack.

Adversarial Training (PGD-AT)

The most reliable empirical defense: include adversarially perturbed examples in the training set. Madry et al. formulated this as a minimax problem, finding model parameters that minimize loss on the worst-case perturbation of each training example. PGD-AT with a large perturbation budget (ε=8/255 on CIFAR-10) consistently produces the most robust models on leaderboards. Computational cost is ~10x standard training. TRADES (Zhang et al., 2019) improved the accuracy-robustness tradeoff by explicitly regularizing the KL divergence between clean and perturbed predictions.

Tags: Most Reliable · High Compute Cost
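
The minimax loop can be sketched on a toy logistic model with a single-step (FGSM-style) inner maximizer; real PGD-AT uses multi-step PGD on neural networks, so the data, learning rates, and budget below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, eps=0.2, lr=0.1, epochs=200, seed=0):
    """Minimax sketch on a toy logistic model (labels y in {-1, +1}):
    the inner step perturbs each example toward higher loss with a single
    signed-gradient step, the outer step fits weights on the perturbed batch."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]) * 0.01
    for _ in range(epochs):
        # Inner maximization: worst-case L-inf perturbation of each example.
        grad_x = -(y * sigmoid(-y * (X @ w)))[:, None] * w
        X_adv = X + eps * np.sign(grad_x)
        # Outer minimization: logistic-loss gradient step on the adversarial batch.
        margins = y * (X_adv @ w)
        grad_w = -(y * sigmoid(-margins)) @ X_adv / len(y)
        w -= lr * grad_w
    return w

# Two well-separated Gaussian clusters (synthetic data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.3, (50, 2)), rng.normal(-2.0, 0.3, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w = adversarial_train(X, y)

# Robust accuracy: shift every point eps against its label (the worst case
# for a linear model) and check it is still classified correctly.
X_worst = X - 0.2 * np.sign(y[:, None] * w)
print(np.mean(np.sign(X_worst @ w) == y))  # 1.0
```

Training on the perturbed batch rather than the clean one is the entire difference from standard training, and it is also where the ~10x cost comes from: every outer step pays for the inner attack.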

Certified Defenses & Randomized Smoothing

Certified defenses provide mathematical guarantees: provably, no L2 perturbation below radius r can change the prediction. Cohen et al. (2019) showed that adding Gaussian noise to inputs before classification and using majority voting (randomized smoothing) provides certified L2 robustness. The certified radius scales with the noise level: higher noise gives larger certificates but hurts clean accuracy. IBP (Interval Bound Propagation) and CROWN-IBP give certificates for small L∞ perturbations on small networks. Certified robustness on ImageNet at ε=0.5 L2 still lags clean accuracy by ~20%.

Tags: Provable Guarantee · Accuracy Tradeoff
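
The smoothed classifier itself is simple to sketch; the certification step that turns vote counts into a certified radius is omitted, and the base classifier below is a toy:

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.5, n=500, seed=0):
    """Randomized smoothing: classify many Gaussian-noised copies of x and
    return the majority-vote class. Cohen et al. additionally derive a
    certified L2 radius from the vote counts; this sketch shows only the
    smoothed classifier itself."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    votes = np.array([classify(x + eta) for eta in noise])
    classes, counts = np.unique(votes, return_counts=True)
    return int(classes[np.argmax(counts)])

# Toy base classifier with a linear boundary: class 1 iff x1 + x2 > 0.
classify = lambda x: int(x[0] + x[1] > 0)

print(smoothed_predict(classify, np.array([1.0, 1.0])))    # 1 (far from boundary)
print(smoothed_predict(classify, np.array([-1.0, -1.0])))  # 0
```

The noise level sigma is the knob described above: larger sigma smooths over larger perturbations (bigger certificates) but blurs the base classifier's view of clean inputs.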

Input Preprocessing Defenses

Feature squeezing (Xu et al., 2018): reduce color depth or apply spatial smoothing to remove adversarial perturbations, then compare the model's outputs on the original and squeezed inputs; large divergence signals an adversarial example. JPEG compression, denoising autoencoders, and diffusion-model purification (DiffPure, 2022) are other preprocessing approaches. DiffPure achieved strong results by denoising inputs using a pre-trained diffusion model before classification. However, adaptive attacks that account for the purification step can often still succeed, making evaluation tricky.

Tags: Detection-Based · Adaptive Attack Vulnerable
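
A feature-squeezing sketch with bit-depth reduction and an output-divergence check; the steep two-class model and the detection threshold are toy stand-ins chosen for illustration:

```python
import numpy as np

def squeeze_bits(x, bits=3):
    """Reduce color depth to `bits` bits per channel (inputs in [0, 1])."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def is_adversarial(model_probs, x, threshold=0.5):
    """Flag x when the model's predictions on the original and squeezed
    versions diverge (L1 distance over the probability vector)."""
    diff = np.abs(model_probs(x) - model_probs(squeeze_bits(x))).sum()
    return bool(diff > threshold)

def model_probs(x):
    # Hypothetical two-class model with a steep boundary at mean(x) = 0.5.
    p = 1.0 / (1.0 + np.exp(-50.0 * (x.mean() - 0.5)))
    return np.array([p, 1.0 - p])

clean = np.full(4, 0.8)
borderline = np.full(4, 0.51)  # adversarial-like input hugging the boundary
print(is_adversarial(model_probs, clean))       # False
print(is_adversarial(model_probs, borderline))  # True
```

The detector fires because inputs that hug the decision boundary (as adversarial examples typically do) change prediction drastically once their low-order bits are squeezed away, while confidently classified clean inputs barely move.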

Defense Methods Summary

Defense | Mechanism | Robustness Type | Accuracy Cost | Adaptive Attack Resistant?
--------|-----------|-----------------|---------------|---------------------------
PGD-AT | Adversarial training | Empirical (L∞) | ~10–15% clean acc loss | Partially (strongest empirical defense)
TRADES | Regularized adv. training | Empirical (L∞) | ~8–12% clean acc loss | Partially (improved tradeoff vs PGD-AT)
Randomized Smoothing | Gaussian noise + majority vote | Certified (L2) | ~15–25% clean acc loss | Yes (provable guarantee within radius)
IBP / CROWN-IBP | Bound propagation | Certified (L∞, small ε) | High (practical only for small nets) | Yes (provable but limited scale)
Feature Squeezing | Input preprocessing + detection | Detection | Minimal on clean inputs | No (broken by adaptive attacks)
DiffPure | Diffusion model denoising | Empirical | Inference slowdown | Partially (strong but adaptive attacks exist)
Ensemble Diversity | Multiple diverse models | Empirical (reduces transfer) | Marginal | Partially (reduces transferability)

NIST AI RMF on Adversarial Robustness

The NIST AI Risk Management Framework (AI RMF 1.0, 2023) addresses adversarial robustness under the MANAGE function. Key guidance includes: conducting red-team evaluations using diverse attack methods, documenting robustness results in model cards, establishing robustness benchmarks for high-stakes deployments, and tracking emerging adversarial techniques through threat intelligence. NIST AI 100-2 ("Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations", 2023) provides specific guidance on adversarial ML threat models and mitigations aligned with the AI RMF.

The Accuracy-Robustness Tradeoff

Adversarial robustness and standard accuracy are fundamentally in tension. A model adversarially trained to be robust to ε=8/255 L∞ perturbations on CIFAR-10 typically achieves ~84% clean accuracy vs ~95% for a standard model. This tradeoff has theoretical foundations: Tsipras et al. (2019) showed that for some distributions, robust and accurate classifiers must use different features. In practice, deploying adversarially robust models involves a conscious business decision: how much clean-accuracy degradation is acceptable for the robustness gained? Never deploy adversarial defenses without characterizing both their robustness guarantees AND their impact on normal-case performance.