⏱ 7 min read πŸ“Š Beginner πŸ—“ Updated Jan 2025

⚠ The Imbalance Problem

Class imbalance occurs when the distribution of target labels is skewed — one class vastly outnumbers another. This is not an edge case: it is the default condition in almost every real-world classification problem of practical interest. Fraud, disease, intrusion, defect — the things we most want to detect are almost always rare.

The Accuracy Trap

In a fraud detection dataset where 99.9% of transactions are legitimate, a model that always predicts "not fraud" achieves 99.9% accuracy. On paper, that looks near-perfect. It also catches exactly zero fraudsters. Accuracy is a useless metric for imbalanced problems, and this is not a subtle issue: teams that don't notice it often ship worthless models straight to production.

Real-World Imbalance Ratios

  • Credit card fraud β€” typically 0.1–0.5% fraud; ratio ~1:200 to 1:1000
  • Medical diagnosis (rare disease) β€” prevalence 1:10,000 or worse in population screening
  • Network intrusion detection β€” attack traffic often <1% of total packets
  • Manufacturing defect detection β€” defect rate often 0.1–5% of production runs
  • Churn prediction β€” typically 5–15% churn; moderate imbalance
  • Spam detection β€” 20–40% spam in most systems; mild imbalance, well-studied

The Confusion Matrix Tells the Truth

For imbalanced classification, always look at the full confusion matrix, not just accuracy. The tell-tale sign of a model that learned nothing:

  • True Positive (TP) β€” minority class correctly identified
  • True Negative (TN) β€” majority class correctly identified (very high for naΓ―ve classifier)
  • False Positive (FP) β€” majority predicted as minority (Type I error)
  • False Negative (FN) β€” minority predicted as majority (Type II error β€” usually catastrophic)
  • A model that always predicts majority: TP=0, FN=all minority examples β€” 100% miss rate on the thing you care about
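To make the last bullet concrete, here is a minimal sketch, using made-up label counts (990 legitimate vs 10 fraud), of what the majority-only classifier's confusion matrix looks like:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical 1% fraud rate: 990 legitimate (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)  # naive model: always predict "not fraud"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)
print(f"accuracy={acc:.3f}")   # 0.990 -- looks great
print(f"TP={tp}, FN={fn}")     # TP=0, FN=10 -- misses every fraud case
```

The accuracy alone is indistinguishable from a genuinely good model; only the TP/FN cells reveal that nothing was learned.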

🔄 Resampling Techniques

Resampling modifies the training data distribution to balance the classes. The goal is to give the model equal opportunity to learn patterns from both minority and majority classes during training β€” while evaluating on the true (imbalanced) distribution.

  • Random Oversampling (oversample) — duplicates minority examples at random. Use for: a quick baseline; mild imbalance; small datasets
  • SMOTE (oversample) — creates synthetic minority examples by interpolating between nearest neighbours. Use for: tabular data; moderate imbalance (1:10 to 1:100); the default choice
  • ADASYN (oversample) — adaptive SMOTE that generates more samples near the decision boundary (harder examples). Use for: when SMOTE under-performs; complex decision boundaries
  • Random Undersampling (undersample) — randomly removes majority examples. Use for: very large datasets where training cost is the bottleneck and information loss is acceptable
  • Tomek Links (undersample) — removes majority-class examples that are nearest neighbours of minority examples (border cleaning). Use for: cleaning the boundary rather than heavy undersampling; combine with SMOTE
  • NearMiss (undersample) — keeps only the majority examples closest to minority examples (versions 1/2/3). Use for: when dataset size prevents oversampling; tends to be aggressive, so test carefully
  • SMOTETomek / SMOTEENN (combined) — SMOTE oversampling plus Tomek/ENN cleaning of noisy samples. Use for: the best general-purpose combination for tabular imbalanced data

How SMOTE Works

SMOTE (Synthetic Minority Over-sampling TEchnique) generates new minority-class samples by:

  • For each minority sample, find its k nearest neighbours (default k=5) within the minority class
  • Randomly select one neighbour
  • Create a new sample on the line segment between the original and the selected neighbour: new = original + Ξ» Γ— (neighbour βˆ’ original) where Ξ» ∈ [0,1]
  • Repeat until the desired ratio is reached
  • Result: more diverse minority examples that fill in the feature space, rather than just duplicates
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create severely imbalanced dataset (1:50 ratio)
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    weights=[0.98, 0.02], random_state=42
)
print(f"Class distribution: {np.bincount(y)}")  # ~4900 vs ~100

# CRITICAL: Split BEFORE resampling (never resample before the split!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify preserves ratio
)

# ---- SMOTE ----
smote = SMOTE(k_neighbors=5, random_state=42, sampling_strategy="minority")
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {np.bincount(y_train_sm)}")  # balanced

# ---- ADASYN (adaptive) ----
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_train_ad, y_train_ad = adasyn.fit_resample(X_train, y_train)

# ---- SMOTETomek (oversample + clean boundary) ----
smote_tomek = SMOTETomek(random_state=42)
X_train_st, y_train_st = smote_tomek.fit_resample(X_train, y_train)

# ---- Best practice: use imbalanced-learn Pipeline ----
# Resampling ONLY happens during fit, NOT during transform/predict
pipeline = ImbPipeline([
    ("oversample", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validation with stratified folds (resampling happens inside each fold)
from sklearn.model_selection import cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1")
print(f"CV F1 (minority): {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["majority", "minority"]))
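For intuition, the interpolation step from the bullet list above can also be sketched directly in NumPy. This is a toy illustration of the formula new = original + λ × (neighbour − original), not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(original, neighbour, rng):
    """One synthetic sample on the segment between two minority samples:
    new = original + lam * (neighbour - original), with lam in [0, 1]."""
    lam = rng.uniform(0.0, 1.0)
    return original + lam * (neighbour - original)

a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])
synthetic = smote_point(a, b, rng)
# The synthetic point always lies between the two originals, coordinate-wise
print(synthetic)
```

Because λ is drawn from [0, 1], every synthetic sample sits on the line segment between two real minority points, which is why SMOTE fills in feature space rather than duplicating rows.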

🧠 Algorithm-Level Approaches

Rather than modifying the data, algorithm-level approaches modify how the model trains or predicts to handle imbalance. These are often simpler to implement and less risky than resampling.

Class Weights

Tell the model to penalise misclassifying the minority class more heavily. class_weight='balanced' automatically sets weights inversely proportional to class frequencies.

  • Works for: LogisticRegression, SVM, RandomForest, GradientBoosting, DecisionTree in sklearn
  • Approximately equivalent to random oversampling for many loss-based models
  • No synthetic data created β€” no distributional shift
  • Usually the first thing to try before resampling
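As a sanity check on what 'balanced' actually computes, sklearn's heuristic is n_samples / (n_classes × count(class)). A quick sketch with made-up counts (a 95:5 split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95:5 imbalance

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
manual = len(y) / (2 * np.bincount(y))
auto = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(manual)  # minority class gets ~19x the weight of the majority
print(auto)
```

The minority class's per-sample loss is scaled up in exact proportion to its rarity, which is why this often behaves like random oversampling without creating any synthetic rows.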

Threshold Tuning

By default, classifiers predict the positive class when P(class=1) > 0.5. For imbalanced data, lowering this threshold (e.g., to 0.2) increases recall at the cost of precision.

  • Use the ROC curve to pick a threshold trading off true-positive rate against false-positive rate
  • Use PR curve for severe imbalance β€” it's more informative than ROC in this regime
  • Always tune threshold on a validation set, not the test set
  • Business context drives the threshold: in fraud detection, recall (catching fraud) matters more than precision (false alarms)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

X, y = make_classification(weights=[0.95, 0.05], n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# ---- class_weight='balanced' ----
lr_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
lr_balanced.fit(X_train, y_train)
print("Balanced LR:\n", classification_report(y_test, lr_balanced.predict(X_test)))

# ---- Custom class weights (explicit ratio) ----
lr_custom = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
lr_custom.fit(X_train, y_train)

# ---- Threshold tuning (grid search over candidate thresholds) ----
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)
y_proba = rf.predict_proba(X_test)[:, 1]

# Find threshold that maximises F1 for minority class
from sklearn.metrics import f1_score
thresholds = np.arange(0.1, 0.9, 0.05)
best_thresh, best_f1 = 0.5, 0
for thresh in thresholds:
    y_pred_t = (y_proba >= thresh).astype(int)
    f1 = f1_score(y_test, y_pred_t, pos_label=1)
    if f1 > best_f1:
        best_f1, best_thresh = f1, thresh

print(f"Best threshold: {best_thresh:.2f}, F1 minority: {best_f1:.3f}")
y_pred_tuned = (y_proba >= best_thresh).astype(int)
print(classification_report(y_test, y_pred_tuned))

# ---- BalancedBaggingClassifier (ensemble for imbalance) ----
from imblearn.ensemble import BalancedBaggingClassifier
bbc = BalancedBaggingClassifier(
    estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=10, random_state=42
)
bbc.fit(X_train, y_train)
print("BalancedBagging:\n", classification_report(y_test, bbc.predict(X_test)))

📏 Evaluation Metrics for Imbalanced Data

Choosing the right metric is as important as choosing the right model. Using accuracy on imbalanced data doesn't just give a misleading number β€” it actively hides the problem from you.

  • Accuracy = (TP+TN) / N — % of all predictions correct. Use only for balanced datasets; misleading otherwise
  • Precision = TP / (TP+FP) — of all predicted positives, how many are real? Use when false positives are costly (spam filters, medical alerts)
  • Recall (Sensitivity) = TP / (TP+FN) — of all actual positives, how many did we catch? Use when false negatives are costly (fraud, disease detection, safety systems)
  • F1 Score = 2 × (P×R) / (P+R) — harmonic mean of precision and recall; 0 if either is 0. Use for a balanced precision/recall trade-off under class imbalance
  • Macro F1 = mean of per-class F1 scores — treats all classes equally regardless of frequency. Use when minority-class performance matters as much as majority
  • ROC-AUC = area under the TPR vs FPR curve — probability that the model ranks a random positive above a random negative; 0.5 = random. Use for general ranking quality, but it is optimistic under severe imbalance
  • PR-AUC (Average Precision) = area under the Precision vs Recall curve — summarises the precision/recall trade-off; 1.0 = perfect; baseline = prevalence. Use under severe imbalance; more informative than ROC-AUC when the positive class is rare
  • Matthews Correlation Coefficient (MCC) = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) — ranges [−1, 1]; 0 = random, 1 = perfect; uses all four confusion-matrix cells. The most reliable single metric for imbalanced binary classification

ROC-AUC vs PR-AUC: Which to Use?

ROC-AUC looks at TPR vs FPR. In severely imbalanced settings, FPR = FP/(FP+TN) remains low even with many false positives because TN dominates. This makes ROC curves overly optimistic. PR curves look at precision and recall β€” both focus on the minority class β€” making PR-AUC the preferred metric when you have severe imbalance (<5% positive rate). A random classifier on PR-AUC scores only at the prevalence rate (e.g., 0.01 for 1% positive rate), making differences meaningful.
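The gap described above is easy to demonstrate on synthetic data. A sketch with a 1% positive rate (exact scores depend on the seed and data, so none are hard-coded here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~1% positives: the regime where ROC-AUC tends to look optimistic
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
roc = roc_auc_score(y_te, proba)
pr = average_precision_score(y_te, proba)  # PR-AUC baseline = prevalence (~0.01)
print(f"ROC-AUC={roc:.3f}  PR-AUC={pr:.3f}")
# Same model, same predictions: the two summaries give very different impressions
```

Comparing the PR-AUC against its prevalence baseline (~0.01 here) is a more honest read of minority-class ranking quality than the ROC-AUC alone.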

✅ Practical Checklist

Imbalance Ratio Matters: 1:10 vs 1:1000

Mild imbalance (1:10) β€” class_weight='balanced' alone often suffices. Moderate imbalance (1:100) β€” SMOTE or threshold tuning recommended. Severe imbalance (1:1000+) β€” consider anomaly detection approaches (Isolation Forest, One-Class SVM, autoencoders) rather than standard classification; collect more minority data if at all possible.
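For the severe-imbalance regime, a minimal anomaly-detection sketch with Isolation Forest on synthetic data; `contamination` is set to the assumed positive rate, which you would estimate from your own data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# ~1:1000 imbalance: treat the minority class as anomalies, ignore labels at fit time
X, y = make_classification(n_samples=20000, weights=[0.999, 0.001], random_state=0)

iso = IsolationForest(contamination=0.001, random_state=0).fit(X)
pred = iso.predict(X)        # +1 = inlier, -1 = anomaly
flagged = pred == -1
print(f"flagged {flagged.sum()} of {len(X)} samples as candidate anomalies")
```

Note that this is unsupervised: it flags statistically unusual points rather than the labeled minority class specifically, so validate the flagged set against whatever labels you do have.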

Before Training

  • Measure the exact imbalance ratio: y.value_counts(normalize=True)
  • Consider collecting more minority class data first β€” no technique beats real data
  • Use stratified train/test/validation splits β€” sklearn's stratify=y parameter
  • Never resample before the train/test split β€” this causes data leakage
  • Establish a strong baseline: majority class classifier and its metrics
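The baseline from the last bullet takes three lines with sklearn's `DummyClassifier` (illustrative synthetic data; substitute your own X, y):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=42)

# Majority-class baseline: any real model must beat these numbers
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)
acc = accuracy_score(y, y_pred)
f1_min = f1_score(y, y_pred, zero_division=0)
print(f"baseline accuracy={acc:.3f}, minority F1={f1_min:.3f}")  # F1 is 0.0
```

Recording the baseline's full metric set up front makes it obvious later whether a trained model actually learned the minority class or just reproduced the prior.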

During Training

  • Use StratifiedKFold for cross-validation β€” preserves class ratios in every fold
  • Apply resampling inside each cross-validation fold (use imbalanced-learn Pipeline)
  • Start with class_weight='balanced' β€” simplest fix with no data modification
  • If resampling, prefer SMOTETomek or SMOTEENN over plain SMOTE
  • Report PR-AUC, F1 (macro), and MCC β€” not just accuracy

After Training

  • Tune classification threshold on validation set using PR curve
  • Consider the business cost of FP vs FN β€” set threshold accordingly
  • Monitor distribution shift in production β€” real-world imbalance ratio may differ from training
  • Alert when minority class prediction rate drops unexpectedly
  • Re-train periodically with fresh data as minority examples accumulate