⏱ 7 min read πŸ“Š Beginner πŸ—“ Updated Jan 2025

⚠ The Imbalance Problem

Class imbalance occurs when the distribution of target labels is skewed — one class vastly outnumbers another. This is not an edge case: it is the default condition in almost every real-world classification problem of practical interest. Fraud, disease, intrusion, defect — the things we most want to detect are almost always rare.

The Accuracy Trap

In a fraud detection dataset where 99.9% of transactions are legitimate, a model that always predicts "not fraud" achieves 99.9% accuracy. On paper, that looks near-perfect. It also catches exactly zero fraudsters. Accuracy is a useless metric for imbalanced problems, and this is not a subtle issue: teams that don't notice it often ship worthless models straight to production.

Real-World Imbalance Ratios

  • Credit card fraud β€” typically 0.1–0.5% fraud; ratio ~1:200 to 1:1000
  • Medical diagnosis (rare disease) β€” prevalence 1:10,000 or worse in population screening
  • Network intrusion detection β€” attack traffic often <1% of total packets
  • Manufacturing defect detection β€” defect rate often 0.1–5% of production runs
  • Churn prediction β€” typically 5–15% churn; moderate imbalance
  • Spam detection β€” 20–40% spam in most systems; mild imbalance, well-studied

The Confusion Matrix Tells the Truth

For imbalanced classification, always look at the full confusion matrix, not just accuracy. The tell-tale sign of a model that learned nothing:

  • True Positive (TP) β€” minority class correctly identified
  • True Negative (TN) β€” majority class correctly identified (very high for naΓ―ve classifier)
  • False Positive (FP) β€” majority predicted as minority (Type I error)
  • False Negative (FN) β€” minority predicted as majority (Type II error β€” usually catastrophic)
  • A model that always predicts majority: TP=0, FN=all minority examples β€” 100% miss rate on the thing you care about
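To make the last bullet concrete, here is a minimal sketch, using made-up label counts (990 legitimate vs 10 fraud), of what the majority-only classifier's confusion matrix looks like:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical 1% fraud rate: 990 legitimate (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)  # naive model: always predict "not fraud"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)
print(f"accuracy={acc:.3f}")   # 0.990 -- looks great
print(f"TP={tp}, FN={fn}")     # TP=0, FN=10 -- misses every fraud case
```

The accuracy alone is indistinguishable from a genuinely good model; only the TP/FN cells reveal that nothing was learned.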

🔄 Resampling Techniques

Resampling modifies the training data distribution to balance the classes. The goal is to give the model equal opportunity to learn patterns from both minority and majority classes during training β€” while evaluating on the true (imbalanced) distribution.

  • Random Oversampling (oversample) — duplicates minority examples at random. Use for: a quick baseline; mild imbalance; small datasets
  • SMOTE (oversample) — creates synthetic minority examples by interpolating between nearest neighbours. Use for: tabular data; moderate imbalance (1:10 to 1:100); the default choice
  • ADASYN (oversample) — adaptive SMOTE that generates more samples near the decision boundary (harder examples). Use for: when SMOTE under-performs; complex decision boundaries
  • Random Undersampling (undersample) — randomly removes majority examples. Use for: very large datasets where training cost is the bottleneck and information loss is acceptable
  • Tomek Links (undersample) — removes majority-class examples that are nearest neighbours of minority examples (border cleaning). Use for: cleaning the boundary rather than heavy undersampling; combine with SMOTE
  • NearMiss (undersample) — keeps only the majority examples closest to minority examples (versions 1/2/3). Use for: when dataset size prevents oversampling; tends to be aggressive, so test carefully
  • SMOTETomek / SMOTEENN (combined) — SMOTE oversampling plus Tomek/ENN cleaning of noisy samples. Use for: the best general-purpose combination for tabular imbalanced data

How SMOTE Works

SMOTE (Synthetic Minority Over-sampling TEchnique) generates new minority-class samples by:

  • For each minority sample, find its k nearest neighbours (default k=5) within the minority class
  • Randomly select one neighbour
  • Create a new sample on the line segment between the original and the selected neighbour: new = original + Ξ» Γ— (neighbour βˆ’ original) where Ξ» ∈ [0,1]
  • Repeat until the desired ratio is reached
  • Result: more diverse minority examples that fill in the feature space, rather than just duplicates
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create severely imbalanced dataset (1:50 ratio)
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    weights=[0.98, 0.02], random_state=42
)
print(f"Class distribution: {np.bincount(y)}")  # ~4900 vs ~100

# CRITICAL: Split BEFORE resampling (never resample before the split!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify preserves ratio
)

# ---- SMOTE ----
smote = SMOTE(k_neighbors=5, random_state=42, sampling_strategy="minority")
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {np.bincount(y_train_sm)}")  # balanced

# ---- ADASYN (adaptive) ----
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_train_ad, y_train_ad = adasyn.fit_resample(X_train, y_train)

# ---- SMOTETomek (oversample + clean boundary) ----
smote_tomek = SMOTETomek(random_state=42)
X_train_st, y_train_st = smote_tomek.fit_resample(X_train, y_train)

# ---- Best practice: use imbalanced-learn Pipeline ----
# Resampling ONLY happens during fit, NOT during transform/predict
pipeline = ImbPipeline([
    ("oversample", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validation with stratified folds (resampling happens inside each fold)
from sklearn.model_selection import cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1")
print(f"CV F1 (minority): {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["majority", "minority"]))
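For intuition, the interpolation step from the bullet list above can also be sketched directly in NumPy. This is a toy illustration of the formula new = original + λ × (neighbour − original), not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(original, neighbour, rng):
    """One synthetic sample on the segment between two minority samples:
    new = original + lam * (neighbour - original), with lam in [0, 1]."""
    lam = rng.uniform(0.0, 1.0)
    return original + lam * (neighbour - original)

a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])
synthetic = smote_point(a, b, rng)
# The synthetic point always lies between the two originals, coordinate-wise
print(synthetic)
```

Because λ is drawn from [0, 1], every synthetic sample sits on the line segment between two real minority points, which is why SMOTE fills in feature space rather than duplicating rows.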

🧠 Algorithm-Level Approaches

Rather than modifying the data, algorithm-level approaches modify how the model trains or predicts to handle imbalance. These are often simpler to implement and less risky than resampling.

Class Weights

Tell the model to penalise misclassifying the minority class more heavily. class_weight='balanced' automatically sets weights inversely proportional to class frequencies.

  • Works for: LogisticRegression, SVM, RandomForest, GradientBoosting, DecisionTree in sklearn
  • Approximately equivalent to random oversampling for many loss-based models
  • No synthetic data created β€” no distributional shift
  • Usually the first thing to try before resampling
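As a sanity check on what 'balanced' actually computes, sklearn's heuristic is n_samples / (n_classes × count(class)). A quick sketch with made-up counts (a 95:5 split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95:5 imbalance

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
manual = len(y) / (2 * np.bincount(y))
auto = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(manual)  # minority class gets ~19x the weight of the majority
print(auto)
```

The minority class's per-sample loss is scaled up in exact proportion to its rarity, which is why this often behaves like random oversampling without creating any synthetic rows.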

Threshold Tuning

By default, classifiers predict the positive class when P(class=1) > 0.5. For imbalanced data, lowering this threshold (e.g., to 0.2) increases recall at the cost of precision.

  • Use the ROC curve to pick a threshold trading off true-positive rate against false-positive rate
  • Use PR curve for severe imbalance β€” it's more informative than ROC in this regime
  • Always tune threshold on a validation set, not the test set
  • Business context drives the threshold: in fraud detection, recall (catching fraud) matters more than precision (false alarms)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

X, y = make_classification(weights=[0.95, 0.05], n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# ---- class_weight='balanced' ----
lr_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
lr_balanced.fit(X_train, y_train)
print("Balanced LR:\n", classification_report(y_test, lr_balanced.predict(X_test)))

# ---- Custom class weights (explicit ratio) ----
lr_custom = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
lr_custom.fit(X_train, y_train)

# ---- Threshold tuning (grid search over candidate thresholds) ----
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)
y_proba = rf.predict_proba(X_test)[:, 1]

# Find threshold that maximises F1 for minority class
from sklearn.metrics import f1_score
thresholds = np.arange(0.1, 0.9, 0.05)
best_thresh, best_f1 = 0.5, 0
for thresh in thresholds:
    y_pred_t = (y_proba >= thresh).astype(int)
    f1 = f1_score(y_test, y_pred_t, pos_label=1)
    if f1 > best_f1:
        best_f1, best_thresh = f1, thresh

print(f"Best threshold: {best_thresh:.2f}, F1 minority: {best_f1:.3f}")
y_pred_tuned = (y_proba >= best_thresh).astype(int)
print(classification_report(y_test, y_pred_tuned))

# ---- BalancedBaggingClassifier (ensemble for imbalance) ----
from imblearn.ensemble import BalancedBaggingClassifier
bbc = BalancedBaggingClassifier(
    estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=10, random_state=42
)
bbc.fit(X_train, y_train)
print("BalancedBagging:\n", classification_report(y_test, bbc.predict(X_test)))

📏 Evaluation Metrics for Imbalanced Data

Choosing the right metric is as important as choosing the right model. Using accuracy on imbalanced data doesn't just give a misleading number β€” it actively hides the problem from you.

  • Accuracy = (TP+TN) / N — % of all predictions correct. Use only for balanced datasets; misleading otherwise
  • Precision = TP / (TP+FP) — of all predicted positives, how many are real? Use when false positives are costly (spam filters, medical alerts)
  • Recall (Sensitivity) = TP / (TP+FN) — of all actual positives, how many did we catch? Use when false negatives are costly (fraud, disease detection, safety systems)
  • F1 Score = 2 × (P×R) / (P+R) — harmonic mean of precision and recall; 0 if either is 0. Use for a balanced precision/recall trade-off under class imbalance
  • Macro F1 = mean of per-class F1 scores — treats all classes equally regardless of frequency. Use when minority-class performance matters as much as majority
  • ROC-AUC = area under the TPR vs FPR curve — probability that the model ranks a random positive above a random negative; 0.5 = random. Use for general ranking quality, but it is optimistic under severe imbalance
  • PR-AUC (Average Precision) = area under the Precision vs Recall curve — summarises the precision/recall trade-off; 1.0 = perfect; baseline = prevalence. Use under severe imbalance; more informative than ROC-AUC when the positive class is rare
  • Matthews Correlation Coefficient (MCC) = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) — ranges [−1, 1]; 0 = random, 1 = perfect; uses all four confusion-matrix cells. The most reliable single metric for imbalanced binary classification

ROC-AUC vs PR-AUC: Which to Use?

ROC-AUC looks at TPR vs FPR. In severely imbalanced settings, FPR = FP/(FP+TN) remains low even with many false positives because TN dominates. This makes ROC curves overly optimistic. PR curves look at precision and recall β€” both focus on the minority class β€” making PR-AUC the preferred metric when you have severe imbalance (<5% positive rate). A random classifier on PR-AUC scores only at the prevalence rate (e.g., 0.01 for 1% positive rate), making differences meaningful.
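The gap described above is easy to demonstrate on synthetic data. A sketch with a 1% positive rate (exact scores depend on the seed and data, so none are hard-coded here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~1% positives: the regime where ROC-AUC tends to look optimistic
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
roc = roc_auc_score(y_te, proba)
pr = average_precision_score(y_te, proba)  # PR-AUC baseline = prevalence (~0.01)
print(f"ROC-AUC={roc:.3f}  PR-AUC={pr:.3f}")
# Same model, same predictions: the two summaries give very different impressions
```

Comparing the PR-AUC against its prevalence baseline (~0.01 here) is a more honest read of minority-class ranking quality than the ROC-AUC alone.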

✅ Practical Checklist

Imbalance Ratio Matters: 1:10 vs 1:1000

Mild imbalance (1:10) β€” class_weight='balanced' alone often suffices. Moderate imbalance (1:100) β€” SMOTE or threshold tuning recommended. Severe imbalance (1:1000+) β€” consider anomaly detection approaches (Isolation Forest, One-Class SVM, autoencoders) rather than standard classification; collect more minority data if at all possible.
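For the severe-imbalance regime, a minimal anomaly-detection sketch with Isolation Forest on synthetic data; `contamination` is set to the assumed positive rate, which you would estimate from your own data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# ~1:1000 imbalance: treat the minority class as anomalies, ignore labels at fit time
X, y = make_classification(n_samples=20000, weights=[0.999, 0.001], random_state=0)

iso = IsolationForest(contamination=0.001, random_state=0).fit(X)
pred = iso.predict(X)        # +1 = inlier, -1 = anomaly
flagged = pred == -1
print(f"flagged {flagged.sum()} of {len(X)} samples as candidate anomalies")
```

Note that this is unsupervised: it flags statistically unusual points rather than the labeled minority class specifically, so validate the flagged set against whatever labels you do have.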

Before Training

  • Measure the exact imbalance ratio: y.value_counts(normalize=True)
  • Consider collecting more minority class data first β€” no technique beats real data
  • Use stratified train/test/validation splits β€” sklearn's stratify=y parameter
  • Never resample before the train/test split β€” this causes data leakage
  • Establish a strong baseline: majority class classifier and its metrics
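The baseline from the last bullet takes three lines with sklearn's `DummyClassifier` (illustrative synthetic data; substitute your own X, y):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=42)

# Majority-class baseline: any real model must beat these numbers
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)
acc = accuracy_score(y, y_pred)
f1_min = f1_score(y, y_pred, zero_division=0)
print(f"baseline accuracy={acc:.3f}, minority F1={f1_min:.3f}")  # F1 is 0.0
```

Recording the baseline's full metric set up front makes it obvious later whether a trained model actually learned the minority class or just reproduced the prior.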

During Training

  • Use StratifiedKFold for cross-validation β€” preserves class ratios in every fold
  • Apply resampling inside each cross-validation fold (use imbalanced-learn Pipeline)
  • Start with class_weight='balanced' β€” simplest fix with no data modification
  • If resampling, prefer SMOTETomek or SMOTEENN over plain SMOTE
  • Report PR-AUC, F1 (macro), and MCC β€” not just accuracy

After Training

  • Tune classification threshold on validation set using PR curve
  • Consider the business cost of FP vs FN β€” set threshold accordingly
  • Monitor distribution shift in production β€” real-world imbalance ratio may differ from training
  • Alert when minority class prediction rate drops unexpectedly
  • Re-train periodically with fresh data as minority examples accumulate