⚖ Why Scale Matters
Imagine a dataset with two features: age (range 0–100) and annual salary (range $20,000–$500,000). Without scaling, salary dominates any distance calculation by a factor of 5,000. Your model doesn't know one feature is "more important" — it just sees numbers, and big numbers win.
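The imbalance is easy to see with a hand-computed Euclidean distance. A minimal sketch, using made-up values: two people who differ by 40 years of age but only $1,000 in salary:

```python
import numpy as np

# Hypothetical samples: [age, salary]
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 51_000.0])

dist = np.linalg.norm(a - b)          # sqrt(40^2 + 1000^2)
print(round(dist, 1))                  # ~1000.8

# The 40-year age gap contributes 1600 to the squared distance;
# the modest $1,000 salary gap contributes 1,000,000 — salary wins.
```

The distance is essentially just the salary difference; the large age gap is invisible.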
Models Broken by Unscaled Features
- K-Nearest Neighbours (KNN) — Euclidean distance is dominated by high-magnitude features; scaling is mandatory
- Support Vector Machines (SVM) — kernel trick and margin maximisation both depend on distances; performance collapses without scaling
- K-Means Clustering — cluster assignment via centroid distance; unscaled features make some dimensions irrelevant
- Principal Component Analysis (PCA) — variance-driven; high-magnitude features dominate first PCs; standardise before PCA
- Neural Networks — gradient descent converges much faster when inputs are on similar scales; saturating activation functions (sigmoid, tanh) require inputs near zero
- Regularised regression (Lasso, Ridge) — the regularisation penalty treats all coefficients equally; unscaled features receive disproportionate penalties
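A minimal sketch of the damage, using a synthetic dataset with one deliberately huge-scale noise column appended (all values and parameters here are illustrative, not from the original text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Informative features on a unit scale, plus one huge-scale noise column
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(scale=10_000, size=(400, 1))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unscaled: distances are dominated by the noise column -> near chance level
raw_acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: all five features contribute comparably
sc = StandardScaler().fit(X_tr)
std_acc = KNeighborsClassifier().fit(sc.transform(X_tr), y_tr).score(
    sc.transform(X_te), y_te)

print(f"unscaled: {raw_acc:.2f}, scaled: {std_acc:.2f}")
```

On this data the unscaled KNN hovers around coin-flip accuracy while the scaled version recovers the signal.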
Models Scale-Invariant (Scaling Optional)
These algorithms are insensitive to feature scale because their splitting or prediction mechanisms don't use distances or gradient updates on raw values:
- Decision Trees — splits based on thresholds; scaling doesn't change relative ordering
- Random Forest — ensemble of trees; inherits tree scale-invariance
- Gradient Boosting (XGBoost, LightGBM, CatBoost) — tree-based; scale invariant
- Naive Bayes — feature distributions modelled independently; scale doesn't affect likelihood ratios
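The invariance is easy to verify directly: fitting the same decision tree on raw and standardised copies of synthetic data (made up for illustration) produces identical predictions, because scaling is monotone and never changes which side of a split a sample falls on:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [1, 1000, 1e6]   # wildly different scales
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Thresholds shift under scaling, but the induced partitions are the same
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())
```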
Still Scale Anyway?
Even for tree models, scaling doesn't hurt predictive performance — splits depend only on the ordering of values, which scaling preserves. The practical benefit is a uniform preprocessing pipeline: you can swap in a scale-sensitive model later without reworking your features. When unsure, scale; the main cost is that coefficients and thresholds are no longer in the original units.
📐 Normalisation (Min-Max Scaling)
Min-max normalisation maps each feature to a fixed range, typically [0, 1] or [-1, 1], via x' = (x − min) / (max − min). It preserves the original distribution's shape exactly — it's a linear transformation that just rescales the axis.
When to Use Min-Max
- When the algorithm requires input in a bounded range (neural networks with sigmoid output, image pixel values [0,1])
- When you want to preserve zero values (no shift, only scale — unlike standardisation)
- When the distribution is not Gaussian and you need bounded output
- When comparing features where the [0,1] range has intuitive meaning (percentages, probabilities)
Weaknesses of Min-Max
- Outlier sensitivity — a single extreme value (age = 999) will compress all other values near zero
- Test data can fall outside [0,1] — if test data has values outside the training min/max, scaled values exceed [0,1]; don't re-fit on test!
- Doesn't centre the data — mean may not be near 0.5 after scaling if distribution is skewed
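The outlier problem is easy to demonstrate with hypothetical data — a single bad age value squashes every legitimate value into a sliver of the range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[22], [31], [45], [58], [999]])   # 999 = data-entry error
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())
# The four real ages land below 0.04; the outlier alone sits at 1.0
```

Catching and fixing such outliers before scaling (or switching to RobustScaler) avoids this collapse.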
```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data: age (0-100) and income ($20K-$500K)
X_train = np.array([[25, 45000], [35, 120000], [45, 80000], [55, 200000]])
X_test = np.array([[30, 95000], [60, 150000]])

# ---- Fit ONLY on training data ----
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform!

print("Scaled training data:")
print(X_train_scaled)
# age 25 -> 0.0, age 55 -> 1.0; income 45000 -> 0.0, income 200000 -> 1.0

# ---- Custom range [-1, 1] ----
scaler_neg = MinMaxScaler(feature_range=(-1, 1))
X_train_neg = scaler_neg.fit_transform(X_train)

# ---- Inverse transform (recover original values) ----
X_original = scaler.inverse_transform(X_train_scaled)

# ---- Check for values outside [0,1] in test set ----
out_of_range = ((X_test_scaled < 0) | (X_test_scaled > 1)).any()
if out_of_range:
    print("Warning: test data contains values outside training range!")
```
📈 Standardisation (Z-score)
Standardisation rescales features to have mean 0 and standard deviation 1. Unlike normalisation, it doesn't bound the output — a standardised feature can range from -3 to +3 (or further for outliers). This is the most widely used scaling technique in ML practice.
z = (x − μ) / σ, where μ = mean of feature, σ = standard deviation of feature
When to Prefer Standardisation
- Default choice for most ML algorithms — works well across a wide range of models
- When the distribution is approximately Gaussian (or you're not sure)
- For regularised linear models (Lasso, Ridge, ElasticNet) where coefficients are penalised equally
- For PCA — ensures each feature contributes equally based on information content, not scale
- When outliers are present but you don't want to remove them — z-score is less distorted than min-max by outliers
- For gradient-based optimisation (neural nets) — centred inputs (mean ≈ 0) help with gradient flow through activation functions
Standardisation vs. Normalisation — Quick Guide
- Bounded output needed → Min-Max
- Gaussian or unknown distribution → Standardisation
- Outliers present → Standardisation (or RobustScaler)
- Neural network pixel inputs → Min-Max to [0,1]
- PCA preprocessing → Standardisation
- Sparse data (many zeros) → MaxAbsScaler (preserves sparsity)
- Heavily skewed features → PowerTransformer then StandardScaler
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = [[25, 45000], [35, 120000], [45, 80000], [55, 200000]]
X_test = [[30, 95000], [60, 150000]]

# ---- Fit on train, transform both ----
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # use TRAIN mean and std

print("Means (should be ~0):", X_train_std.mean(axis=0))
print("Stds (should be ~1):", X_train_std.std(axis=0))

# ---- Inspect scaler statistics ----
print("Feature means:", scaler.mean_)
print("Feature stds:", scaler.scale_)

# ---- Inverse transform ----
X_recovered = scaler.inverse_transform(X_train_std)

# ---- Standardise a single new sample at inference time ----
new_sample = np.array([[40, 100000]])
new_sample_std = scaler.transform(new_sample)  # uses fitted scaler
```
🛡 Robust & Other Scalers
The standard scaler family assumes relatively clean, well-behaved distributions. In practice, features may be heavily skewed, contain persistent outliers, or have special structure (sparsity). These specialised scalers handle those cases.
| Scaler | Formula / Mechanism | Outlier Robust | Best For |
|---|---|---|---|
| StandardScaler | (x − mean) / std | No — outliers distort mean and std | General purpose; near-Gaussian distributions |
| MinMaxScaler | (x − min) / (max − min) | No — one outlier can ruin the range | Bounded outputs needed; image pixels |
| RobustScaler | (x − median) / IQR | Yes — uses median and IQR, ignoring tails | Data with outliers that can't be removed |
| MaxAbsScaler | x / max(|x|) | Partially — not centred; sensitive to max outlier | Sparse data (preserves zeros); already centred data |
| PowerTransformer (Box-Cox) | Optimises λ in x^λ transformation to normalise | No — transform first, then scale | Strictly positive, right-skewed features (income, price) |
| PowerTransformer (Yeo-Johnson) | Extends Box-Cox to handle zero and negative values | No | Skewed features with zeros or negatives |
| QuantileTransformer | Maps to uniform or Gaussian distribution via quantile mapping | Yes — outliers are collapsed to tails | Arbitrary distributions; making features strictly Gaussian |
```python
from sklearn.preprocessing import (
    RobustScaler, MaxAbsScaler, PowerTransformer, QuantileTransformer
)
import numpy as np

# Skewed income data with outliers
X = np.array([[500], [1000], [2000], [3000], [5000], [100000]])  # last = outlier

# ---- RobustScaler — resistant to the $100K outlier ----
rob = RobustScaler()
print("RobustScaler:", rob.fit_transform(X).T)
# Uses the median (2500) and IQR; the outlier maps to a large value
# but doesn't distort the scaling of the other points

# ---- PowerTransformer (Yeo-Johnson) — makes skewed dist more Gaussian ----
pt = PowerTransformer(method="yeo-johnson", standardize=True)
print("PowerTransformer:", pt.fit_transform(X).T)

# ---- QuantileTransformer — forces uniform or Gaussian output ----
# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(output_distribution="normal", n_quantiles=6,
                         random_state=42)
print("QuantileTransformer:", qt.fit_transform(X).T)

# ---- MaxAbsScaler — good for sparse matrices ----
from scipy.sparse import csr_matrix
X_sparse = csr_matrix([[0, 1, 0], [3, 0, 2], [0, 0, 5]])
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)
# Zeros remain zero — sparsity preserved
```
🧭 Practical Guidelines
The Cardinal Rule: Fit on Train Only
Always compute scaling statistics (mean, std, min, max, quantiles) using only the training set, then apply the same transformation to validation and test sets. Fitting on the full dataset before the train/test split leaks information from the future into your model — a subtle form of data leakage that inflates metrics and causes production failures. Call fit_transform(X_train) and transform(X_test) — never fit_transform(X_test).
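A small sketch of the difference on synthetic data: fitting the scaler on the full dataset bakes test-set statistics into the transform, so every "training" value is already contaminated by information it should never have seen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 1))
X_train, X_test = train_test_split(X, random_state=0)

# WRONG: statistics computed on the full dataset leak test-set information
leaky = StandardScaler().fit(X)

# RIGHT: fit on train only, apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_test_std = scaler.transform(X_test)

print("leaky mean:", leaky.mean_, "train-only mean:", scaler.mean_)
# The two means differ, so every scaled value differs too
```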
Scaling Targets in Regression
- Scaling the target variable (y) in regression can help gradient-based models; tree models don't need it
- Use a separate scaler for y — don't include it in the feature scaler
- Remember to inverse-transform predictions before computing final metrics in the original scale
- Log transformation of skewed targets (house prices, population) often works better than standard scaling
- TransformedTargetRegressor in sklearn wraps this cleanly
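A sketch of that pattern on a synthetic log-linear target (the data is made up for illustration): TransformedTargetRegressor trains on the transformed target and automatically inverse-transforms predictions back to the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 1))
y = np.exp(0.5 * X.ravel() + rng.normal(scale=0.1, size=200))  # skewed target

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,            # fit on log(y)
    inverse_func=np.exp,    # predictions come back in original units
)
model.fit(X, y)
preds = model.predict(X[:3])  # already inverse-transformed — no manual step
```

Keeping the inverse transform inside the estimator means metrics computed on `model.predict(...)` are automatically in the original scale.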
Per-Feature Scaling Strategy
- Different feature types may need different scalers — use ColumnTransformer
- Binary / one-hot encoded features — don't scale; already in [0,1]
- Count features — log1p transform then standard scale
- Percentage features [0,100] — divide by 100 or min-max to [0,1]
- Embedding vectors from neural nets — often L2-normalise per sample rather than scale per feature
- Cyclical features (hour of day, day of week) — sin/cos encoding, not scaling
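The last two bullets can be sketched in a few lines of NumPy (example values are arbitrary):

```python
import numpy as np

# Cyclical feature: hour of day — sin/cos keeps 23:00 and 00:00 close,
# unlike the raw values 23 and 0, which look maximally far apart
hours = np.array([0, 6, 12, 18, 23])
angle = 2 * np.pi * hours / 24
hour_sin, hour_cos = np.sin(angle), np.cos(angle)

# Per-sample L2 normalisation for embedding vectors: each ROW gets unit
# length, rather than scaling each feature column independently
emb = np.array([[3.0, 4.0], [1.0, 0.0]])
emb_l2 = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(emb_l2)  # every row now has norm 1
```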
```python
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PowerTransformer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample dataset
data = pd.DataFrame({
    "age": [25, 35, 45, 55, 28, 38],
    "income": [45000, 120000, 80000, 200000, 55000, 95000],
    "score": [0.2, 0.8, 0.6, 0.9, 0.4, 0.7],  # already [0,1]
    "country": ["US", "UK", "US", "DE", "UK", "US"],
    "target": [0, 1, 0, 1, 0, 1]
})
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ---- ColumnTransformer: different scalers per feature type ----
numeric_normal = ["age", "score"]   # approx Gaussian -> StandardScaler
numeric_skewed = ["income"]         # right-skewed -> PowerTransformer
categorical = ["country"]           # categorical -> OneHotEncoder

preprocessor = ColumnTransformer(transformers=[
    ("num_std", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler())
    ]), numeric_normal),
    ("num_skew", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("power", PowerTransformer(method="yeo-johnson")),
    ]), numeric_skewed),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ]), categorical),
], remainder="drop")

# ---- Full pipeline: preprocessing + model ----
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
# Scaler is fit only on X_train — transform is applied to X_test internally

# Save the entire pipeline to preserve scaling stats for production
joblib.dump(pipeline, "trained_pipeline.pkl")
```