⚖ Why Scale Matters
Imagine a dataset with two features: age (range 0–100) and annual salary (range $20,000–$500,000). Without scaling, salary dominates any distance calculation by a factor of 5,000. Your model doesn't know one feature is "more important" — it just sees numbers, and big numbers win.
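The imbalance is easy to see with a hand-computed Euclidean distance. A minimal sketch, using made-up values: two people who differ by 40 years of age but only $1,000 in salary:

```python
import numpy as np

# Hypothetical samples: [age, salary]
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 51_000.0])

dist = np.linalg.norm(a - b)          # sqrt(40^2 + 1000^2)
print(round(dist, 1))                  # ~1000.8

# The 40-year age gap contributes 1600 to the squared distance;
# the modest $1,000 salary gap contributes 1,000,000 — salary wins.
```

The distance is essentially just the salary difference; the large age gap is invisible.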
Models Broken by Unscaled Features
- K-Nearest Neighbours (KNN) — Euclidean distance is dominated by high-magnitude features; scaling is mandatory
- Support Vector Machines (SVM) — kernel trick and margin maximisation both depend on distances; performance collapses without scaling
- K-Means Clustering — cluster assignment via centroid distance; unscaled features make some dimensions irrelevant
- Principal Component Analysis (PCA) — variance-driven; high-magnitude features dominate first PCs; standardise before PCA
- Neural Networks — gradient descent converges much faster when inputs are on similar scales; saturating activation functions (sigmoid, tanh) require inputs near zero
- Regularised regression (Lasso, Ridge) — the regularisation penalty treats all coefficients equally; unscaled features receive disproportionate penalties
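A minimal sketch of the damage, using a synthetic dataset with one deliberately huge-scale noise column appended (all values and parameters here are illustrative, not from the original text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Informative features on a unit scale, plus one huge-scale noise column
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(scale=10_000, size=(400, 1))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unscaled: distances are dominated by the noise column -> near chance level
raw_acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: all five features contribute comparably
sc = StandardScaler().fit(X_tr)
std_acc = KNeighborsClassifier().fit(sc.transform(X_tr), y_tr).score(
    sc.transform(X_te), y_te)

print(f"unscaled: {raw_acc:.2f}, scaled: {std_acc:.2f}")
```

On this data the unscaled KNN hovers around coin-flip accuracy while the scaled version recovers the signal.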
Models Scale-Invariant (Scaling Optional)
These algorithms are insensitive to feature scale because their splitting or prediction mechanisms don't use distances or gradient updates on raw values:
- Decision Trees — splits based on thresholds; scaling doesn't change relative ordering
- Random Forest — ensemble of trees; inherits tree scale-invariance
- Gradient Boosting (XGBoost, LightGBM, CatBoost) — tree-based; scale invariant
- Naive Bayes — feature distributions modelled independently; scale doesn't affect likelihood ratios
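The invariance is easy to verify directly: fitting the same decision tree on raw and standardised copies of synthetic data (made up for illustration) produces identical predictions, because scaling is monotone and never changes which side of a split a sample falls on:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [1, 1000, 1e6]   # wildly different scales
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Thresholds shift under scaling, but the induced partitions are the same
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())
```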
Still Scale Anyway?
Even for tree models, scaling doesn't hurt predictive performance — splits depend only on the ordering of values, which scaling preserves. The practical benefit is a uniform preprocessing pipeline: you can swap in a scale-sensitive model later without reworking your features. When unsure, scale; the main cost is that coefficients and thresholds are no longer in the original units.
📐 Normalisation (Min-Max Scaling)
Min-max normalisation maps each feature to a fixed range, typically [0, 1] or [-1, 1], via x' = (x − min) / (max − min). It preserves the original distribution's shape exactly — it's a linear transformation that just rescales the axis.
When to Use Min-Max
- When the algorithm requires input in a bounded range (neural networks with sigmoid output, image pixel values [0,1])
- When you want to preserve zero values (no shift, only scale — unlike standardisation)
- When the distribution is not Gaussian and you need bounded output
- When comparing features where the [0,1] range has intuitive meaning (percentages, probabilities)
Weaknesses of Min-Max
- Outlier sensitivity — a single extreme value (age = 999) will compress all other values near zero
- Test data can fall outside [0,1] — if test data has values outside the training min/max, scaled values exceed [0,1]; don't re-fit on test!
- Doesn't centre the data — mean may not be near 0.5 after scaling if distribution is skewed
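The outlier problem is easy to demonstrate with hypothetical data — a single bad age value squashes every legitimate value into a sliver of the range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[22], [31], [45], [58], [999]])   # 999 = data-entry error
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())
# The four real ages land below 0.04; the outlier alone sits at 1.0
```

Catching and fixing such outliers before scaling (or switching to RobustScaler) avoids this collapse.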
```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data: age (0-100) and income ($20K-$500K)
X_train = np.array([[25, 45000], [35, 120000], [45, 80000], [55, 200000]])
X_test = np.array([[30, 95000], [60, 150000]])

# ---- Fit ONLY on training data ----
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform!

print("Scaled training data:")
print(X_train_scaled)
# age 25 -> 0.0, age 55 -> 1.0; income 45000 -> 0.0, income 200000 -> 1.0

# ---- Custom range [-1, 1] ----
scaler_neg = MinMaxScaler(feature_range=(-1, 1))
X_train_neg = scaler_neg.fit_transform(X_train)

# ---- Inverse transform (recover original values) ----
X_original = scaler.inverse_transform(X_train_scaled)

# ---- Check for values outside [0,1] in test set ----
out_of_range = ((X_test_scaled < 0) | (X_test_scaled > 1)).any()
if out_of_range:
    print("Warning: test data contains values outside training range!")
```
📈 Standardisation (Z-score)
Standardisation rescales features to have mean 0 and standard deviation 1. Unlike normalisation, it doesn't bound the output — a standardised feature can range from -3 to +3 (or further for outliers). This is the most widely used scaling technique in ML practice.
z = (x − μ) / σ, where μ = mean of feature, σ = standard deviation of feature
When to Prefer Standardisation
- Default choice for most ML algorithms — works well across a wide range of models
- When the distribution is approximately Gaussian (or you're not sure)
- For regularised linear models (Lasso, Ridge, ElasticNet) where coefficients are penalised equally
- For PCA — ensures each feature contributes equally based on information content, not scale
- When outliers are present but you don't want to remove them — z-score is less distorted than min-max by outliers
- For gradient-based optimisation (neural nets) — centred inputs (mean ≈ 0) help with gradient flow through activation functions
Standardisation vs. Normalisation — Quick Guide
- Bounded output needed → Min-Max
- Gaussian or unknown distribution → Standardisation
- Outliers present → Standardisation (or RobustScaler)
- Neural network pixel inputs → Min-Max to [0,1]
- PCA preprocessing → Standardisation
- Sparse data (many zeros) → MaxAbsScaler (preserves sparsity)
- Heavily skewed features → PowerTransformer then StandardScaler
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = [[25, 45000], [35, 120000], [45, 80000], [55, 200000]]
X_test = [[30, 95000], [60, 150000]]

# ---- Fit on train, transform both ----
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # use TRAIN mean and std

print("Means (should be ~0):", X_train_std.mean(axis=0))
print("Stds (should be ~1):", X_train_std.std(axis=0))

# ---- Inspect scaler statistics ----
print("Feature means:", scaler.mean_)
print("Feature stds:", scaler.scale_)

# ---- Inverse transform ----
X_recovered = scaler.inverse_transform(X_train_std)

# ---- Standardise a single new sample at inference time ----
new_sample = np.array([[40, 100000]])
new_sample_std = scaler.transform(new_sample)  # uses fitted scaler
```
🛡 Robust & Other Scalers
The standard scaler family assumes relatively clean, well-behaved distributions. In practice, features may be heavily skewed, contain persistent outliers, or have special structure (sparsity). These specialised scalers handle those cases.
| Scaler | Formula / Mechanism | Outlier Robust | Best For |
|---|---|---|---|
| StandardScaler | (x − mean) / std | No — outliers distort mean and std | General purpose; near-Gaussian distributions |
| MinMaxScaler | (x − min) / (max − min) | No — one outlier can ruin the range | Bounded outputs needed; image pixels |
| RobustScaler | (x − median) / IQR | Yes — uses median and IQR, ignoring tails | Data with outliers that can't be removed |
| MaxAbsScaler | x / max(|x|) | Partially — not centred; sensitive to max outlier | Sparse data (preserves zeros); already centred data |
| PowerTransformer (Box-Cox) | Optimises λ in x^λ transformation to normalise | No — transform first, then scale | Strictly positive, right-skewed features (income, price) |
| PowerTransformer (Yeo-Johnson) | Extends Box-Cox to handle zero and negative values | No | Skewed features with zeros or negatives |
| QuantileTransformer | Maps to uniform or Gaussian distribution via quantile mapping | Yes — outliers are collapsed to tails | Arbitrary distributions; making features strictly Gaussian |
```python
from sklearn.preprocessing import (
    RobustScaler, MaxAbsScaler, PowerTransformer, QuantileTransformer
)
import numpy as np

# Skewed income data with outliers
X = np.array([[500], [1000], [2000], [3000], [5000], [100000]])  # last = outlier

# ---- RobustScaler — resistant to the $100K outlier ----
rob = RobustScaler()
print("RobustScaler:", rob.fit_transform(X).T)
# Uses the median (2500) and IQR; the outlier maps to a large value
# but doesn't distort the scaling of the other points

# ---- PowerTransformer (Yeo-Johnson) — makes skewed dist more Gaussian ----
pt = PowerTransformer(method="yeo-johnson", standardize=True)
print("PowerTransformer:", pt.fit_transform(X).T)

# ---- QuantileTransformer — forces uniform or Gaussian output ----
# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(output_distribution="normal", n_quantiles=6,
                         random_state=42)
print("QuantileTransformer:", qt.fit_transform(X).T)

# ---- MaxAbsScaler — good for sparse matrices ----
from scipy.sparse import csr_matrix
X_sparse = csr_matrix([[0, 1, 0], [3, 0, 2], [0, 0, 5]])
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)
# Zeros remain zero — sparsity preserved
```
🧭 Practical Guidelines
The Cardinal Rule: Fit on Train Only
Always compute scaling statistics (mean, std, min, max, quantiles) using only the training set, then apply the same transformation to validation and test sets. Fitting on the full dataset before the train/test split leaks information from the future into your model — a subtle form of data leakage that inflates metrics and causes production failures. Call fit_transform(X_train) and transform(X_test) — never fit_transform(X_test).
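A small sketch of the difference on synthetic data: fitting the scaler on the full dataset bakes test-set statistics into the transform, so every "training" value is already contaminated by information it should never have seen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 1))
X_train, X_test = train_test_split(X, random_state=0)

# WRONG: statistics computed on the full dataset leak test-set information
leaky = StandardScaler().fit(X)

# RIGHT: fit on train only, apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_test_std = scaler.transform(X_test)

print("leaky mean:", leaky.mean_, "train-only mean:", scaler.mean_)
# The two means differ, so every scaled value differs too
```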
Scaling Targets in Regression
- Scaling the target variable (y) in regression can help gradient-based models; tree models don't need it
- Use a separate scaler for y — don't include it in the feature scaler
- Remember to inverse-transform predictions before computing final metrics in the original scale
- Log transformation of skewed targets (house prices, population) often works better than standard scaling
- TransformedTargetRegressor in sklearn wraps this cleanly
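A sketch of that pattern on a synthetic log-linear target (the data is made up for illustration): TransformedTargetRegressor trains on the transformed target and automatically inverse-transforms predictions back to the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 1))
y = np.exp(0.5 * X.ravel() + rng.normal(scale=0.1, size=200))  # skewed target

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,            # fit on log(y)
    inverse_func=np.exp,    # predictions come back in original units
)
model.fit(X, y)
preds = model.predict(X[:3])  # already inverse-transformed — no manual step
```

Keeping the inverse transform inside the estimator means metrics computed on `model.predict(...)` are automatically in the original scale.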
Per-Feature Scaling Strategy
- Different feature types may need different scalers — use ColumnTransformer
- Binary / one-hot encoded features — don't scale; already in [0,1]
- Count features — log1p transform then standard scale
- Percentage features [0,100] — divide by 100 or min-max to [0,1]
- Embedding vectors from neural nets — often L2-normalise per sample rather than scale per feature
- Cyclical features (hour of day, day of week) — sin/cos encoding, not scaling
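The last two bullets can be sketched in a few lines of NumPy (example values are arbitrary):

```python
import numpy as np

# Cyclical feature: hour of day — sin/cos keeps 23:00 and 00:00 close,
# unlike the raw values 23 and 0, which look maximally far apart
hours = np.array([0, 6, 12, 18, 23])
angle = 2 * np.pi * hours / 24
hour_sin, hour_cos = np.sin(angle), np.cos(angle)

# Per-sample L2 normalisation for embedding vectors: each ROW gets unit
# length, rather than scaling each feature column independently
emb = np.array([[3.0, 4.0], [1.0, 0.0]])
emb_l2 = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(emb_l2)  # every row now has norm 1
```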
```python
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PowerTransformer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample dataset
data = pd.DataFrame({
    "age": [25, 35, 45, 55, 28, 38],
    "income": [45000, 120000, 80000, 200000, 55000, 95000],
    "score": [0.2, 0.8, 0.6, 0.9, 0.4, 0.7],  # already [0,1]
    "country": ["US", "UK", "US", "DE", "UK", "US"],
    "target": [0, 1, 0, 1, 0, 1]
})
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ---- ColumnTransformer: different scalers per feature type ----
numeric_normal = ["age", "score"]   # approx Gaussian -> StandardScaler
numeric_skewed = ["income"]         # right-skewed -> PowerTransformer
categorical = ["country"]           # categorical -> OneHotEncoder

preprocessor = ColumnTransformer(transformers=[
    ("num_std", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler())
    ]), numeric_normal),
    ("num_skew", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("power", PowerTransformer(method="yeo-johnson")),
    ]), numeric_skewed),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ]), categorical),
], remainder="drop")

# ---- Full pipeline: preprocessing + model ----
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
# Scaler is fit only on X_train — transform is applied to X_test internally

# Save the entire pipeline to preserve scaling stats for production
joblib.dump(pipeline, "trained_pipeline.pkl")
```