⏱ 6 min read 📊 Beginner 🗓 Updated Jan 2025

⚙ The Estimator API

The Unified Interface

Every scikit-learn object — classifiers, regressors, transformers, clusterers — follows the same interface. This consistency is the library's greatest strength: swap algorithms without changing surrounding code.

  • fit(X, y) — learn from training data
  • predict(X) — generate predictions
  • predict_proba(X) — class probabilities (classifiers)
  • transform(X) — apply learned transformation
  • fit_transform(X) — fit then transform (shortcut)
  • score(X, y) — default metric (accuracy or R²)
  • get_params() / set_params() — hyperparameter access
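
A minimal sketch of the shared interface on a single estimator (toy data, illustrative values only):

```python
# Every estimator exposes the same fit / predict / score / get_params surface.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=500)
clf.fit(X, y)                      # learn coefficients from training data
preds = clf.predict(X)             # hard class labels
probs = clf.predict_proba(X)       # per-class probabilities, shape (200, 2)
acc = clf.score(X, y)              # default metric for classifiers: accuracy

print(clf.get_params()['C'])       # read a hyperparameter (default 1.0)
clf.set_params(C=0.5)              # change it in place
```

Swap `LogisticRegression` for any other classifier and every line after the constructor stays identical.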

Pipeline: Chain Everything

A Pipeline chains transformers and a final estimator into a single object. It guarantees that preprocessing steps are fitted only on training data and applied consistently to test data — the single most important practice for avoiding data leakage.

  • All steps except last must implement transform
  • pipeline.fit(X_train, y_train) fits all steps
  • pipeline.predict(X_test) applies all transforms first
  • GridSearchCV sees the pipeline as a single estimator
  • Hyperparams addressed as stepname__param
  • make_pipeline() — auto-names steps from class names
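
The `stepname__param` convention in action — a small sketch (step names chosen here for illustration):

```python
# Tune a parameter of a pipeline step via double-underscore addressing.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=500))])

# 'clf__C' addresses the C parameter of the step named 'clf'
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)

# set_params uses the same addressing scheme
pipe.set_params(clf__C=0.5)
```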

Why Consistency Matters

The Estimator API design means your evaluation code, cross-validation, and hyperparameter search all work identically regardless of algorithm. Replace LogisticRegression with RandomForestClassifier in one line and the entire pipeline still works.

  • Algorithms are swappable — great for experimentation
  • No framework-specific training loops needed
  • Serialise any fitted pipeline with joblib
  • Compatible with all sklearn evaluation utilities
  • Custom transformers extend BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── Pipeline: StandardScaler + LogisticRegression ──────────────────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    LogisticRegression(max_iter=1000, C=1.0))
])

pipe.fit(X_train, y_train)                          # fits scaler AND classifier
y_pred = pipe.predict(X_test)                       # scales X_test, then predicts
y_prob = pipe.predict_proba(X_test)[:, 1]           # probability of class 1

print(classification_report(y_test, y_pred))

# ── Swap algorithm in ONE line ─────────────────────────────────────────────────
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe_rf.fit(X_train, y_train)
print(f"RF accuracy: {pipe_rf.score(X_test, y_test):.3f}")

# ── make_pipeline shorthand (auto-names steps) ─────────────────────────────────
pipe2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe2.fit(X_train, y_train)

# ── Access individual steps ────────────────────────────────────────────────────
scaler = pipe.named_steps['scaler']
print(f"Feature means (first 5): {scaler.mean_[:5].round(3)}")

# ── Save and load a fitted pipeline ───────────────────────────────────────────
import joblib
joblib.dump(pipe, 'pipeline.joblib')
loaded_pipe = joblib.load('pipeline.joblib')
print(f"Loaded accuracy: {loaded_pipe.score(X_test, y_test):.3f}")
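
Custom transformers slot into the same machinery. A sketch (the class and its quantile bounds are illustrative, not a library API): subclass BaseEstimator and TransformerMixin, implement `fit`/`transform`, and `fit_transform` plus `get_params` come for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip each feature to quantile bounds learned during fit."""
    def __init__(self, low=0.01, high=0.99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        self.lo_ = np.quantile(X, self.low, axis=0)
        self.hi_ = np.quantile(X, self.high, axis=0)
        return self                      # fit must return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

X, y = make_classification(n_samples=300, random_state=0)
pipe = make_pipeline(ClipOutliers(), LogisticRegression(max_iter=500))
pipe.fit(X, y)                           # custom step is fitted like any other
print(f"Accuracy: {pipe.score(X, y):.3f}")
```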

🤖 Supervised Learning Algorithms

Start with RandomForest as Your Baseline

RandomForestClassifier/Regressor requires minimal preprocessing (no scaling needed), handles mixed feature types gracefully, is robust to outliers, provides feature importances, and achieves strong out-of-the-box performance. Establish your random forest baseline before investing in more complex models.

| Algorithm | Class | Key Hyperparameters | Notes |
| --- | --- | --- | --- |
| Logistic Regression | LogisticRegression | C, penalty, max_iter, solver | Strong baseline for text/linear problems; fast; interpretable coefficients |
| Random Forest | RandomForestClassifier / Regressor | n_estimators, max_depth, min_samples_leaf, max_features | Robust baseline; no scaling needed; built-in feature importance |
| Gradient Boosting | GradientBoostingClassifier / Regressor | n_estimators, learning_rate, max_depth, subsample | Often best on tabular data; slower to train; use HistGradientBoosting for speed |
| Support Vector Machine | SVC / SVR | C, kernel, gamma, degree | Excellent for high-dim data; scales poorly past ~50k samples; needs scaling |
| K-Nearest Neighbours | KNeighborsClassifier / Regressor | n_neighbors, metric, weights | No training phase; slow at prediction on large datasets; needs scaling |
| Linear Regression | LinearRegression | (none critical) | OLS; fast; interpretable; assumes linear relationship and normal residuals |
| Ridge Regression | Ridge | alpha | L2 regularisation; stabilises coefficients under multicollinearity; usually a better default than plain OLS |
| Lasso Regression | Lasso | alpha | L1 regularisation; induces sparsity; automatic feature selection |
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# ── Classification comparison ─────────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=30, n_informative=15,
                            random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=42, stratify=y)
classifiers = {
    'LogisticReg':  make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
    'GradBoost':    HistGradientBoostingClassifier(random_state=42),
    'SVM':          make_pipeline(StandardScaler(), SVC(probability=True)),
    'KNN':          make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
}
print("=== Classifier Comparison (5-fold CV ROC-AUC) ===")
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)
    print(f"  {name:15s}: {scores.mean():.4f} ± {scores.std():.4f}")

# ── Feature importances from RandomForest ────────────────────────────────────
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_k = np.argsort(importances)[::-1][:5]
print("\nTop 5 features by importance:")
for rank, idx in enumerate(top_k, 1):
    print(f"  {rank}. Feature {idx}: {importances[idx]:.4f}")

# ── Regression: Ridge vs Lasso ────────────────────────────────────────────────
Xr, yr = make_regression(n_samples=500, n_features=50, n_informative=20,
                          noise=10, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2,
                                                        random_state=42)
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=2000))
    ridge.fit(Xr_train, yr_train)
    lasso.fit(Xr_train, yr_train)
    r_coef = ridge.named_steps['ridge'].coef_
    l_coef = lasso.named_steps['lasso'].coef_
    print(f"alpha={alpha:6.1f}  Ridge R²={ridge.score(Xr_test,yr_test):.3f}  "
          f"Lasso R²={lasso.score(Xr_test,yr_test):.3f}  "
          f"Lasso zeros={np.sum(l_coef==0)}")

🛠 Preprocessing & Feature Engineering

Scalers

Many algorithms (SVM, KNN, neural networks, logistic regression with regularisation) require features on similar scales. Tree-based methods (RandomForest, GBM) do not require scaling.

  • StandardScaler — zero mean, unit variance; sensitive to outliers
  • MinMaxScaler — scales to [0,1]; sensitive to outliers
  • RobustScaler — uses median/IQR; robust to outliers
  • MaxAbsScaler — divides by max absolute value; sparse-safe
  • PowerTransformer — makes data more Gaussian-like
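
A tiny sketch (toy array) of why the outlier sensitivity above matters in practice:

```python
# One extreme outlier: StandardScaler's mean/std are dragged by it,
# squashing the normal points; RobustScaler's median/IQR ignore it.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

print(std[:4].ravel().round(2))   # normal points collapse to ~-0.5 each
print(rob[:4].ravel().round(2))   # normal points keep their relative spread
```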

Encoders

ML algorithms require numeric inputs. Categorical features must be encoded. The right encoder depends on the cardinality and the downstream algorithm.

  • OneHotEncoder — one binary column per category; high cardinality explodes dimensionality
  • OrdinalEncoder — integer codes; use for tree models or ordered categories
  • LabelEncoder — encodes target variable y only
  • TargetEncoder — mean-encode with cross-fitting; great for GBM
  • High-cardinality: consider hashing trick or embedding layers

Other Transformers

Feature engineering transformers can be composed into pipelines and cross-validated exactly like models.

  • SimpleImputer — fill NaN with mean/median/mode/constant
  • PolynomialFeatures — interaction terms and powers
  • FunctionTransformer — wrap any numpy function
  • SelectKBest — univariate feature selection
  • PCA — dimensionality reduction
  • ColumnTransformer — apply different steps per column subset
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# ── Heterogeneous dataset with numeric and categorical features ───────────────
rng = np.random.default_rng(42)
n = 800
df = pd.DataFrame({
    'age':        rng.integers(18, 70, n).astype(float),
    'salary':     rng.normal(60000, 20000, n),
    'tenure':     rng.integers(0, 30, n).astype(float),
    'score':      rng.normal(75, 15, n),
    'department': rng.choice(['Eng', 'Sales', 'HR', 'Finance'], n),
    'education':  rng.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'target':     rng.integers(0, 2, n)
})

# Inject some missing values
df.loc[rng.choice(n, 60, replace=False), 'age'] = np.nan
df.loc[rng.choice(n, 40, replace=False), 'department'] = np.nan

X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      stratify=y, random_state=42)

# ── Define column groups ──────────────────────────────────────────────────────
numeric_features     = ['age', 'salary', 'tenure', 'score']
categorical_features = ['department', 'education']

# ── Sub-pipelines per column type ─────────────────────────────────────────────
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  RobustScaler()),          # robust to outlier salaries
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# ── ColumnTransformer assembles the two sub-pipelines ─────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline,     numeric_features),
    ('cat', categorical_pipeline, categorical_features),
], remainder='drop')

# ── Full pipeline with classifier ─────────────────────────────────────────────
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier',   RandomForestClassifier(n_estimators=200, random_state=42)),
])

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Inspect preprocessed shape (steps were fitted in place by full_pipeline.fit)
X_transformed = full_pipeline.named_steps['preprocessor'].transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"After preprocessing: {X_transformed.shape[1]}")
# numeric: 4, OHE department: 4, OHE education: 4 → 12 total
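
The remaining feature-engineering transformers from the list above compose the same way — a sketch chaining polynomial interactions with univariate selection (parameter choices here are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, random_state=42)

feat_pipe = Pipeline([
    ('scale',  StandardScaler()),
    ('poly',   PolynomialFeatures(degree=2, include_bias=False)),  # 8 → 44 cols
    ('select', SelectKBest(f_classif, k=15)),   # keep the 15 best of 44
    ('clf',    LogisticRegression(max_iter=1000)),
])
feat_pipe.fit(X, y)
print(f"Accuracy: {feat_pipe.score(X, y):.3f}")
```

Because the whole chain is one estimator, `degree` and `k` can be grid-searched as `poly__degree` and `select__k`.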

🎯 Model Selection & Evaluation

Cross-Validation

Never evaluate a model on the same data used to train it. Cross-validation provides a reliable estimate of generalisation performance by averaging results over multiple train/test splits.

  • cross_val_score — k-fold CV, returns per-fold scores
  • cross_validate — also returns fit/score time and train scores
  • StratifiedKFold — preserves class balance per fold
  • RepeatedStratifiedKFold — repeat k-fold for variance estimate
  • Use n_jobs=-1 to parallelise across folds
  • Report mean ± std, not just mean accuracy
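
A sketch of cross_validate, which adds timing and train scores on top of what cross_val_score returns — a train score far above the test score is a quick overfitting signal:

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

res = cross_validate(RandomForestClassifier(random_state=0), X, y,
                     cv=cv, scoring='roc_auc',
                     return_train_score=True, n_jobs=-1)
print(f"test:  {res['test_score'].mean():.3f} ± {res['test_score'].std():.3f}")
print(f"train: {res['train_score'].mean():.3f}")
print(f"fit time/fold: {res['fit_time'].mean():.2f}s")
```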

Hyperparameter Search

Default hyperparameters are rarely optimal. Systematic search over a defined grid or distribution finds better configurations, but must be done carefully to avoid overfitting on the validation set.

  • GridSearchCV — exhaustive; use for small grids
  • RandomizedSearchCV — samples n_iter combinations; better for large spaces
  • HalvingGridSearchCV — successive halving; faster
  • Optuna / Hyperopt — Bayesian optimisation for larger search spaces
  • Always use the CV object's .best_estimator_ for final eval
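
A sketch of successive halving (still experimental in scikit-learn, hence the enable_* import; the API may shift between versions, and the grid here is illustrative):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)

search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {'max_depth': [3, 5, 8, None], 'min_samples_leaf': [1, 5, 20]},
    factor=3,              # keep the top 1/3 of candidates each round
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Early rounds evaluate all candidates on small sample budgets; only survivors get the full data, which is why it beats exhaustive search on large grids.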

Scoring Metrics

Accuracy is often the wrong metric. Imbalanced classes, different misclassification costs, and probability calibration all demand specialised metrics. Choose your metric before looking at results.

  • accuracy — misleading on imbalanced data
  • f1 / f1_macro / f1_weighted — precision/recall balance
  • roc_auc — probability ranking quality
  • average_precision — area under PR curve
  • neg_mean_squared_error — regression MSE (negated)
  • r2 — coefficient of determination
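
To see why accuracy misleads on imbalance, a sketch using a DummyClassifier that always predicts the majority class:

```python
# On a ~95/5 split, predicting the majority class scores ~95% accuracy
# while detecting zero positives — F1 exposes the failure immediately.
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
y_pred = dummy.predict(X_te)
print(f"accuracy: {accuracy_score(y_te, y_pred):.3f}")   # looks great
print(f"f1:       {f1_score(y_te, y_pred):.3f}")         # → 0.0
```
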
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import (
    train_test_split, GridSearchCV, RandomizedSearchCV,
    cross_val_score, StratifiedKFold
)
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score, f1_score
)
from sklearn.datasets import make_classification
import numpy as np
from scipy.stats import randint, uniform

# Imbalanced dataset: 90% negative, 10% positive (fraud detection-style)
X, y = make_classification(
    n_samples=3000, n_features=25, n_informative=12,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── Step 1: Cross-validate baseline ───────────────────────────────────────────
baseline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X_train, y_train,
                         cv=cv, scoring='roc_auc', n_jobs=-1)
print(f"Baseline ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# ── Step 2: RandomizedSearchCV to find good hyperparameters ──────────────────
param_dist = {
    'clf__n_estimators':   randint(100, 500),
    'clf__max_depth':      randint(2, 8),
    'clf__learning_rate':  uniform(0.01, 0.2),
    'clf__subsample':      uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    baseline,
    param_distributions=param_dist,
    n_iter=40,                    # sample 40 random combinations
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1,
    return_train_score=True,
)
search.fit(X_train, y_train)

print(f"\nBest ROC-AUC (CV): {search.best_score_:.4f}")
print("Best params:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")

# ── Step 3: Evaluate best model on held-out test set ─────────────────────────
best_model = search.best_estimator_
y_pred      = best_model.predict(X_test)
y_prob      = best_model.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC:        {roc_auc_score(y_test, y_prob):.4f}")
print(f"Test Avg Precision:  {average_precision_score(y_test, y_prob):.4f}")
print(f"Test F1 (weighted):  {f1_score(y_test, y_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative','Positive']))

📊 Unsupervised Learning

| Algorithm | Class | Use Case | Key Parameters | Output |
| --- | --- | --- | --- | --- |
| K-Means | KMeans | Customer segmentation, vector quantisation, image compression | n_clusters, init, n_init | Cluster labels, centroids, inertia |
| DBSCAN | DBSCAN | Arbitrary-shape clusters, noise/outlier detection | eps, min_samples | Cluster labels (-1 = noise) |
| Gaussian Mixture | GaussianMixture | Soft clustering, density estimation, generative modelling | n_components, covariance_type | Soft labels, log-likelihood |
| PCA | PCA | Dimensionality reduction, visualisation, noise removal | n_components | Projected data, explained variance ratio |
| Isolation Forest | IsolationForest | Anomaly detection in high-dimensional tabular data | contamination, n_estimators | Anomaly scores, inlier/outlier labels |
| UMAP / t-SNE | TSNE / umap-learn | 2D/3D visualisation of high-dimensional embeddings | n_components, perplexity (t-SNE) | 2D/3D coordinates for plotting |
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs, make_moons
import numpy as np

rng = np.random.default_rng(42)

# ── K-Means with Elbow Method ─────────────────────────────────────────────────
X_blobs, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_blobs)

inertias    = []
silhouettes = []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(
        silhouette_score(X_scaled, labels, sample_size=300, random_state=42))

best_k = k_range.start + np.argmax(silhouettes)
print(f"Best k by silhouette score: {best_k}")
print("Silhouette scores:", [f"{s:.3f}" for s in silhouettes])

km_final = KMeans(n_clusters=best_k, n_init=20, random_state=42)
km_final.fit(X_scaled)
print(f"Cluster sizes: {np.bincount(km_final.labels_)}")

# ── DBSCAN for non-convex clusters ────────────────────────────────────────────
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.15, min_samples=5)
labels_db = db.fit_predict(X_moons)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise    = (labels_db == -1).sum()
print(f"\nDBSCAN: {n_clusters} clusters, {n_noise} noise points")

# ── PCA for visualisation + variance analysis ─────────────────────────────────
X_high_dim = rng.standard_normal((300, 50))
X_high_dim[:150] += 2.0    # class 0 offset

pca = PCA()
pca.fit(StandardScaler().fit_transform(X_high_dim))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"\nPCA: {n_95} components explain 95% of variance")

pca_vis = PCA(n_components=2)
X_2d = pca_vis.fit_transform(StandardScaler().fit_transform(X_high_dim))
print(f"2D projection: {X_2d.shape}, variance explained: {pca_vis.explained_variance_ratio_.sum():.2%}")

# ── IsolationForest for anomaly detection ─────────────────────────────────────
X_normal = rng.standard_normal((950, 10))
X_anomaly = rng.standard_normal((50, 10)) * 4 + 6   # anomalies far from origin
X_all = np.vstack([X_normal, X_anomaly])
true_labels = np.hstack([np.ones(950), -np.ones(50)])  # 1=normal, -1=anomaly

iforest = IsolationForest(contamination=0.05, random_state=42, n_estimators=200)
pred_labels = iforest.fit_predict(X_all)

TP = ((pred_labels == -1) & (true_labels == -1)).sum()
FP = ((pred_labels == -1) & (true_labels ==  1)).sum()
print(f"\nIsolation Forest: {TP} true anomalies detected, {FP} false positives")
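
GaussianMixture appears in the table above but not in the code; a sketch of its defining feature, soft assignments (blob parameters chosen for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

probs = gmm.predict_proba(X)       # shape (400, 3); each row sums to 1
hard  = gmm.predict(X)             # argmax of the soft assignments
print(f"avg max membership: {probs.max(axis=1).mean():.3f}")
print(f"log-likelihood per sample: {gmm.score(X):.2f}")
```

Points near a boundary get split membership (e.g. 0.6/0.4) rather than a forced hard label — useful when cluster overlap itself is the signal.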