⚙ The Estimator API
The Unified Interface
Every scikit-learn object — classifiers, regressors, transformers, clusterers — follows the same interface. This consistency is the library's greatest strength: swap algorithms without changing surrounding code.
- `fit(X, y)` — learn from training data
- `predict(X)` — generate predictions
- `predict_proba(X)` — class probabilities (classifiers)
- `transform(X)` — apply learned transformation
- `fit_transform(X)` — fit then transform (shortcut)
- `score(X, y)` — default metric (accuracy or R²)
- `get_params()` / `set_params()` — hyperparameter access
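A minimal sketch of the shared contract (the `holdout_score` helper is mine, not sklearn API): any estimator drops into the same evaluation code unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def holdout_score(estimator):
    # identical code path for any estimator; that is the point of the API
    estimator.fit(X_train, y_train)
    return estimator.score(X_test, y_test)

for est in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    print(f"{type(est).__name__}: {holdout_score(est):.3f}")

# get_params()/set_params() give uniform hyperparameter access
clf = LogisticRegression()
clf.set_params(C=0.5)
print(clf.get_params()['C'])   # 0.5
```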
Pipeline: Chain Everything
A Pipeline chains transformers and a final estimator into a single object. It guarantees that preprocessing steps are fitted only on training data and applied consistently to test data — the single most important practice for avoiding data leakage.
- All steps except the last must implement `transform`
- `pipeline.fit(X_train, y_train)` fits all steps
- `pipeline.predict(X_test)` applies all transforms first
- GridSearchCV sees the pipeline as a single estimator
- Hyperparameters addressed as `stepname__param`
- `make_pipeline()` — auto-names steps from class names
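The naming and addressing rules can be sketched directly:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
# nested hyperparameters are addressed as stepname__param
pipe.set_params(clf__C=0.1)
print(pipe.get_params()['clf__C'])   # 0.1

# make_pipeline derives step names from the lowercased class names
auto = make_pipeline(StandardScaler(), LogisticRegression())
print([name for name, _ in auto.steps])  # ['standardscaler', 'logisticregression']
```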
Why Consistency Matters
The Estimator API design means your evaluation code, cross-validation, and hyperparameter search all work identically regardless of algorithm. Replace LogisticRegression with RandomForestClassifier in one line and the entire pipeline still works.
- Algorithms are swappable — great for experimentation
- No framework-specific training loops needed
- Serialise any fitted pipeline with joblib
- Compatible with all sklearn evaluation utilities
- Custom transformers extend `BaseEstimator`, `TransformerMixin`
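A sketch of that last point: a custom transformer that plugs into any pipeline (the `ClipOutliers` class and its percentile bounds are illustrative, not part of sklearn).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip each feature to [lower, upper] percentiles learned at fit time."""
    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper
    def fit(self, X, y=None):
        # learn the per-feature bounds on training data only
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self
    def transform(self, X):
        return np.clip(X, self.low_, self.high_)

X, y = make_classification(n_samples=300, random_state=0)
pipe = make_pipeline(ClipOutliers(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)  # behaves like any built-in transformer in a pipeline
print(f"training accuracy: {pipe.score(X, y):.3f}")
```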
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# ── Pipeline: StandardScaler + LogisticRegression (10 lines) ──────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, C=1.0))
])
pipe.fit(X_train, y_train) # fits scaler AND classifier
y_pred = pipe.predict(X_test) # scales X_test, then predicts
y_prob = pipe.predict_proba(X_test)[:, 1] # probability of class 1
print(classification_report(y_test, y_pred))
# ── Swap algorithm in ONE line ─────────────────────────────────────────────────
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe_rf.fit(X_train, y_train)
print(f"RF accuracy: {pipe_rf.score(X_test, y_test):.3f}")
# ── make_pipeline shorthand (auto-names steps) ─────────────────────────────────
pipe2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe2.fit(X_train, y_train)
# ── Access individual steps ────────────────────────────────────────────────────
scaler = pipe.named_steps['scaler']
print(f"Feature means (first 5): {scaler.mean_[:5].round(3)}")
# ── Save and load a fitted pipeline ───────────────────────────────────────────
import joblib
joblib.dump(pipe, 'pipeline.joblib')
loaded_pipe = joblib.load('pipeline.joblib')
print(f"Loaded accuracy: {loaded_pipe.score(X_test, y_test):.3f}")
🤖 Supervised Learning Algorithms
Start with RandomForest as Your Baseline
RandomForestClassifier/Regressor requires minimal preprocessing (no scaling needed), handles mixed feature types gracefully, is robust to outliers, provides feature importances, and achieves strong out-of-the-box performance. Establish your random forest baseline before investing in more complex models.
| Algorithm | Class | Key Hyperparameters | Notes |
|---|---|---|---|
| Logistic Regression | `LogisticRegression` | `C`, `penalty`, `max_iter`, `solver` | Strong baseline for text/linear problems; fast; interpretable coefficients |
| Random Forest | `RandomForestClassifier` / `Regressor` | `n_estimators`, `max_depth`, `min_samples_leaf`, `max_features` | Robust baseline; no scaling needed; built-in feature importance |
| Gradient Boosting | `GradientBoostingClassifier` / `Regressor` | `n_estimators`, `learning_rate`, `max_depth`, `subsample` | Often best on tabular data; slower to train; use `HistGradientBoosting` for speed |
| Support Vector Machine | `SVC` / `SVR` | `C`, `kernel`, `gamma`, `degree` | Excellent for high-dimensional data; scales poorly past ~50k samples; needs scaling |
| K-Nearest Neighbours | `KNeighborsClassifier` / `Regressor` | `n_neighbors`, `metric`, `weights` | No training phase; slow at prediction on large datasets; needs scaling |
| Linear Regression | `LinearRegression` | (none of note) | OLS; fast; interpretable; assumes a linear relationship |
| Ridge Regression | `Ridge` | `alpha` | L2 regularisation; stabilises coefficients under multicollinearity; often preferable to plain OLS |
| Lasso Regression | `Lasso` | `alpha` | L1 regularisation; induces sparsity; automatic feature selection |
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np
# ── Classification comparison ─────────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=30, n_informative=15,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
classifiers = {
    'LogisticReg': make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
    'GradBoost': HistGradientBoostingClassifier(random_state=42),
    'SVM': make_pipeline(StandardScaler(), SVC(probability=True)),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
}
print("=== Classifier Comparison (5-fold CV ROC-AUC) ===")
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)
    print(f"  {name:15s}: {scores.mean():.4f} ± {scores.std():.4f}")
# ── Feature importances from RandomForest ────────────────────────────────────
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_k = np.argsort(importances)[::-1][:5]
print("\nTop 5 features by importance:")
for rank, idx in enumerate(top_k, 1):
    print(f"  {rank}. Feature {idx}: {importances[idx]:.4f}")
# ── Regression: Ridge vs Lasso ────────────────────────────────────────────────
Xr, yr = make_regression(n_samples=500, n_features=50, n_informative=20,
                         noise=10, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2,
                                                        random_state=42)
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=2000))
    ridge.fit(Xr_train, yr_train)
    lasso.fit(Xr_train, yr_train)
    r_coef = ridge.named_steps['ridge'].coef_
    l_coef = lasso.named_steps['lasso'].coef_
    print(f"alpha={alpha:6.1f}  Ridge R²={ridge.score(Xr_test, yr_test):.3f}  "
          f"Lasso R²={lasso.score(Xr_test, yr_test):.3f}  "
          f"Lasso zeros={np.sum(l_coef == 0)}")
🛠 Preprocessing & Feature Engineering
Scalers
Many algorithms (SVM, KNN, neural networks, logistic regression with regularisation) require features on similar scales. Tree-based methods (RandomForest, GBM) do not require scaling.
- `StandardScaler` — zero mean, unit variance; sensitive to outliers
- `MinMaxScaler` — scales to [0, 1]; sensitive to outliers
- `RobustScaler` — uses median/IQR; robust to outliers
- `MaxAbsScaler` — divides by max absolute value; sparse-safe
- `PowerTransformer` — makes data more Gaussian-like
Encoders
ML algorithms require numeric inputs. Categorical features must be encoded. The right encoder depends on the cardinality and the downstream algorithm.
- `OneHotEncoder` — binary columns per category; high cardinality = slow
- `OrdinalEncoder` — integer codes; use for tree models or ordered categories
- `LabelEncoder` — encodes the target variable y only
- `TargetEncoder` — mean-encode with cross-fitting; great for GBM
- High cardinality: consider the hashing trick or embedding layers
Other Transformers
Feature engineering transformers can be composed into pipelines and cross-validated exactly like models.
- `SimpleImputer` — fill NaN with mean/median/mode/constant
- `PolynomialFeatures` — interaction terms and powers
- `FunctionTransformer` — wrap any numpy function
- `SelectKBest` — univariate feature selection
- `PCA` — dimensionality reduction
- `ColumnTransformer` — apply different steps per column subset
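Short sketches of three of these transformers; the shapes in the comments assume the toy data generated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# degree-2 expansion: 10 originals + 10 squares + 45 pairwise terms = 65 columns
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X).shape)   # (200, 65)

# FunctionTransformer wraps any stateless numpy function
log_t = FunctionTransformer(np.log1p)
print(log_t.fit_transform(np.array([[0.0, np.e - 1]])))  # [[0. 1.]]

# SelectKBest keeps the k features with the highest univariate F-score
X_best = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(X_best.shape)   # (200, 5)
```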
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# ── Heterogeneous dataset with numeric and categorical features ───────────────
rng = np.random.default_rng(42)
n = 800
df = pd.DataFrame({
    'age': rng.integers(18, 70, n).astype(float),
    'salary': rng.normal(60000, 20000, n),
    'tenure': rng.integers(0, 30, n).astype(float),
    'score': rng.normal(75, 15, n),
    'department': rng.choice(['Eng', 'Sales', 'HR', 'Finance'], n),
    'education': rng.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'target': rng.integers(0, 2, n)
})
# Inject some missing values
df.loc[rng.choice(n, 60, replace=False), 'age'] = np.nan
df.loc[rng.choice(n, 40, replace=False), 'department'] = np.nan
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
# ── Define column groups ──────────────────────────────────────────────────────
numeric_features = ['age', 'salary', 'tenure', 'score']
categorical_features = ['department', 'education']
# ── Sub-pipelines per column type ─────────────────────────────────────────────
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),   # robust to outlier salaries
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
# ── ColumnTransformer assembles the two sub-pipelines ─────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
], remainder='drop')
# ── Full pipeline with classifier ─────────────────────────────────────────────
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Inspect the preprocessed shape (preprocessor was already fitted inside full_pipeline)
X_transformed = preprocessor.transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"After preprocessing: {X_transformed.shape[1]}")
# numeric: 4, OHE department: 4, OHE education: 4 → 12 total
🎯 Model Selection & Evaluation
Cross-Validation
Never evaluate a model on the same data used to train it. Cross-validation provides a reliable estimate of generalisation performance by averaging results over multiple train/test splits.
- `cross_val_score` — k-fold CV, returns per-fold scores
- `cross_validate` — also returns fit/score time and train scores
- `StratifiedKFold` — preserves class balance per fold
- `RepeatedStratifiedKFold` — repeat k-fold for a variance estimate
- Use `n_jobs=-1` to parallelise across folds
- Report mean ± std, not just the mean accuracy
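A small `cross_validate` sketch showing the extra information it returns over `cross_val_score` (toy data; timings will vary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv,
                     scoring='roc_auc', return_train_score=True, n_jobs=-1)
print(f"test : {res['test_score'].mean():.3f} ± {res['test_score'].std():.3f}")
print(f"train: {res['train_score'].mean():.3f}")  # a large train/test gap suggests overfitting
print(f"total fit time: {res['fit_time'].sum():.2f}s")
```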
Hyperparameter Search
Default hyperparameters are rarely optimal. Systematic search over a defined grid or distribution finds better configurations, but must be done carefully to avoid overfitting on the validation set.
- `GridSearchCV` — exhaustive; use for small grids
- `RandomizedSearchCV` — samples n_iter combinations; better for large spaces
- `HalvingGridSearchCV` — successive halving; faster
- Optuna/Hyperopt: Bayesian optimisation for deeper searches
- Always use the CV object's `.best_estimator_` for final evaluation
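A compact `GridSearchCV` sketch over a pipeline (toy data; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=0)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
print(f"best CV ROC-AUC: {grid.best_score_:.3f}")
# refit=True (the default) retrains best_estimator_ on all of X, y
best = grid.best_estimator_
```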
Scoring Metrics
Accuracy is often the wrong metric. Imbalanced classes, different misclassification costs, and probability calibration all demand specialised metrics. Choose your metric before looking at results.
- `accuracy` — misleading on imbalanced data
- `f1` / `f1_macro` / `f1_weighted` — precision/recall balance
- `roc_auc` — probability ranking quality
- `average_precision` — area under the PR curve
- `neg_mean_squared_error` — regression MSE (negated)
- `r2` — coefficient of determination
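A tiny illustration of why accuracy misleads: on a 95/5 class split, the constant "always predict negative" classifier still scores 0.95.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% negatives
y_pred = np.zeros(100, dtype=int)        # classifier that never predicts positive
print(accuracy_score(y_true, y_pred))                 # 0.95, looks fine
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0, reveals the failure
```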
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import (
    train_test_split, GridSearchCV, RandomizedSearchCV,
    cross_val_score, StratifiedKFold
)
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score, f1_score
)
from sklearn.datasets import make_classification
import numpy as np
from scipy.stats import randint, uniform
# Imbalanced dataset: 90% negative, 10% positive (fraud detection-style)
X, y = make_classification(
    n_samples=3000, n_features=25, n_informative=12,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# ── Step 1: Cross-validate baseline ───────────────────────────────────────────
baseline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X_train, y_train,
                         cv=cv, scoring='roc_auc', n_jobs=-1)
print(f"Baseline ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
# ── Step 2: RandomizedSearchCV to find good hyperparameters ──────────────────
param_dist = {
    'clf__n_estimators': randint(100, 500),
    'clf__max_depth': randint(2, 8),
    'clf__learning_rate': uniform(0.01, 0.2),
    'clf__subsample': uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    baseline,
    param_distributions=param_dist,
    n_iter=40,               # sample 40 random combinations
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1,
    return_train_score=True,
)
search.fit(X_train, y_train)
print(f"\nBest ROC-AUC (CV): {search.best_score_:.4f}")
print("Best params:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")
# ── Step 3: Evaluate best model on held-out test set ─────────────────────────
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]
print(f"\nTest ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Test Avg Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"Test F1 (weighted): {f1_score(y_test, y_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative','Positive']))
📊 Unsupervised Learning
| Algorithm | Class | Use Case | Key Parameters | Output |
|---|---|---|---|---|
| K-Means | `KMeans` | Customer segmentation, vector quantisation, image compression | `n_clusters`, `init`, `n_init` | Cluster labels, centroids, inertia |
| DBSCAN | `DBSCAN` | Arbitrary-shape clusters, noise/outlier detection | `eps`, `min_samples` | Cluster labels (-1 = noise) |
| Gaussian Mixture | `GaussianMixture` | Soft clustering, density estimation, generative modelling | `n_components`, `covariance_type` | Soft labels, log-likelihood |
| PCA | `PCA` | Dimensionality reduction, visualisation, noise removal | `n_components` | Projected data, explained variance ratio |
| Isolation Forest | `IsolationForest` | Anomaly detection in high-dimensional tabular data | `contamination`, `n_estimators` | Anomaly scores, inlier/outlier labels |
| UMAP / t-SNE | `TSNE` / `umap-learn` | 2D/3D visualisation of high-dimensional embeddings | `n_components`, `perplexity` (t-SNE) | 2D/3D coordinates for plotting |
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs, make_moons
import numpy as np
rng = np.random.default_rng(42)
# ── K-Means with Elbow Method ─────────────────────────────────────────────────
X_blobs, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_blobs)
inertias = []
silhouettes = []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels,
                                        sample_size=300, random_state=42))
best_k = k_range.start + np.argmax(silhouettes)
print(f"Best k by silhouette score: {best_k}")
print("Silhouette scores:", [f"{s:.3f}" for s in silhouettes])
km_final = KMeans(n_clusters=best_k, n_init=20, random_state=42)
km_final.fit(X_scaled)
print(f"Cluster sizes: {np.bincount(km_final.labels_)}")
# ── DBSCAN for non-convex clusters ────────────────────────────────────────────
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.15, min_samples=5)
labels_db = db.fit_predict(X_moons)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = (labels_db == -1).sum()
print(f"\nDBSCAN: {n_clusters} clusters, {n_noise} noise points")
# ── PCA for visualisation + variance analysis ─────────────────────────────────
X_high_dim = rng.standard_normal((300, 50))
X_high_dim[:150] += 2.0 # class 0 offset
pca = PCA()
pca.fit(StandardScaler().fit_transform(X_high_dim))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"\nPCA: {n_95} components explain 95% of variance")
pca_vis = PCA(n_components=2)
X_2d = pca_vis.fit_transform(StandardScaler().fit_transform(X_high_dim))
print(f"2D projection: {X_2d.shape}, variance explained: {pca_vis.explained_variance_ratio_.sum():.2%}")
# ── IsolationForest for anomaly detection ─────────────────────────────────────
X_normal = rng.standard_normal((950, 10))
X_anomaly = rng.standard_normal((50, 10)) * 4 + 6 # anomalies far from origin
X_all = np.vstack([X_normal, X_anomaly])
true_labels = np.hstack([np.ones(950), -np.ones(50)]) # 1=normal, -1=anomaly
iforest = IsolationForest(contamination=0.05, random_state=42, n_estimators=200)
pred_labels = iforest.fit_predict(X_all)
TP = ((pred_labels == -1) & (true_labels == -1)).sum()
FP = ((pred_labels == -1) & (true_labels == 1)).sum()
print(f"\nIsolation Forest: {TP} true anomalies detected, {FP} false positives")
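The GaussianMixture row of the table is not covered above; a minimal soft-clustering sketch on toy blobs, with BIC-based model selection (lower BIC is better):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_gmm, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.0, random_state=42)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_gmm)

# soft clustering: each row is a probability distribution over components
probs = gmm.predict_proba(X_gmm)
print(probs.shape)   # (400, 3); each row sums to 1

# BIC offers a principled way to pick n_components
for k in (2, 3, 4):
    bic = GaussianMixture(n_components=k, random_state=42).fit(X_gmm).bic(X_gmm)
    print(f"k={k}: BIC={bic:.1f}")
```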