⏱ 6 min read 📊 Beginner 🗓 Updated Jan 2025

⚙ The Estimator API

The Unified Interface

Every scikit-learn object — classifiers, regressors, transformers, clusterers — follows the same interface. This consistency is the library's greatest strength: swap algorithms without changing surrounding code.

  • fit(X, y) — learn from training data
  • predict(X) — generate predictions
  • predict_proba(X) — class probabilities (classifiers)
  • transform(X) — apply learned transformation
  • fit_transform(X) — fit then transform (shortcut)
  • score(X, y) — default metric (accuracy or R²)
  • get_params() / set_params() — hyperparameter access
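
A minimal sketch of the shared interface on a single estimator (toy data, illustrative values only):

```python
# Every estimator exposes the same fit / predict / score / get_params surface.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=500)
clf.fit(X, y)                      # learn coefficients from training data
preds = clf.predict(X)             # hard class labels
probs = clf.predict_proba(X)       # per-class probabilities, shape (200, 2)
acc = clf.score(X, y)              # default metric for classifiers: accuracy

print(clf.get_params()['C'])       # read a hyperparameter (default 1.0)
clf.set_params(C=0.5)              # change it in place
```

Swap `LogisticRegression` for any other classifier and every line after the constructor stays identical.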

Pipeline: Chain Everything

A Pipeline chains transformers and a final estimator into a single object. It guarantees that preprocessing steps are fitted only on training data and applied consistently to test data — the single most important practice for avoiding data leakage.

  • All steps except last must implement transform
  • pipeline.fit(X_train, y_train) fits all steps
  • pipeline.predict(X_test) applies all transforms first
  • GridSearchCV sees the pipeline as a single estimator
  • Hyperparams addressed as stepname__param
  • make_pipeline() — auto-names steps from class names
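
The `stepname__param` convention in action — a small sketch (step names chosen here for illustration):

```python
# Tune a parameter of a pipeline step via double-underscore addressing.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=500))])

# 'clf__C' addresses the C parameter of the step named 'clf'
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)

# set_params uses the same addressing scheme
pipe.set_params(clf__C=0.5)
```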

Why Consistency Matters

The Estimator API design means your evaluation code, cross-validation, and hyperparameter search all work identically regardless of algorithm. Replace LogisticRegression with RandomForestClassifier in one line and the entire pipeline still works.

  • Algorithms are swappable — great for experimentation
  • No framework-specific training loops needed
  • Serialise any fitted pipeline with joblib
  • Compatible with all sklearn evaluation utilities
  • Custom transformers extend BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── Pipeline: StandardScaler + LogisticRegression ──────────────────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    LogisticRegression(max_iter=1000, C=1.0))
])

pipe.fit(X_train, y_train)                          # fits scaler AND classifier
y_pred = pipe.predict(X_test)                       # scales X_test, then predicts
y_prob = pipe.predict_proba(X_test)[:, 1]           # probability of class 1

print(classification_report(y_test, y_pred))

# ── Swap algorithm in ONE line ─────────────────────────────────────────────────
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe_rf.fit(X_train, y_train)
print(f"RF accuracy: {pipe_rf.score(X_test, y_test):.3f}")

# ── make_pipeline shorthand (auto-names steps) ─────────────────────────────────
pipe2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe2.fit(X_train, y_train)

# ── Access individual steps ────────────────────────────────────────────────────
scaler = pipe.named_steps['scaler']
print(f"Feature means (first 5): {scaler.mean_[:5].round(3)}")

# ── Save and load a fitted pipeline ───────────────────────────────────────────
import joblib
joblib.dump(pipe, 'pipeline.joblib')
loaded_pipe = joblib.load('pipeline.joblib')
print(f"Loaded accuracy: {loaded_pipe.score(X_test, y_test):.3f}")
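
Custom transformers slot into the same machinery. A sketch (the class and its quantile bounds are illustrative, not a library API): subclass BaseEstimator and TransformerMixin, implement `fit`/`transform`, and `fit_transform` plus `get_params` come for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip each feature to quantile bounds learned during fit."""
    def __init__(self, low=0.01, high=0.99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        self.lo_ = np.quantile(X, self.low, axis=0)
        self.hi_ = np.quantile(X, self.high, axis=0)
        return self                      # fit must return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

X, y = make_classification(n_samples=300, random_state=0)
pipe = make_pipeline(ClipOutliers(), LogisticRegression(max_iter=500))
pipe.fit(X, y)                           # custom step is fitted like any other
print(f"Accuracy: {pipe.score(X, y):.3f}")
```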

🤖 Supervised Learning Algorithms

Start with RandomForest as Your Baseline

RandomForestClassifier/Regressor requires minimal preprocessing (no scaling needed), handles mixed feature types gracefully, is robust to outliers, provides feature importances, and achieves strong out-of-the-box performance. Establish your random forest baseline before investing in more complex models.

| Algorithm | Class | Key Hyperparameters | Notes |
| --- | --- | --- | --- |
| Logistic Regression | LogisticRegression | C, penalty, max_iter, solver | Strong baseline for text/linear problems; fast; interpretable coefficients |
| Random Forest | RandomForestClassifier / Regressor | n_estimators, max_depth, min_samples_leaf, max_features | Robust baseline; no scaling needed; built-in feature importance |
| Gradient Boosting | GradientBoostingClassifier / Regressor | n_estimators, learning_rate, max_depth, subsample | Often best on tabular data; slower to train; use HistGradientBoosting for speed |
| Support Vector Machine | SVC / SVR | C, kernel, gamma, degree | Excellent for high-dim data; scales poorly past ~50k samples; needs scaling |
| K-Nearest Neighbours | KNeighborsClassifier / Regressor | n_neighbors, metric, weights | No training phase; slow at prediction on large datasets; needs scaling |
| Linear Regression | LinearRegression | (none critical) | OLS; fast; interpretable; assumes linear relationship and normal residuals |
| Ridge Regression | Ridge | alpha | L2 regularisation; stabilises coefficients under multicollinearity; usually a better default than plain OLS |
| Lasso Regression | Lasso | alpha | L1 regularisation; induces sparsity; automatic feature selection |
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# ── Classification comparison ─────────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=30, n_informative=15,
                            random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=42, stratify=y)
classifiers = {
    'LogisticReg':  make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
    'GradBoost':    HistGradientBoostingClassifier(random_state=42),
    'SVM':          make_pipeline(StandardScaler(), SVC(probability=True)),
    'KNN':          make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
}
print("=== Classifier Comparison (5-fold CV ROC-AUC) ===")
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)
    print(f"  {name:15s}: {scores.mean():.4f} ± {scores.std():.4f}")

# ── Feature importances from RandomForest ────────────────────────────────────
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_k = np.argsort(importances)[::-1][:5]
print("\nTop 5 features by importance:")
for rank, idx in enumerate(top_k, 1):
    print(f"  {rank}. Feature {idx}: {importances[idx]:.4f}")

# ── Regression: Ridge vs Lasso ────────────────────────────────────────────────
Xr, yr = make_regression(n_samples=500, n_features=50, n_informative=20,
                          noise=10, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2,
                                                        random_state=42)
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=2000))
    ridge.fit(Xr_train, yr_train)
    lasso.fit(Xr_train, yr_train)
    r_coef = ridge.named_steps['ridge'].coef_
    l_coef = lasso.named_steps['lasso'].coef_
    print(f"alpha={alpha:6.1f}  Ridge R²={ridge.score(Xr_test,yr_test):.3f}  "
          f"Lasso R²={lasso.score(Xr_test,yr_test):.3f}  "
          f"Lasso zeros={np.sum(l_coef==0)}")

🛠 Preprocessing & Feature Engineering

Scalers

Many algorithms (SVM, KNN, neural networks, logistic regression with regularisation) require features on similar scales. Tree-based methods (RandomForest, GBM) do not require scaling.

  • StandardScaler — zero mean, unit variance; sensitive to outliers
  • MinMaxScaler — scales to [0,1]; sensitive to outliers
  • RobustScaler — uses median/IQR; robust to outliers
  • MaxAbsScaler — divides by max absolute value; sparse-safe
  • PowerTransformer — makes data more Gaussian-like
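
A tiny sketch (toy array) of why the outlier sensitivity above matters in practice:

```python
# One extreme outlier: StandardScaler's mean/std are dragged by it,
# squashing the normal points; RobustScaler's median/IQR ignore it.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

print(std[:4].ravel().round(2))   # normal points collapse to ~-0.5 each
print(rob[:4].ravel().round(2))   # normal points keep their relative spread
```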

Encoders

ML algorithms require numeric inputs. Categorical features must be encoded. The right encoder depends on the cardinality and the downstream algorithm.

  • OneHotEncoder — one binary column per category; high cardinality explodes dimensionality
  • OrdinalEncoder — integer codes; use for tree models or ordered categories
  • LabelEncoder — encodes target variable y only
  • TargetEncoder — mean-encode with cross-fitting; great for GBM
  • High-cardinality: consider hashing trick or embedding layers

Other Transformers

Feature engineering transformers can be composed into pipelines and cross-validated exactly like models.

  • SimpleImputer — fill NaN with mean/median/mode/constant
  • PolynomialFeatures — interaction terms and powers
  • FunctionTransformer — wrap any numpy function
  • SelectKBest — univariate feature selection
  • PCA — dimensionality reduction
  • ColumnTransformer — apply different steps per column subset
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# ── Heterogeneous dataset with numeric and categorical features ───────────────
rng = np.random.default_rng(42)
n = 800
df = pd.DataFrame({
    'age':        rng.integers(18, 70, n).astype(float),
    'salary':     rng.normal(60000, 20000, n),
    'tenure':     rng.integers(0, 30, n).astype(float),
    'score':      rng.normal(75, 15, n),
    'department': rng.choice(['Eng', 'Sales', 'HR', 'Finance'], n),
    'education':  rng.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'target':     rng.integers(0, 2, n)
})

# Inject some missing values
df.loc[rng.choice(n, 60, replace=False), 'age'] = np.nan
df.loc[rng.choice(n, 40, replace=False), 'department'] = np.nan

X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      stratify=y, random_state=42)

# ── Define column groups ──────────────────────────────────────────────────────
numeric_features     = ['age', 'salary', 'tenure', 'score']
categorical_features = ['department', 'education']

# ── Sub-pipelines per column type ─────────────────────────────────────────────
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  RobustScaler()),          # robust to outlier salaries
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# ── ColumnTransformer assembles the two sub-pipelines ─────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline,     numeric_features),
    ('cat', categorical_pipeline, categorical_features),
], remainder='drop')

# ── Full pipeline with classifier ─────────────────────────────────────────────
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier',   RandomForestClassifier(n_estimators=200, random_state=42)),
])

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Inspect preprocessed shape (steps were fitted in place by full_pipeline.fit)
X_transformed = full_pipeline.named_steps['preprocessor'].transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"After preprocessing: {X_transformed.shape[1]}")
# numeric: 4, OHE department: 4, OHE education: 4 → 12 total
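
The remaining feature-engineering transformers from the list above compose the same way — a sketch chaining polynomial interactions with univariate selection (parameter choices here are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, random_state=42)

feat_pipe = Pipeline([
    ('scale',  StandardScaler()),
    ('poly',   PolynomialFeatures(degree=2, include_bias=False)),  # 8 → 44 cols
    ('select', SelectKBest(f_classif, k=15)),   # keep the 15 best of 44
    ('clf',    LogisticRegression(max_iter=1000)),
])
feat_pipe.fit(X, y)
print(f"Accuracy: {feat_pipe.score(X, y):.3f}")
```

Because the whole chain is one estimator, `degree` and `k` can be grid-searched as `poly__degree` and `select__k`.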

🎯 Model Selection & Evaluation

Cross-Validation

Never evaluate a model on the same data used to train it. Cross-validation provides a reliable estimate of generalisation performance by averaging results over multiple train/test splits.

  • cross_val_score — k-fold CV, returns per-fold scores
  • cross_validate — also returns fit/score time and train scores
  • StratifiedKFold — preserves class balance per fold
  • RepeatedStratifiedKFold — repeat k-fold for variance estimate
  • Use n_jobs=-1 to parallelise across folds
  • Report mean ± std, not just mean accuracy
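
A sketch of cross_validate, which adds timing and train scores on top of what cross_val_score returns — a train score far above the test score is a quick overfitting signal:

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

res = cross_validate(RandomForestClassifier(random_state=0), X, y,
                     cv=cv, scoring='roc_auc',
                     return_train_score=True, n_jobs=-1)
print(f"test:  {res['test_score'].mean():.3f} ± {res['test_score'].std():.3f}")
print(f"train: {res['train_score'].mean():.3f}")
print(f"fit time/fold: {res['fit_time'].mean():.2f}s")
```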

Hyperparameter Search

Default hyperparameters are rarely optimal. Systematic search over a defined grid or distribution finds better configurations, but must be done carefully to avoid overfitting on the validation set.

  • GridSearchCV — exhaustive; use for small grids
  • RandomizedSearchCV — samples n_iter combinations; better for large spaces
  • HalvingGridSearchCV — successive halving; faster
  • Optuna / Hyperopt — Bayesian optimisation for larger search spaces
  • Always use the CV object's .best_estimator_ for final eval
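
A sketch of successive halving (still experimental in scikit-learn, hence the enable_* import; the API may shift between versions, and the grid here is illustrative):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)

search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {'max_depth': [3, 5, 8, None], 'min_samples_leaf': [1, 5, 20]},
    factor=3,              # keep the top 1/3 of candidates each round
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Early rounds evaluate all candidates on small sample budgets; only survivors get the full data, which is why it beats exhaustive search on large grids.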

Scoring Metrics

Accuracy is often the wrong metric. Imbalanced classes, different misclassification costs, and probability calibration all demand specialised metrics. Choose your metric before looking at results.

  • accuracy — misleading on imbalanced data
  • f1 / f1_macro / f1_weighted — precision/recall balance
  • roc_auc — probability ranking quality
  • average_precision — area under PR curve
  • neg_mean_squared_error — regression MSE (negated)
  • r2 — coefficient of determination
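
To see why accuracy misleads on imbalance, a sketch using a DummyClassifier that always predicts the majority class:

```python
# On a ~95/5 split, predicting the majority class scores ~95% accuracy
# while detecting zero positives — F1 exposes the failure immediately.
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
y_pred = dummy.predict(X_te)
print(f"accuracy: {accuracy_score(y_te, y_pred):.3f}")   # looks great
print(f"f1:       {f1_score(y_te, y_pred):.3f}")         # → 0.0
```
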
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import (
    train_test_split, GridSearchCV, RandomizedSearchCV,
    cross_val_score, StratifiedKFold
)
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score, f1_score
)
from sklearn.datasets import make_classification
import numpy as np
from scipy.stats import randint, uniform

# Imbalanced dataset: 90% negative, 10% positive (fraud detection-style)
X, y = make_classification(
    n_samples=3000, n_features=25, n_informative=12,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── Step 1: Cross-validate baseline ───────────────────────────────────────────
baseline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X_train, y_train,
                         cv=cv, scoring='roc_auc', n_jobs=-1)
print(f"Baseline ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# ── Step 2: RandomizedSearchCV to find good hyperparameters ──────────────────
param_dist = {
    'clf__n_estimators':   randint(100, 500),
    'clf__max_depth':      randint(2, 8),
    'clf__learning_rate':  uniform(0.01, 0.2),
    'clf__subsample':      uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    baseline,
    param_distributions=param_dist,
    n_iter=40,                    # sample 40 random combinations
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1,
    return_train_score=True,
)
search.fit(X_train, y_train)

print(f"\nBest ROC-AUC (CV): {search.best_score_:.4f}")
print("Best params:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")

# ── Step 3: Evaluate best model on held-out test set ─────────────────────────
best_model = search.best_estimator_
y_pred      = best_model.predict(X_test)
y_prob      = best_model.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC:        {roc_auc_score(y_test, y_prob):.4f}")
print(f"Test Avg Precision:  {average_precision_score(y_test, y_prob):.4f}")
print(f"Test F1 (weighted):  {f1_score(y_test, y_pred, average='weighted'):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative','Positive']))

📊 Unsupervised Learning

| Algorithm | Class | Use Case | Key Parameters | Output |
| --- | --- | --- | --- | --- |
| K-Means | KMeans | Customer segmentation, vector quantisation, image compression | n_clusters, init, n_init | Cluster labels, centroids, inertia |
| DBSCAN | DBSCAN | Arbitrary-shape clusters, noise/outlier detection | eps, min_samples | Cluster labels (-1 = noise) |
| Gaussian Mixture | GaussianMixture | Soft clustering, density estimation, generative modelling | n_components, covariance_type | Soft labels, log-likelihood |
| PCA | PCA | Dimensionality reduction, visualisation, noise removal | n_components | Projected data, explained variance ratio |
| Isolation Forest | IsolationForest | Anomaly detection in high-dimensional tabular data | contamination, n_estimators | Anomaly scores, inlier/outlier labels |
| UMAP / t-SNE | TSNE / umap-learn | 2D/3D visualisation of high-dimensional embeddings | n_components, perplexity (t-SNE) | 2D/3D coordinates for plotting |
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs, make_moons
import numpy as np

rng = np.random.default_rng(42)

# ── K-Means with Elbow Method ─────────────────────────────────────────────────
X_blobs, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_blobs)

inertias    = []
silhouettes = []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(
        silhouette_score(X_scaled, labels, sample_size=300, random_state=42))

best_k = k_range.start + np.argmax(silhouettes)
print(f"Best k by silhouette score: {best_k}")
print("Silhouette scores:", [f"{s:.3f}" for s in silhouettes])

km_final = KMeans(n_clusters=best_k, n_init=20, random_state=42)
km_final.fit(X_scaled)
print(f"Cluster sizes: {np.bincount(km_final.labels_)}")

# ── DBSCAN for non-convex clusters ────────────────────────────────────────────
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.15, min_samples=5)
labels_db = db.fit_predict(X_moons)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise    = (labels_db == -1).sum()
print(f"\nDBSCAN: {n_clusters} clusters, {n_noise} noise points")

# ── PCA for visualisation + variance analysis ─────────────────────────────────
X_high_dim = rng.standard_normal((300, 50))
X_high_dim[:150] += 2.0    # class 0 offset

pca = PCA()
pca.fit(StandardScaler().fit_transform(X_high_dim))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"\nPCA: {n_95} components explain 95% of variance")

pca_vis = PCA(n_components=2)
X_2d = pca_vis.fit_transform(StandardScaler().fit_transform(X_high_dim))
print(f"2D projection: {X_2d.shape}, variance explained: {pca_vis.explained_variance_ratio_.sum():.2%}")

# ── IsolationForest for anomaly detection ─────────────────────────────────────
X_normal = rng.standard_normal((950, 10))
X_anomaly = rng.standard_normal((50, 10)) * 4 + 6   # anomalies far from origin
X_all = np.vstack([X_normal, X_anomaly])
true_labels = np.hstack([np.ones(950), -np.ones(50)])  # 1=normal, -1=anomaly

iforest = IsolationForest(contamination=0.05, random_state=42, n_estimators=200)
pred_labels = iforest.fit_predict(X_all)

TP = ((pred_labels == -1) & (true_labels == -1)).sum()
FP = ((pred_labels == -1) & (true_labels ==  1)).sum()
print(f"\nIsolation Forest: {TP} true anomalies detected, {FP} false positives")
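
GaussianMixture appears in the table above but not in the code; a sketch of its defining feature, soft assignments (blob parameters chosen for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

probs = gmm.predict_proba(X)       # shape (400, 3); each row sums to 1
hard  = gmm.predict(X)             # argmax of the soft assignments
print(f"avg max membership: {probs.max(axis=1).mean():.3f}")
print(f"log-likelihood per sample: {gmm.score(X):.2f}")
```

Points near a boundary get split membership (e.g. 0.6/0.4) rather than a forced hard label — useful when cluster overlap itself is the signal.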