Measuring how well your model actually works: choosing the right yardstick
For classification tasks, choosing the right metric is critical. The wrong metric can lead you to deploy a model that is dangerously wrong in ways the metric hides. Accuracy alone is almost never sufficient.
Precision and recall are inversely related for most classifiers. Adjusting the classification threshold changes where on this tradeoff curve your model sits. There is no universally right answer: the optimal balance depends entirely on the cost of each type of error.
| Scenario | Optimize For | Why |
|---|---|---|
| Cancer screening (diagnostic test) | High Recall | Missing a cancer case is far worse than a false positive that triggers follow-up tests |
| Spam email filter | High Precision | Marking a legitimate email as spam destroys trust; missing some spam is acceptable |
| Fraud detection (auto-block) | High Precision | Blocking real transactions frustrates customers; high recall version should alert for review |
| Search result ranking | Precision@K | Top-k results should be relevant; exhaustive retrieval matters less than top-result quality |
| General imbalanced classification | F1-Score | Balanced consideration of both precision and recall on skewed class distributions |
Imagine 99% of transactions are legitimate, 1% are fraudulent. A model that simply labels everything as "legitimate" achieves 99% accuracy, but has 0% recall for fraud. This makes accuracy useless as a metric here. Always check class distributions before choosing metrics.
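A minimal sketch of this accuracy paradox in pure Python (the labels below are hypothetical, chosen to match the 99/1 split):

```python
# Hypothetical imbalanced dataset: 990 legitimate (0), 10 fraudulent (1)
y_true = [0] * 990 + [1] * 10

# Degenerate model that labels everything "legitimate"
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraction of actual frauds that were caught
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")  # accuracy=0.99, fraud recall=0.00
```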
The confusion matrix is the foundational tool for understanding classifier behavior. It breaks down all predictions into four categories based on the combination of predicted and actual class. Every classification metric can be derived from it.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative, a "miss") |
| Actual Negative | FP (False Positive, a "false alarm") | TN (True Negative) |
From the confusion matrix, all key metrics follow directly:

- Accuracy = (TP + TN) / (TP + FP + FN + TN)
- Precision = TP / (TP + FP)
- Recall (TPR, sensitivity) = TP / (TP + FN)
- False Positive Rate (FPR) = FP / (FP + TN)
- F1 = 2 × Precision × Recall / (Precision + Recall)
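As a sketch, here is how those definitions translate to code (the counts passed in at the end are hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard metrics from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of actual positives, how many were caught (TPR)
    fpr = fp / (fp + tn)         # false alarms among actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, fpr, f1

# Hypothetical counts: 80 true positives, 20 false alarms, 10 misses
acc, prec, rec, fpr, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=890)
```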
Most classifiers produce a probability score rather than a hard class label. By sweeping the classification threshold from 0 to 1, you trace out a curve in (FPR, TPR) space: the Receiver Operating Characteristic (ROC) curve. The Area Under the Curve (AUC) summarizes this entire curve in a single number.
The AUC equals the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example, making it a threshold-independent measure of discriminative power.
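This interpretation can be checked directly by counting positive/negative score pairs the model orders correctly, with ties counting half. The scores below are hypothetical; in practice you would use `sklearn.metrics.roc_auc_score`:

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC = P(random positive scores higher than random negative); ties count 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores: positives mostly, but not always, ranked above negatives
auc = auc_by_ranking([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])  # 8 of 9 pairs ordered correctly
```

This brute-force version is O(n·m); real implementations sort the scores instead, but the pairwise form makes the probabilistic definition explicit.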
ROC curves can look optimistic on heavily imbalanced datasets because TN (true negatives) inflate the denominator of FPR. For tasks where the positive class is rare (fraud, disease, anomaly), the Precision-Recall (PR) curve is a more honest view of model performance.
| Use ROC/AUC | Use PR Curve |
|---|---|
| Balanced classes | Highly imbalanced classes |
| Both classes equally important | Positive class is rare and critical |
| General benchmark comparison | Fraud, anomaly, medical detection |
Once you have a trained model and its ROC curve, you can pick the operating threshold to meet your deployment requirements. The Youden Index (maximize TPR - FPR) finds the point furthest from the diagonal. For cost-sensitive scenarios, weight FP and FN costs explicitly to find the threshold that minimizes expected cost.
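A sketch of Youden-style threshold selection, using hypothetical labels and scores (a real pipeline would get candidate thresholds from `sklearn.metrics.roc_curve`):

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = 0.0, float("-inf")
    for t in sorted(set(scores)):          # each distinct score is a candidate cutoff
        tp = sum(y == 1 and s >= t for y, s in zip(y_true, scores))
        fp = sum(y == 0 and s >= t for y, s in zip(y_true, scores))
        j = tp / pos - fp / neg            # TPR - FPR at this threshold
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical scores for 3 positives and 3 negatives
t, j = youden_threshold([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.1])
```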
For regression tasks (predicting a continuous value), the appropriate metrics measure the magnitude and distribution of prediction errors. Each metric has different sensitivities to outliers and scale, making the choice non-trivial.
| Metric | Formula | Scale-dependent? | Outlier-sensitive? | Best Used When |
|---|---|---|---|---|
| MAE (Mean Absolute Error) | mean(\|y - ŷ\|) | Yes | No (robust) | Outliers are present; you want median-like behavior; interpretable in original units |
| MSE (Mean Squared Error) | mean((y - ŷ)²) | Yes (squared) | Yes (strongly) | Large errors are especially harmful; differentiable, so useful as a training loss |
| RMSE (Root MSE) | sqrt(MSE) | Yes | Yes | Same units as target; penalizes large errors; most common regression benchmark metric |
| R² (R-squared) | 1 - SS_res/SS_tot | No (normalized) | Moderate | Explaining proportion of variance captured; comparing models on same dataset |
| MAPE (Mean Abs. % Error) | mean(\|y - ŷ\|/\|y\|) × 100 | No (percentage) | No | Relative error matters; comparing across different scales; target never near zero |
| Huber Loss | MSE for small errors, MAE for large | Yes | Moderate (hybrid) | Robust training with some penalty for large deviations; best of MAE and MSE |
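The first four rows of the table can be sketched in a few lines of pure Python (the sample arrays at the end are hypothetical):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Hypothetical targets and predictions
mae, mse, rmse, r2 = regression_metrics([1, 2, 3, 4], [1.5, 2.5, 2.5, 3.5])
```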
R² = 1.0 means the model explains all variance in the target. R² = 0 means the model performs no better than predicting the mean. R² can be negative, meaning your model is worse than always predicting the mean. Adding more features never decreases R² on training data, which is why Adjusted R² (which penalizes extra features) is often preferred.
MAPE breaks down when the true value is zero or near zero, producing undefined or extremely large values. It is also asymmetric: underestimates are bounded by 100% while overestimates are unbounded. For demand forecasting with zero values (e.g., stockouts), use sMAPE or WAPE instead.
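A sketch of MAPE next to one common sMAPE variant (the 0-200% form; sMAPE definitions vary across tools, so treat this as illustrative):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when any true value is zero."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE, bounded in [0, 200]; an all-zero pair contributes 0."""
    return 100 * sum(
        2 * abs(t - p) / (abs(t) + abs(p)) if (t or p) else 0.0
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)
```

Calling `mape` with a zero target raises `ZeroDivisionError`, while `smape` handles zero targets gracefully, which is exactly the stockout scenario above.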
A model predicting the majority class 100% of the time will report high accuracy on imbalanced data. Always check class balance first. If the minority class is <10% of the data, use F1, precision-recall AUC, or Matthews Correlation Coefficient (MCC) as your primary metric.
Training accuracy measures memorization, not generalization. A model with 100% training accuracy and 60% test accuracy has severe overfitting. Always hold out a test set, use cross-validation, and report only generalization performance in any meaningful evaluation.
A model can have excellent AUC but terrible calibration: it ranks examples correctly, but its probability outputs are wildly miscalibrated (e.g., all "high confidence" predictions are actually correct only 40% of the time). For risk scoring, medical prognosis, or any application that uses probability values directly, apply Platt scaling or isotonic regression to calibrate probabilities after training.
A model evaluated on test data sampled from the same distribution as training may perform very differently on data collected 6 months later. When deploying, monitor metrics continuously. Design offline evaluation to mimic production data distribution as closely as possible โ same time period, same data sources, same pipeline.