A model trained on yesterday's data is increasingly wrong about today's world. Unlike traditional software, which does exactly what it's programmed to do until someone changes it, ML models silently degrade as the world they were trained on diverges from the world they're predicting in.
The World Changes
The underlying patterns that make a model accurate are not permanent. User behavior, economic conditions, language usage, security threats, and sensor calibration all change continuously.
- COVID-19: shattered recommendation systems trained on pre-pandemic behavior (travel, restaurants, entertainment)
- Fraud models: attackers actively learn to evade detection, causing rapid concept drift
- Language models: new slang, product names, and events appear faster than retraining cycles
- Industrial sensors: calibration drift changes the input distribution without any semantic change
Data Drift (Covariate Shift)
The input distribution P(X) changes between training and serving time. The model's learned decision boundaries, calibrated on old data, become misaligned with the new distribution.
- New user demographics reach the product after launch in a new market
- Seasonal effects: summer vs winter behavior patterns
- Upstream pipeline changes silently alter a feature's range or meaning
- Sensor hardware upgrade changes baseline readings without code change
Concept Drift
The underlying relationship between inputs and outputs changes: P(Y|X) shifts. The features that once predicted the label no longer do so with the same reliability.
- Customer churn: the signals that predicted churn 2 years ago have changed
- Credit risk: economic downturns change which features predict default
- Security: threat actors change tactics specifically to evade detection models
- Much harder to detect than data drift: requires ground-truth labels
Upstream Pipeline Changes
Models fail silently when the data infrastructure around them changes without corresponding model updates.
- Schema changes: a column is renamed, removed, or its units change
- Business logic changes: definition of "active user" or "completed purchase" evolves
- Data source migration: switch from one database to another with subtle differences
- ETL bugs: incorrect join logic silently corrupts feature values
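A lightweight schema contract at the model boundary catches many of these failures before they corrupt predictions. The sketch below is illustrative (the column names and types are made up; production systems typically use a dedicated validation library such as Great Expectations or pandera):

```python
# Minimal schema contract check at the model's input boundary.
# Column names and types here are hypothetical examples.
EXPECTED_SCHEMA = {
    "user_id": str,
    "transaction_amount": float,  # assumed unit: USD
    "n_sessions_7d": int,
}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations for one feature record."""
    errors = []
    for col, col_type in schema.items():
        if col not in record:
            errors.append(f"missing column: {col}")
        elif not isinstance(record[col], col_type):
            errors.append(f"{col}: expected {col_type.__name__}, "
                          f"got {type(record[col]).__name__}")
    return errors
```

Rejecting or quarantining records that fail this check turns a silent ETL bug into a visible error-rate spike, which the alerting tiers described later can pick up.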
| Drift Type | Definition | Detection Method | Example |
|---|---|---|---|
| Covariate / Data Drift | P(X) changes; the input feature distribution shifts | Statistical tests on feature distributions vs training baseline (PSI, KS, chi-squared) | Age distribution of users shifts from 25–35 to 35–55 as product matures |
| Prior Probability Drift | P(Y) changes; the label distribution shifts even if inputs stay the same | Monitor prediction distribution vs training label distribution | Fraud rate doubles during holiday season; model predicts too few fraudulent transactions |
| Concept Drift | P(Y|X) changes; the relationship between inputs and labels changes | Requires ground-truth labels; track accuracy/AUC vs time; error rate analysis | Fraud attack vector changes; previously safe transaction patterns now fraudulent |
| Label Drift | Labeling criteria or annotation guidelines change over time | Compare annotation consistency across time; inter-annotator agreement | Support ticket "severity" definition changed by operations team; model trained on old criteria |
| Feature Drift | Individual feature statistics change (mean, std, range, cardinality) | Per-feature statistical monitoring; alert on mean shift, new categorical values | Transaction amount feature mean doubles after new premium product launch |
| Prediction Drift | Model's output distribution changes: predictions cluster differently | Monitor distribution of predicted probabilities / class assignments over time | Recommendation model starts recommending only a narrow set of items |
Concept Drift Is the Hardest to Detect
Covariate drift can be detected immediately from input feature distributions alone. Concept drift requires ground-truth labels to be observed, and in many systems true labels arrive days, weeks, or months after the prediction (loan default: 90 days; cancer recurrence: months to years). This feedback lag means concept drift often goes undetected until model performance has already significantly degraded. Designing short feedback loops (proxy labels, human review queues, rapid A/B testing) is essential for systems where concept drift is expected.
Population Stability Index (PSI)
PSI is the most widely used drift metric in financial services. It measures the shift between a reference distribution (training) and a current distribution (serving).
```python
import numpy as np

def psi(expected, actual, bins=10):
    """
    Calculate PSI between expected (training)
    and actual (serving) distributions.
    """
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints = np.unique(breakpoints)
    # Extend the outer bins to +/-inf so serving values outside the
    # training range are counted instead of silently dropped
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, breakpoints)[0] / len(actual)
    # Avoid log(0) with a small epsilon
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
```
Interpretation: PSI < 0.1: no drift; 0.1–0.2: moderate drift; >0.2: significant drift, investigate.
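A quick sanity check of these thresholds on synthetic data (the compact `psi` below restates the function above so this snippet runs standalone; the 0.5-sigma shift is an arbitrary example):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Compact PSI, same logic as the function above."""
    edges = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.where(e == 0, 1e-6, e), np.where(a == 0, 1e-6, a)
    return np.sum((a - e) * np.log(a / e))

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 50_000)        # training reference
serve_ok = rng.normal(0.0, 1.0, 50_000)     # same distribution
serve_shift = rng.normal(0.5, 1.0, 50_000)  # mean shifted by 0.5 sigma

print(round(psi(train, serve_ok), 4))     # near zero: stable
print(round(psi(train, serve_shift), 4))  # above 0.2: significant drift
```

A half-sigma mean shift already lands in the "significant" band, which is why PSI on raw features tends to fire early, well before accuracy metrics move.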
Kolmogorov-Smirnov (KS) Test
The KS test measures the maximum absolute difference between two empirical cumulative distribution functions. Ideal for continuous numerical features.
```python
from scipy.stats import ks_2samp

def ks_drift_test(reference, current, alpha=0.05):
    """
    Two-sample KS test for distribution drift.
    Returns: statistic, p-value, drift_detected
    """
    stat, p_value = ks_2samp(reference, current)
    return {
        "statistic": stat,
        "p_value": p_value,
        "drift_detected": p_value < alpha,
        "severity": "high" if stat > 0.2
                    else "medium" if stat > 0.1
                    else "low",
    }
```
- Non-parametric: no normality assumption
- Driven by the largest CDF gap, which usually occurs near the center of the distribution; relatively insensitive to tail differences (consider Anderson–Darling if tails matter)
- Requires sufficient sample size (N > 30 per group)
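A self-contained example using `scipy.stats.ks_2samp` directly (the shift magnitudes are arbitrary illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 2_000)  # training-time feature values
current = rng.normal(0.4, 1.2, 2_000)    # serving: shifted mean and variance

stat, p_value = ks_2samp(reference, current)
print(f"KS statistic={stat:.3f}, p={p_value:.2e}")
```

One caveat worth designing around: with very large samples, the p-value flags even negligible shifts as "significant", so it is common to gate alerts on the statistic itself (an effect size) rather than on the p-value alone.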
Chi-Squared Test
For categorical features, chi-squared tests whether the observed category frequencies deviate significantly from the expected (training) frequencies.
```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_drift_test(ref_counts, cur_counts, alpha=0.05):
    """
    Chi-squared test for categorical drift.
    ref_counts, cur_counts: dict of {category: count}
    """
    categories = sorted(set(ref_counts) | set(cur_counts))
    ref = np.array([ref_counts.get(c, 0) for c in categories])
    cur = np.array([cur_counts.get(c, 0) for c in categories])
    # Contingency table: rows = [reference, current]
    table = np.array([ref, cur])
    chi2, p, _, _ = chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p,
        "drift_detected": p < alpha,
    }
```
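Applying this to illustrative category counts (the device-type mix and numbers are made up; the function is restated so the snippet runs standalone):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_drift_test(ref_counts, cur_counts, alpha=0.05):
    """Same logic as the function above, repeated so this runs standalone."""
    categories = sorted(set(ref_counts) | set(cur_counts))
    table = np.array([
        [ref_counts.get(c, 0) for c in categories],
        [cur_counts.get(c, 0) for c in categories],
    ])
    chi2, p, _, _ = chi2_contingency(table)
    return {"chi2": chi2, "p_value": p, "drift_detected": p < alpha}

# Device-type mix at training time vs a window of serving traffic
ref = {"mobile": 500, "desktop": 300, "tablet": 200}
stable = {"mobile": 510, "desktop": 290, "tablet": 200}   # noise-level change
shifted = {"mobile": 200, "desktop": 300, "tablet": 500}  # large mix shift

print(chi2_drift_test(ref, stable)["drift_detected"])   # False
print(chi2_drift_test(ref, shifted)["drift_detected"])  # True
```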
Jensen-Shannon Divergence & CUSUM
JSD is a symmetric, smoothed relative of KL divergence, which makes it easier to interpret as a drift score than raw KL. Note that SciPy's `jensenshannon` returns the Jensen-Shannon distance (the square root of the divergence); with `base=2` it is bounded between 0 and 1.
```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_drift(ref_hist, cur_hist):
    """
    Jensen-Shannon distance between histograms.
    base=2 bounds the result in [0, 1]; 0 = identical distributions.
    """
    ref_hist = ref_hist / ref_hist.sum()
    cur_hist = cur_hist / cur_hist.sum()
    return jensenshannon(ref_hist, cur_hist, base=2)

# CUSUM for sequential drift detection
def cusum(values, target_mean, threshold=5, slack=0.5):
    """Cumulative sum control chart for sequential data."""
    cusum_pos, cusum_neg = 0, 0
    for v in values:
        cusum_pos = max(0, cusum_pos + v - target_mean - slack)
        cusum_neg = max(0, cusum_neg + target_mean - v - slack)
        if cusum_pos > threshold or cusum_neg > threshold:
            return True, cusum_pos, cusum_neg
    return False, cusum_pos, cusum_neg
```
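A deterministic walkthrough of the CUSUM behavior on a stream whose mean jumps (the function is restated so this runs standalone; the jump size and thresholds are illustrative):

```python
# A stream whose mean jumps from 10 to 12 partway through
stream = [10.0] * 20 + [12.0] * 10

def cusum(values, target_mean, threshold=5, slack=0.5):
    """Same control chart as above, repeated so this runs standalone."""
    pos, neg = 0, 0
    for v in values:
        pos = max(0, pos + v - target_mean - slack)
        neg = max(0, neg + target_mean - v - slack)
        if pos > threshold or neg > threshold:
            return True, pos, neg
    return False, pos, neg

drifted, pos, neg = cusum(stream, target_mean=10.0)
print(drifted)  # True: each post-jump point adds 12 - 10 - 0.5 = 1.5,
                # so the positive CUSUM crosses threshold 5 after 4 points
```

The slack term is what keeps ordinary noise (deviations under 0.5 here) from accumulating, so the chart only reacts to a sustained shift.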
Drift Detection Methods Summary
| Method | Feature Type | Threshold | Python Library |
|---|---|---|---|
| PSI (Population Stability Index) | Continuous, ordinal | <0.1 stable; 0.1–0.2 warning; >0.2 critical | Custom / evidently |
| KS (Kolmogorov-Smirnov) test | Continuous | p-value < 0.05 indicates drift; stat > 0.1 for severity | scipy.stats.ks_2samp |
| Chi-squared test | Categorical | p-value < 0.05 indicates drift | scipy.stats.chi2_contingency |
| Jensen-Shannon Divergence | Any (histogram) | >0.1 warning; >0.2 critical | scipy.spatial.distance.jensenshannon |
| CUSUM | Sequential time series | Threshold tuned per application (typically 4–5 sigma) | Custom / ruptures |
| Maximum Mean Discrepancy | High-dimensional vectors, embeddings | Bootstrap-based p-value | alibi-detect |
What to Log
Comprehensive logging is the foundation of monitoring. You cannot detect what you don't observe.
- Input features: all features sent to the model at serving time (for drift detection)
- Predictions: raw scores/logits, predicted class, confidence
- Latency: preprocessing time, inference time, total request time (P50, P95, P99)
- Error rates: 4xx/5xx rates, preprocessing failures, schema validation failures
- Ground truth: link predictions to delayed labels when available
- Request metadata: timestamp, user segment, model version, server ID
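The bullets above can be collapsed into one structured log line per prediction. A stdlib-only sketch (all field names are illustrative, not a standard):

```python
import json
import time
import uuid

def build_prediction_log(features, score, predicted_class,
                         model_version, latency_ms, user_segment=None):
    """One JSON line per prediction; field names are illustrative."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),  # join key for delayed labels
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,             # raw inputs, for drift detection
        "score": score,                   # raw score, not just the class
        "predicted_class": predicted_class,
        "latency_ms": latency_ms,
        "user_segment": user_segment,
        "label": None,                    # back-filled when ground truth arrives
    })

line = build_prediction_log(
    features={"amount": 42.0, "country": "DE"},
    score=0.91, predicted_class="fraud",
    model_version="fraud-v3.2", latency_ms=12.4,
)
```

The `request_id` is the critical field: without a stable join key, delayed ground-truth labels can never be linked back to the predictions they grade.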
Evidently AI
Open-source Python library for ML model monitoring and data quality evaluation. Produces interactive HTML reports and JSON metrics for integration into CI/CD and dashboards.
- Data drift report: statistical tests on all features with visual distributions
- Target drift: monitor prediction distribution and label drift
- Data quality report: missing values, duplicates, schema violations
- Works offline (batch comparison) and as a monitoring service
- Integrates with MLflow, Grafana, and custom dashboards via JSON output
Grafana + Prometheus
The de facto standard for real-time infrastructure monitoring, extended for ML by exposing model-specific metrics via a Prometheus endpoint from the serving container.
- Expose custom metrics: prediction score distribution, drift PSI, feature means
- Grafana dashboards: time-series visualizations of all logged metrics
- Alertmanager: route alerts to Slack, PagerDuty, email based on rules
- Low overhead: Prometheus scrapes the model container's /metrics endpoint every 15s
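What a scrape actually receives is plain text in the Prometheus exposition format. In practice you would use the prometheus_client library to expose this; the dependency-free sketch below just renders gauge lines to show the shape (the metric names are made up):

```python
def render_metrics(metrics):
    """
    Render gauges in the Prometheus text exposition format, i.e. the
    body a scrape of /metrics would return. Metric names here are
    hypothetical; use prometheus_client in real services.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({
    "model_feature_psi": ("PSI of key feature vs training baseline", 0.13),
    "model_prediction_mean": ("Mean predicted score over last window", 0.42),
})
```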
Commercial Platforms
Managed monitoring platforms reduce operational overhead at the cost of vendor lock-in and subscription fees.
- WhyLabs: automatic statistical profiling; drift detection; integrates via whylogs SDK in one line
- Arize AI: production observability; embedding drift; model performance tracing; strong explainability features
- Fiddler AI: enterprise focus; model explainability + monitoring; great for regulated industries
- AWS SageMaker Model Monitor: native to SageMaker; automatic baseline capture; scheduled monitoring jobs
Monitoring Tool Comparison
| Tool | Open Source | Key Capability | Integration Complexity |
|---|---|---|---|
| Evidently AI | Yes (Apache 2.0) | Comprehensive drift + quality HTML reports; JSON metrics for CI/CD gates | Low – Python library, drop-in |
| Grafana + Prometheus | Yes | Real-time dashboards; flexible alerting; standard infrastructure monitoring extended for ML | Medium – requires instrumentation + dashboard setup |
| WhyLabs | whylogs SDK is open; platform is SaaS | Automatic statistical profiling with minimal code changes; privacy-preserving (sends stats, not data) | Low – single SDK call per batch |
| Arize AI | No (SaaS) | Production observability; embedding drift; model performance tracing | Medium – SDK integration + data schema configuration |
| Alibi Detect | Yes (Apache 2.0) | Advanced drift detectors (MMD, LSDD, classifier-based); works on tabular, text, images | Medium – requires detector fitting and integration |
| SageMaker Model Monitor | No (AWS managed) | Automated baseline capture; scheduled drift jobs; direct integration with SageMaker endpoints | Low for SageMaker users; high lock-in |
Alert Tier Design
Not all drift is equally urgent. A tiered alerting strategy prevents alert fatigue while ensuring critical issues get immediate attention.
- P1 – Pager (immediate): serving error rate >1%, P99 latency >5× baseline, model output all same class
- P2 – Slack alert (within the hour): PSI >0.2 on key features, prediction distribution significantly shifted
- P3 – Dashboard / daily digest: gradual feature drift, slight prediction score distribution shift, PSI 0.1–0.2
- Set all thresholds before deployment, not after the first incident
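The tier assignments above can be encoded directly, which forces the thresholds to be written down before deployment. A sketch (thresholds mirror the bullets; the same-class-output check is omitted for brevity, and all values should be tuned per system):

```python
def alert_tier(error_rate, p99_latency, baseline_p99, psi_values):
    """
    Map monitored signals to P1/P2/P3 tiers.
    psi_values: dict of {feature_name: current PSI}.
    """
    if error_rate > 0.01 or p99_latency > 5 * baseline_p99:
        return "P1"  # page immediately
    if any(v > 0.2 for v in psi_values.values()):
        return "P2"  # Slack alert within the hour
    if any(0.1 <= v <= 0.2 for v in psi_values.values()):
        return "P3"  # daily digest
    return None      # all quiet
```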
On-Call ML Runbook Template
Every production model should have a documented runbook that any engineer can follow during an incident.
- Alert received: identify which alert, which model, and which metric triggered
- Impact assessment: what business function is affected? How many users?
- Rollback decision: if serving errors >2%, roll back immediately; investigate later
- Root cause: check upstream data pipeline, feature drift reports, recent deployments
- Resolution: data fix / retrain / rollback; document decision with evidence
- Post-mortem: scheduled within 5 business days for all P1/P2 incidents
Automated Retraining Triggers
When drift is detected and confirmed, retraining should be as automated as possible to minimize the window of degraded model performance.
- PSI >0.2 on 3+ features: trigger data validation + retraining pipeline
- Online accuracy drops below quality gate: trigger retraining + evaluation
- Scheduled weekly/monthly retraining regardless of drift metrics
- Retraining requires review: verify new training data is of acceptable quality before replacing champion
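The first two triggers above reduce to a small decision function, which a pipeline orchestrator can poll (the scheduled retrain is handled separately by cron; thresholds here echo the bullets and are assumptions to tune):

```python
def should_retrain(psi_by_feature, online_accuracy, accuracy_gate,
                   psi_threshold=0.2, min_drifted_features=3):
    """
    Returns (retrain, reason) from drift and accuracy signals.
    psi_by_feature: dict of {feature_name: PSI vs training baseline}.
    """
    drifted = [f for f, v in psi_by_feature.items() if v > psi_threshold]
    if len(drifted) >= min_drifted_features:
        return True, f"PSI > {psi_threshold} on {len(drifted)} features: {drifted}"
    if online_accuracy < accuracy_gate:
        return True, f"online accuracy {online_accuracy:.3f} below gate {accuracy_gate:.3f}"
    return False, "no trigger fired"
```

Returning a human-readable reason matters: it becomes the audit trail for the review step that gates replacing the champion model.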
Human-in-the-Loop for Concept Drift
Automated retraining handles covariate drift well. Concept drift, where the fundamental task has changed, requires human judgment.
- Flag cases where model confidence is high but early labels disagree
- Sample predictions for periodic human review queue
- Involve domain experts when retraining on potentially concept-drifted data
- Consider whether the model's objective function itself needs to change
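The first bullet is simple to mechanize: route confident-but-contradicted predictions to a review queue. A sketch (the record layout is an assumption for illustration):

```python
def flag_for_review(predictions, confidence_threshold=0.9):
    """
    Flag predictions where the model was confident but an early label
    disagrees: a possible concept-drift signal. Each prediction is a
    dict with 'id', 'predicted', 'confidence', and optional 'early_label'.
    """
    flagged = []
    for p in predictions:
        label = p.get("early_label")
        if (label is not None
                and p["confidence"] >= confidence_threshold
                and p["predicted"] != label):
            flagged.append(p["id"])
    return flagged
```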
Instrument Monitoring Before You Deploy
The single most common MLOps mistake is treating monitoring as something to add "after we see a problem." By then, the problem has already affected users, silently, for days or weeks. Monitoring infrastructure (logging, dashboards, alert thresholds, runbooks) should all be defined and tested before the first production deployment. If you can't monitor it, you're not ready to deploy it.
Post-Mortem Template
| Section | Content |
|---|---|
| Incident summary | One-paragraph description: what failed, when, how long, business impact |
| Timeline | Chronological sequence: first alert, triage start, root cause identified, resolution applied, monitoring confirmed stable |
| Root cause | Technical description of the drift/failure mechanism; supporting evidence (charts, logs) |
| Impact | Number of affected users/requests; downstream business metric impact; SLA breach assessment |
| Detection gap | How long between drift onset and alert? Why wasn't it caught sooner? |
| Resolution actions | What was done: rollback / retrain / data fix / threshold adjustment |
| Prevention items | Action items with owners and due dates to prevent recurrence (monitoring improvements, pipeline changes, alerting threshold adjustments) |