⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025
📉 Why Models Degrade

A model trained on yesterday's data is increasingly wrong about today's world. Unlike traditional software, which does exactly what it's programmed to do until someone changes it, ML models silently degrade as the world they were trained on diverges from the world they're predicting in.

The World Changes

The underlying patterns that make a model accurate are not permanent. User behavior, economic conditions, language usage, security threats, and sensor calibration all change continuously.

  • COVID-19: shattered recommendation systems trained on pre-pandemic behavior (travel, restaurants, entertainment)
  • Fraud models: attackers actively learn to evade detection, causing rapid concept drift
  • Language models: new slang, product names, and events appear faster than retraining cycles
  • Industrial sensors: calibration drift changes the input distribution without any semantic change

Data Drift (Covariate Shift)

The input distribution P(X) changes between training and serving time. The model's learned decision boundaries, calibrated on old data, become misaligned with the new distribution.

  • New user demographics reach the product after launch in a new market
  • Seasonal effects: summer vs winter behavior patterns
  • Upstream pipeline changes silently alter a feature's range or meaning
  • Sensor hardware upgrade changes baseline readings without code change

Concept Drift

The underlying relationship between inputs and outputs changes: P(Y|X) shifts. The features that once predicted the label no longer do so with the same reliability.

  • Customer churn: the signals that predicted churn 2 years ago have changed
  • Credit risk: economic downturns change which features predict default
  • Security: threat actors change tactics specifically to evade detection models
  • Much harder to detect than data drift; detection requires ground-truth labels

Upstream Pipeline Changes

Models fail silently when the data infrastructure around them changes without corresponding model updates.

  • Schema changes: a column is renamed, removed, or its units change
  • Business logic changes: definition of "active user" or "completed purchase" evolves
  • Data source migration: switch from one database to another with subtle differences
  • ETL bugs: incorrect join logic silently corrupts feature values

🌊 Types of Drift

Drift Type | Definition | Detection Method | Example
--- | --- | --- | ---
Covariate / Data Drift | P(X) changes; the input feature distribution shifts | Statistical tests on feature distributions vs training baseline (PSI, KS, chi-squared) | Age distribution of users shifts from 25–35 to 35–55 as the product matures
Prior Probability Drift | P(Y) changes; the label distribution shifts even if inputs stay the same | Monitor prediction distribution vs training label distribution | Fraud rate doubles during the holiday season; the model predicts too few fraudulent transactions
Concept Drift | P(Y\|X) changes; the relationship between inputs and labels changes | Requires ground-truth labels; track accuracy/AUC over time; error rate analysis | Fraud attack vector changes; previously safe transaction patterns are now fraudulent
Label Drift | Labeling criteria or annotation guidelines change over time | Compare annotation consistency across time; inter-annotator agreement | Support ticket "severity" definition changed by the operations team; the model was trained on the old criteria
Feature Drift | Individual feature statistics change (mean, std, range, cardinality) | Per-feature statistical monitoring; alert on mean shift or new categorical values | Transaction amount feature mean doubles after a new premium product launch
Prediction Drift | The model's output distribution changes; predictions cluster differently | Monitor distribution of predicted probabilities / class assignments over time | Recommendation model starts recommending only a narrow set of items

Concept Drift Is the Hardest to Detect

Covariate drift can be detected immediately from input feature distributions alone. Concept drift requires ground-truth labels to be observed, and in many systems true labels arrive days, weeks, or months after the prediction (loan default: 90 days; cancer recurrence: months to years). This feedback lag means concept drift often goes undetected until model performance has already degraded significantly. Designing short feedback loops (proxy labels, human review queues, rapid A/B testing) is essential for systems where concept drift is expected.
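One common way to shorten that loop is to join logged predictions against ground-truth labels as they trickle in, and track accuracy over only the matched subset. A minimal sketch, assuming predictions are keyed by a request ID (the record layout here is illustrative, not from any particular system):

```python
from datetime import datetime

# Hypothetical logged predictions, keyed by request ID
predictions = {
    "r1": {"pred": 1, "ts": datetime(2025, 1, 1)},
    "r2": {"pred": 0, "ts": datetime(2025, 1, 1)},
    "r3": {"pred": 1, "ts": datetime(2025, 1, 2)},
}

# Ground-truth labels that arrived later (e.g. chargebacks, defaults)
labels = {"r1": 1, "r2": 1}  # r3's true label has not arrived yet

def lagged_accuracy(predictions, labels):
    """Accuracy over only those predictions whose labels have arrived."""
    matched = [(p["pred"], labels[rid])
               for rid, p in predictions.items() if rid in labels]
    if not matched:
        return None, 0
    correct = sum(pred == label for pred, label in matched)
    return correct / len(matched), len(matched)

acc, n_matched = lagged_accuracy(predictions, labels)
# acc == 0.5 over n_matched == 2 labeled predictions
```

The coverage count matters as much as the accuracy: a high lagged accuracy computed over a small fraction of predictions says little about current performance.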

πŸ“ Statistical Drift Detection Methods

Population Stability Index (PSI)

PSI is the most widely used drift metric in financial services. It measures the shift between a reference distribution (training) and current distribution.

import numpy as np

def psi(expected, actual, bins=10):
    """
    Calculate PSI between expected (training)
    and actual (serving) distributions.
    """
    breakpoints = np.percentile(expected,
        np.linspace(0, 100, bins + 1))
    breakpoints = np.unique(breakpoints)  # drop duplicate bin edges

    # Clip serving values into the training range so new extremes
    # are counted in the edge bins instead of being silently dropped
    actual = np.clip(actual, breakpoints[0], breakpoints[-1])

    exp_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual,   breakpoints)[0] / len(actual)

    # Avoid log(0) with small epsilon
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)

    return np.sum(
        (act_pct - exp_pct) * np.log(act_pct / exp_pct)
    )

Interpretation: PSI < 0.1: no drift; 0.1–0.2: moderate drift; >0.2: significant drift; investigate.

Kolmogorov-Smirnov (KS) Test

The KS test measures the maximum absolute difference between two empirical cumulative distribution functions. Ideal for continuous numerical features.

from scipy.stats import ks_2samp
import numpy as np

def ks_drift_test(reference, current, alpha=0.05):
    """
    Two-sample KS test for distribution drift.
    Returns: statistic, p-value, drift_detected
    """
    stat, p_value = ks_2samp(reference, current)
    drift = p_value < alpha
    return {
        "statistic": stat,
        "p_value": p_value,
        "drift_detected": drift,
        "severity": "high" if stat > 0.2
                    else "medium" if stat > 0.1
                    else "low"
    }
  • Non-parametric β€” no normality assumption
  • Sensitive to differences in the tails of distributions
  • Requires sufficient sample size (N > 30 per group)

Chi-Squared Test

For categorical features, chi-squared tests whether the observed category frequencies deviate significantly from the expected (training) frequencies.

from scipy.stats import chi2_contingency
import numpy as np

def chi2_drift_test(ref_counts, cur_counts, alpha=0.05):
    """
    Chi-squared test for categorical drift.
    ref_counts, cur_counts: dict of {category: count}
    """
    categories = sorted(set(ref_counts) | set(cur_counts))
    ref = np.array([ref_counts.get(c, 0) for c in categories])
    cur = np.array([cur_counts.get(c, 0) for c in categories])

    # Contingency table: rows = [reference, current]
    table = np.array([ref, cur])
    chi2, p, dof, expected = chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p,
        "drift_detected": p < alpha
    }

Jensen-Shannon Divergence & CUSUM

JSD is a symmetric alternative to KL divergence. Note that scipy's jensenshannon returns the Jensen-Shannon distance (the square root of the divergence); with base=2 it is bounded in [0, 1], which makes it easier to interpret as a drift score than raw KL.

from scipy.spatial.distance import jensenshannon
import numpy as np

def js_drift(ref_hist, cur_hist):
    """
    Jensen-Shannon distance between histograms.
    With base=2, returns a value in [0, 1]; 0 = identical distributions.
    """
    ref_hist = ref_hist / ref_hist.sum()
    cur_hist = cur_hist / cur_hist.sum()
    return jensenshannon(ref_hist, cur_hist, base=2)

# CUSUM for sequential drift detection
def cusum(values, target_mean, threshold=5, slack=0.5):
    """Cumulative sum control chart for sequential data."""
    cusum_pos, cusum_neg = 0, 0
    for v in values:
        cusum_pos = max(0, cusum_pos + v - target_mean - slack)
        cusum_neg = max(0, cusum_neg + target_mean - v - slack)
        if cusum_pos > threshold or cusum_neg > threshold:
            return True, cusum_pos, cusum_neg
    return False, cusum_pos, cusum_neg

Drift Detection Methods Summary

Method | Feature Type | Threshold | Python Library
--- | --- | --- | ---
PSI (Population Stability Index) | Continuous, ordinal | <0.1 stable; 0.1–0.2 warning; >0.2 critical | Custom / evidently
KS (Kolmogorov-Smirnov) test | Continuous | p-value < 0.05 indicates drift; stat > 0.1 for severity | scipy.stats.ks_2samp
Chi-squared test | Categorical | p-value < 0.05 indicates drift | scipy.stats.chi2_contingency
Jensen-Shannon Divergence | Any (histogram) | >0.1 warning; >0.2 critical | scipy.spatial.distance.jensenshannon
CUSUM | Sequential time series | Threshold tuned per application (typically 4–5 sigma) | Custom / ruptures
Maximum Mean Discrepancy | High-dimensional vectors, embeddings | Bootstrap-based p-value | alibi-detect
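The first rows of this summary can be wired into a single per-feature check that dispatches on dtype. A sketch, assuming numeric features get the KS test and categorical features get chi-squared (the function name and return shape are my own, not from a library):

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def feature_drift(reference, current, alpha=0.05):
    """Dispatch a drift test by feature type: KS for numeric, chi-squared otherwise."""
    reference, current = np.asarray(reference), np.asarray(current)
    if np.issubdtype(reference.dtype, np.number):
        stat, p = ks_2samp(reference, current)
        method = "ks"
    else:
        cats = sorted(set(reference) | set(current))
        # Contingency table: rows = [reference, current], columns = categories
        table = np.array([[np.sum(reference == c) for c in cats],
                          [np.sum(current == c) for c in cats]])
        stat, p, _, _ = chi2_contingency(table)
        method = "chi2"
    return {"method": method, "statistic": stat,
            "p_value": p, "drift_detected": p < alpha}

rng = np.random.default_rng(0)
num_result = feature_drift(rng.normal(0, 1, 1000), rng.normal(1, 1, 1000))
cat_result = feature_drift(["a"] * 100 + ["b"] * 100,
                           ["a"] * 180 + ["b"] * 20)
# Both comparisons shift the distribution, so both should flag drift
```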
πŸ—οΈ Monitoring Infrastructure

What to Log

Comprehensive logging is the foundation of monitoring. You cannot detect what you don't observe.

  • Input features: all features sent to the model at serving time (for drift detection)
  • Predictions: raw scores/logits, predicted class, confidence
  • Latency: preprocessing time, inference time, total request time (P50, P95, P99)
  • Error rates: 4xx/5xx rates, preprocessing failures, schema validation failures
  • Ground truth: link predictions to delayed labels when available
  • Request metadata: timestamp, user segment, model version, server ID
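A minimal shape for one such log record, emitted as a JSON line per request (field names are illustrative; adapt them to your logging stack):

```python
import json
import time
import uuid

def log_prediction(features, score, predicted_class,
                   model_version, latency_ms, emit=print):
    """Build and emit one structured JSON log line for a prediction request."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,            # raw inputs, for drift analysis later
        "score": score,                  # raw probability or logit
        "predicted_class": predicted_class,
        "latency_ms": latency_ms,
    }
    emit(json.dumps(record))
    return record

rec = log_prediction({"amount": 42.0, "country": "DE"},
                     score=0.91, predicted_class="fraud",
                     model_version="v3.2.1", latency_ms=12.4)
```

Logging the full feature payload is what makes later PSI/KS comparisons against the training baseline possible; if payloads are sensitive, log per-batch feature statistics instead.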

Evidently AI

Open-source Python library for ML model monitoring and data quality evaluation. Produces interactive HTML reports and JSON metrics for integration into CI/CD and dashboards.

  • Data drift report: statistical tests on all features with visual distributions
  • Target drift: monitor prediction distribution and label drift
  • Data quality report: missing values, duplicates, schema violations
  • Works offline (batch comparison) and as a monitoring service
  • Integrates with MLflow, Grafana, and custom dashboards via JSON output

Grafana + Prometheus

The de facto standard for real-time infrastructure monitoring, extended for ML by exposing model-specific metrics via a Prometheus endpoint on the serving container.

  • Expose custom metrics: prediction score distribution, drift PSI, feature means
  • Grafana dashboards: time-series visualizations of all logged metrics
  • Alertmanager: route alerts to Slack, PagerDuty, email based on rules
  • Low overhead: Prometheus scrapes the model container's /metrics endpoint every 15s
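As a sketch, a Prometheus scrape job for such a container could look like this (job name, target host, and port are placeholders):

```yaml
scrape_configs:
  - job_name: "model-serving"
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["model-server:8080"]
```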

Commercial Platforms

Managed monitoring platforms reduce operational overhead at the cost of vendor lock-in and subscription fees.

  • WhyLabs: automatic statistical profiling; drift detection; integrates via whylogs SDK in one line
  • Arize AI: production observability; embedding drift; model performance tracing; strong explainability features
  • Fiddler AI: enterprise focus; model explainability + monitoring; great for regulated industries
  • AWS SageMaker Model Monitor: native to SageMaker; automatic baseline capture; scheduled monitoring jobs

Monitoring Tool Comparison

Tool | Open Source | Key Capability | Integration Complexity
--- | --- | --- | ---
Evidently AI | Yes (Apache 2.0) | Comprehensive drift + quality HTML reports; JSON metrics for CI/CD gates | Low: Python library, drop-in
Grafana + Prometheus | Yes | Real-time dashboards; flexible alerting; standard infrastructure monitoring extended for ML | Medium: requires instrumentation + dashboard setup
WhyLabs | whylogs SDK is open; platform is SaaS | Automatic statistical profiling with minimal code changes; privacy-preserving (sends stats, not data) | Low: single SDK call per batch
Arize AI | No (SaaS) | Production observability; embedding drift; model performance tracing | Medium: SDK integration + data schema configuration
Alibi Detect | Yes (Apache 2.0) | Advanced drift detectors (MMD, LSDD, classifier-based); works on tabular, text, images | Medium: requires detector fitting and integration
SageMaker Model Monitor | No (AWS managed) | Automated baseline capture; scheduled drift jobs; direct integration with SageMaker endpoints | Low for SageMaker users; high lock-in
🚨 Alerting & Response Playbook

Alert Tier Design

Not all drift is equally urgent. A tiered alerting strategy prevents alert fatigue while ensuring critical issues get immediate attention.

  • P1 (pager, immediate): serving error rate >1%, P99 latency >5× baseline, model outputs all the same class
  • P2 (Slack alert, within the hour): PSI >0.2 on key features, prediction distribution significantly shifted
  • P3 (dashboard / daily digest): gradual feature drift, slight prediction score distribution shift, PSI 0.1–0.2
  • Set all thresholds before deployment, not after the first incident
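These tiers can be encoded as a small routing function evaluated on every monitoring cycle. A sketch using the thresholds above (metric names and the single-class check are simplified placeholders):

```python
def classify_alert(metrics, baseline_p99_ms):
    """Map current monitoring metrics to an alert tier."""
    if (metrics["error_rate"] > 0.01
            or metrics["p99_latency_ms"] > 5 * baseline_p99_ms
            or metrics["single_class_fraction"] >= 1.0):
        return "P1"   # page immediately
    if metrics["max_feature_psi"] > 0.2:
        return "P2"   # Slack alert within the hour
    if metrics["max_feature_psi"] > 0.1:
        return "P3"   # dashboard / daily digest
    return None       # healthy

tier = classify_alert(
    {"error_rate": 0.002, "p99_latency_ms": 180,
     "single_class_fraction": 0.6, "max_feature_psi": 0.25},
    baseline_p99_ms=100,
)
# Healthy serving metrics but PSI 0.25 -> "P2"
```

Evaluating P1 conditions first ensures a serving outage is never downgraded to a drift warning.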

On-Call ML Runbook Template

Every production model should have a documented runbook that any engineer can follow during an incident.

  • Alert received: identify which alert, which model, and which metric triggered
  • Impact assessment: what business function is affected? How many users?
  • Rollback decision: if serving errors >2%, roll back immediately; investigate later
  • Root cause: check upstream data pipeline, feature drift reports, recent deployments
  • Resolution: data fix / retrain / rollback; document decision with evidence
  • Post-mortem: scheduled within 5 business days for all P1/P2 incidents

Automated Retraining Triggers

When drift is detected and confirmed, retraining should be as automated as possible to minimize the window of degraded model performance.

  • PSI >0.2 on 3+ features β†’ trigger data validation + retraining pipeline
  • Online accuracy drops below quality gate β†’ trigger retraining + evaluation
  • Scheduled weekly/monthly retraining regardless of drift metrics
  • Retraining requires review: verify new training data is of acceptable quality before replacing champion
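The first trigger can be sketched as a simple gate over per-feature PSI values, with the result feeding whatever orchestrator runs the retraining pipeline (function name and dict layout are illustrative):

```python
def should_retrain(feature_psis, psi_threshold=0.2, min_features=3):
    """Trigger retraining when enough features breach the PSI threshold."""
    drifted = sorted(name for name, value in feature_psis.items()
                     if value > psi_threshold)
    return len(drifted) >= min_features, drifted

trigger, drifted = should_retrain(
    {"amount": 0.31, "age": 0.27, "country": 0.22, "tenure": 0.05})
# Three features exceed 0.2 -> trigger is True
```

Returning the list of drifted features, not just a boolean, gives the review step the evidence it needs before replacing the champion model.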

Human-in-the-Loop for Concept Drift

Automated retraining handles covariate drift well. Concept drift, where the fundamental task has changed, requires human judgment.

  • Flag cases where model confidence is high but early labels disagree
  • Sample predictions for periodic human review queue
  • Involve domain experts when retraining on potentially concept-drifted data
  • Consider whether the model's objective function itself needs to change
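The first two bullets combine into a simple selection rule for the review queue. A sketch, assuming each logged record carries the model's confidence and, once available, an early label (field names are illustrative):

```python
import random

def build_review_queue(records, conf_threshold=0.9, sample_rate=0.05, seed=7):
    """
    Select records for human review:
    - always include high-confidence predictions contradicted by an early label
    - plus a small random sample of everything else as a routine spot check
    """
    rng = random.Random(seed)
    queue = []
    for r in records:
        disagrees = (r.get("early_label") is not None
                     and r["early_label"] != r["pred"])
        if r["confidence"] >= conf_threshold and disagrees:
            queue.append(r)   # strongest concept-drift signal
        elif rng.random() < sample_rate:
            queue.append(r)   # routine sample for periodic review
    return queue

records = [
    {"id": 1, "pred": 1, "confidence": 0.97, "early_label": 0},
    {"id": 2, "pred": 0, "confidence": 0.95, "early_label": 0},
    {"id": 3, "pred": 1, "confidence": 0.60, "early_label": None},
]
queue = build_review_queue(records)
# Record 1 (confident but contradicted) is always selected
```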

Instrument Monitoring Before You Deploy

The single most common MLOps mistake is treating monitoring as something to add "after we see a problem." By then, the problem has already affected users, silently, for days or weeks. Monitoring infrastructure (logging, dashboards, alert thresholds, runbook) should all be defined and tested before the first production deployment. If you can't monitor it, you're not ready to deploy it.

Post-Mortem Template

Section | Content
--- | ---
Incident summary | One-paragraph description: what failed, when, how long, business impact
Timeline | Chronological sequence: first alert, triage start, root cause identified, resolution applied, monitoring confirmed stable
Root cause | Technical description of the drift/failure mechanism; supporting evidence (charts, logs)
Impact | Number of affected users/requests; downstream business metric impact; SLA breach assessment
Detection gap | How long between drift onset and alert? Why wasn't it caught sooner?
Resolution actions | What was done: rollback / retrain / data fix / threshold adjustment
Prevention items | Action items with owners and due dates to prevent recurrence (monitoring improvements, pipeline changes, alerting threshold adjustments)