⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025
📉 Why Models Degrade

A model trained on yesterday's data is increasingly wrong about today's world. Unlike traditional software, which does exactly what it's programmed to do until someone changes it, ML models silently degrade as the world they were trained on diverges from the world they're predicting in.

The World Changes

The underlying patterns that make a model accurate are not permanent. User behavior, economic conditions, language usage, security threats, and sensor calibration all change continuously.

  • COVID-19: shattered recommendation systems trained on pre-pandemic behavior (travel, restaurants, entertainment)
  • Fraud models: attackers actively learn to evade detection, causing rapid concept drift
  • Language models: new slang, product names, and events appear faster than retraining cycles
  • Industrial sensors: calibration drift changes the input distribution without any semantic change

Data Drift (Covariate Shift)

The input distribution P(X) changes between training and serving time. The model's learned decision boundaries, calibrated on old data, become misaligned with the new distribution.

  • New user demographics reach the product after launch in a new market
  • Seasonal effects: summer vs winter behavior patterns
  • Upstream pipeline changes silently alter a feature's range or meaning
  • Sensor hardware upgrade changes baseline readings without code change

Concept Drift

The underlying relationship between inputs and outputs changes: P(Y|X) shifts. The features that once predicted the label no longer do so with the same reliability.

  • Customer churn: the signals that predicted churn 2 years ago have changed
  • Credit risk: economic downturns change which features predict default
  • Security: threat actors change tactics specifically to evade detection models
  • Much harder to detect than data drift; detection requires ground-truth labels

Upstream Pipeline Changes

Models fail silently when the data infrastructure around them changes without corresponding model updates.

  • Schema changes: a column is renamed, removed, or its units change
  • Business logic changes: definition of "active user" or "completed purchase" evolves
  • Data source migration: switch from one database to another with subtle differences
  • ETL bugs: incorrect join logic silently corrupts feature values

🌊 Types of Drift

Drift Type | Definition | Detection Method | Example
--- | --- | --- | ---
Covariate / Data Drift | P(X) changes; the input feature distribution shifts | Statistical tests on feature distributions vs training baseline (PSI, KS, chi-squared) | Age distribution of users shifts from 25–35 to 35–55 as the product matures
Prior Probability Drift | P(Y) changes; the label distribution shifts even if inputs stay the same | Monitor prediction distribution vs training label distribution | Fraud rate doubles during the holiday season; the model predicts too few fraudulent transactions
Concept Drift | P(Y\|X) changes; the relationship between inputs and labels changes | Requires ground-truth labels; track accuracy/AUC over time; error rate analysis | Fraud attack vector changes; previously safe transaction patterns are now fraudulent
Label Drift | Labeling criteria or annotation guidelines change over time | Compare annotation consistency across time; inter-annotator agreement | Support ticket "severity" definition changed by the operations team; the model was trained on the old criteria
Feature Drift | Individual feature statistics change (mean, std, range, cardinality) | Per-feature statistical monitoring; alert on mean shift or new categorical values | Transaction amount feature mean doubles after a new premium product launch
Prediction Drift | The model's output distribution changes; predictions cluster differently | Monitor distribution of predicted probabilities / class assignments over time | Recommendation model starts recommending only a narrow set of items

Concept Drift Is the Hardest to Detect

Covariate drift can be detected immediately from input feature distributions alone. Concept drift requires ground-truth labels to be observed, and in many systems true labels arrive days, weeks, or months after the prediction (loan default: 90 days; cancer recurrence: months to years). This feedback lag means concept drift often goes undetected until model performance has already degraded significantly. Designing short feedback loops (proxy labels, human review queues, rapid A/B testing) is essential for systems where concept drift is expected.
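One common way to shorten that loop is to join logged predictions against ground-truth labels as they trickle in, and track accuracy over only the matched subset. A minimal sketch, assuming predictions are keyed by a request ID (the record layout here is illustrative, not from any particular system):

```python
from datetime import datetime

# Hypothetical logged predictions, keyed by request ID
predictions = {
    "r1": {"pred": 1, "ts": datetime(2025, 1, 1)},
    "r2": {"pred": 0, "ts": datetime(2025, 1, 1)},
    "r3": {"pred": 1, "ts": datetime(2025, 1, 2)},
}

# Ground-truth labels that arrived later (e.g. chargebacks, defaults)
labels = {"r1": 1, "r2": 1}  # r3's true label has not arrived yet

def lagged_accuracy(predictions, labels):
    """Accuracy over only those predictions whose labels have arrived."""
    matched = [(p["pred"], labels[rid])
               for rid, p in predictions.items() if rid in labels]
    if not matched:
        return None, 0
    correct = sum(pred == label for pred, label in matched)
    return correct / len(matched), len(matched)

acc, n_matched = lagged_accuracy(predictions, labels)
# acc == 0.5 over n_matched == 2 labeled predictions
```

The coverage count matters as much as the accuracy: a high lagged accuracy computed over a small fraction of predictions says little about current performance.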

πŸ“ Statistical Drift Detection Methods

Population Stability Index (PSI)

PSI is the most widely used drift metric in financial services. It measures the shift between a reference distribution (training) and current distribution.

import numpy as np

def psi(expected, actual, bins=10):
    """
    Calculate PSI between expected (training)
    and actual (serving) distributions.
    """
    breakpoints = np.percentile(expected,
        np.linspace(0, 100, bins + 1))
    breakpoints = np.unique(breakpoints)  # drop duplicate bin edges

    # Clip serving values into the training range so new extremes
    # are counted in the edge bins instead of being silently dropped
    actual = np.clip(actual, breakpoints[0], breakpoints[-1])

    exp_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual,   breakpoints)[0] / len(actual)

    # Avoid log(0) with small epsilon
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)

    return np.sum(
        (act_pct - exp_pct) * np.log(act_pct / exp_pct)
    )

Interpretation: PSI < 0.1: no drift; 0.1–0.2: moderate drift; >0.2: significant drift; investigate.

Kolmogorov-Smirnov (KS) Test

The KS test measures the maximum absolute difference between two empirical cumulative distribution functions. Ideal for continuous numerical features.

from scipy.stats import ks_2samp
import numpy as np

def ks_drift_test(reference, current, alpha=0.05):
    """
    Two-sample KS test for distribution drift.
    Returns: statistic, p-value, drift_detected
    """
    stat, p_value = ks_2samp(reference, current)
    drift = p_value < alpha
    return {
        "statistic": stat,
        "p_value": p_value,
        "drift_detected": drift,
        "severity": "high" if stat > 0.2
                    else "medium" if stat > 0.1
                    else "low"
    }
  • Non-parametric β€” no normality assumption
  • Sensitive to differences in the tails of distributions
  • Requires sufficient sample size (N > 30 per group)

Chi-Squared Test

For categorical features, chi-squared tests whether the observed category frequencies deviate significantly from the expected (training) frequencies.

from scipy.stats import chi2_contingency
import numpy as np

def chi2_drift_test(ref_counts, cur_counts, alpha=0.05):
    """
    Chi-squared test for categorical drift.
    ref_counts, cur_counts: dict of {category: count}
    """
    categories = sorted(set(ref_counts) | set(cur_counts))
    ref = np.array([ref_counts.get(c, 0) for c in categories])
    cur = np.array([cur_counts.get(c, 0) for c in categories])

    # Contingency table: rows = [reference, current]
    table = np.array([ref, cur])
    chi2, p, dof, expected = chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p,
        "drift_detected": p < alpha
    }

Jensen-Shannon Divergence & CUSUM

JSD is a symmetric alternative to KL divergence. Note that scipy's jensenshannon returns the Jensen-Shannon distance (the square root of the divergence); with base=2 it is bounded in [0, 1], which makes it easier to interpret as a drift score than raw KL.

from scipy.spatial.distance import jensenshannon
import numpy as np

def js_drift(ref_hist, cur_hist):
    """
    Jensen-Shannon distance between histograms.
    With base=2, returns a value in [0, 1]; 0 = identical distributions.
    """
    ref_hist = ref_hist / ref_hist.sum()
    cur_hist = cur_hist / cur_hist.sum()
    return jensenshannon(ref_hist, cur_hist, base=2)

# CUSUM for sequential drift detection
def cusum(values, target_mean, threshold=5, slack=0.5):
    """Cumulative sum control chart for sequential data."""
    cusum_pos, cusum_neg = 0, 0
    for v in values:
        cusum_pos = max(0, cusum_pos + v - target_mean - slack)
        cusum_neg = max(0, cusum_neg + target_mean - v - slack)
        if cusum_pos > threshold or cusum_neg > threshold:
            return True, cusum_pos, cusum_neg
    return False, cusum_pos, cusum_neg

Drift Detection Methods Summary

Method | Feature Type | Threshold | Python Library
--- | --- | --- | ---
PSI (Population Stability Index) | Continuous, ordinal | <0.1 stable; 0.1–0.2 warning; >0.2 critical | Custom / evidently
KS (Kolmogorov-Smirnov) test | Continuous | p-value < 0.05 indicates drift; stat > 0.1 for severity | scipy.stats.ks_2samp
Chi-squared test | Categorical | p-value < 0.05 indicates drift | scipy.stats.chi2_contingency
Jensen-Shannon Divergence | Any (histogram) | >0.1 warning; >0.2 critical | scipy.spatial.distance.jensenshannon
CUSUM | Sequential time series | Threshold tuned per application (typically 4–5 sigma) | Custom / ruptures
Maximum Mean Discrepancy | High-dimensional vectors, embeddings | Bootstrap-based p-value | alibi-detect
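The first rows of this summary can be wired into a single per-feature check that dispatches on dtype. A sketch, assuming numeric features get the KS test and categorical features get chi-squared (the function name and return shape are my own, not from a library):

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def feature_drift(reference, current, alpha=0.05):
    """Dispatch a drift test by feature type: KS for numeric, chi-squared otherwise."""
    reference, current = np.asarray(reference), np.asarray(current)
    if np.issubdtype(reference.dtype, np.number):
        stat, p = ks_2samp(reference, current)
        method = "ks"
    else:
        cats = sorted(set(reference) | set(current))
        # Contingency table: rows = [reference, current], columns = categories
        table = np.array([[np.sum(reference == c) for c in cats],
                          [np.sum(current == c) for c in cats]])
        stat, p, _, _ = chi2_contingency(table)
        method = "chi2"
    return {"method": method, "statistic": stat,
            "p_value": p, "drift_detected": p < alpha}

rng = np.random.default_rng(0)
num_result = feature_drift(rng.normal(0, 1, 1000), rng.normal(1, 1, 1000))
cat_result = feature_drift(["a"] * 100 + ["b"] * 100,
                           ["a"] * 180 + ["b"] * 20)
# Both comparisons shift the distribution, so both should flag drift
```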
πŸ—οΈ Monitoring Infrastructure

What to Log

Comprehensive logging is the foundation of monitoring. You cannot detect what you don't observe.

  • Input features: all features sent to the model at serving time (for drift detection)
  • Predictions: raw scores/logits, predicted class, confidence
  • Latency: preprocessing time, inference time, total request time (P50, P95, P99)
  • Error rates: 4xx/5xx rates, preprocessing failures, schema validation failures
  • Ground truth: link predictions to delayed labels when available
  • Request metadata: timestamp, user segment, model version, server ID
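A minimal shape for one such log record, emitted as a JSON line per request (field names are illustrative; adapt them to your logging stack):

```python
import json
import time
import uuid

def log_prediction(features, score, predicted_class,
                   model_version, latency_ms, emit=print):
    """Build and emit one structured JSON log line for a prediction request."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,            # raw inputs, for drift analysis later
        "score": score,                  # raw probability or logit
        "predicted_class": predicted_class,
        "latency_ms": latency_ms,
    }
    emit(json.dumps(record))
    return record

rec = log_prediction({"amount": 42.0, "country": "DE"},
                     score=0.91, predicted_class="fraud",
                     model_version="v3.2.1", latency_ms=12.4)
```

Logging the full feature payload is what makes later PSI/KS comparisons against the training baseline possible; if payloads are sensitive, log per-batch feature statistics instead.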

Evidently AI

Open-source Python library for ML model monitoring and data quality evaluation. Produces interactive HTML reports and JSON metrics for integration into CI/CD and dashboards.

  • Data drift report: statistical tests on all features with visual distributions
  • Target drift: monitor prediction distribution and label drift
  • Data quality report: missing values, duplicates, schema violations
  • Works offline (batch comparison) and as a monitoring service
  • Integrates with MLflow, Grafana, and custom dashboards via JSON output

Grafana + Prometheus

The de facto standard for real-time infrastructure monitoring, extended for ML by exposing model-specific metrics via a Prometheus endpoint on the serving container.

  • Expose custom metrics: prediction score distribution, drift PSI, feature means
  • Grafana dashboards: time-series visualizations of all logged metrics
  • Alertmanager: route alerts to Slack, PagerDuty, email based on rules
  • Low overhead: Prometheus scrapes the model container's /metrics endpoint every 15s
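As a sketch, a Prometheus scrape job for such a container could look like this (job name, target host, and port are placeholders):

```yaml
scrape_configs:
  - job_name: "model-serving"
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["model-server:8080"]
```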

Commercial Platforms

Managed monitoring platforms reduce operational overhead at the cost of vendor lock-in and subscription fees.

  • WhyLabs: automatic statistical profiling; drift detection; integrates via whylogs SDK in one line
  • Arize AI: production observability; embedding drift; model performance tracing; strong explainability features
  • Fiddler AI: enterprise focus; model explainability + monitoring; great for regulated industries
  • AWS SageMaker Model Monitor: native to SageMaker; automatic baseline capture; scheduled monitoring jobs

Monitoring Tool Comparison

Tool | Open Source | Key Capability | Integration Complexity
--- | --- | --- | ---
Evidently AI | Yes (Apache 2.0) | Comprehensive drift + quality HTML reports; JSON metrics for CI/CD gates | Low: Python library, drop-in
Grafana + Prometheus | Yes | Real-time dashboards; flexible alerting; standard infrastructure monitoring extended for ML | Medium: requires instrumentation + dashboard setup
WhyLabs | whylogs SDK is open; platform is SaaS | Automatic statistical profiling with minimal code changes; privacy-preserving (sends stats, not data) | Low: single SDK call per batch
Arize AI | No (SaaS) | Production observability; embedding drift; model performance tracing | Medium: SDK integration + data schema configuration
Alibi Detect | Yes (Apache 2.0) | Advanced drift detectors (MMD, LSDD, classifier-based); works on tabular, text, images | Medium: requires detector fitting and integration
SageMaker Model Monitor | No (AWS managed) | Automated baseline capture; scheduled drift jobs; direct integration with SageMaker endpoints | Low for SageMaker users; high lock-in
🚨 Alerting & Response Playbook

Alert Tier Design

Not all drift is equally urgent. A tiered alerting strategy prevents alert fatigue while ensuring critical issues get immediate attention.

  • P1 (pager, immediate): serving error rate >1%, P99 latency >5× baseline, model outputs all the same class
  • P2 (Slack alert, within the hour): PSI >0.2 on key features, prediction distribution significantly shifted
  • P3 (dashboard / daily digest): gradual feature drift, slight prediction score distribution shift, PSI 0.1–0.2
  • Set all thresholds before deployment, not after the first incident
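These tiers can be encoded as a small routing function evaluated on every monitoring cycle. A sketch using the thresholds above (metric names and the single-class check are simplified placeholders):

```python
def classify_alert(metrics, baseline_p99_ms):
    """Map current monitoring metrics to an alert tier."""
    if (metrics["error_rate"] > 0.01
            or metrics["p99_latency_ms"] > 5 * baseline_p99_ms
            or metrics["single_class_fraction"] >= 1.0):
        return "P1"   # page immediately
    if metrics["max_feature_psi"] > 0.2:
        return "P2"   # Slack alert within the hour
    if metrics["max_feature_psi"] > 0.1:
        return "P3"   # dashboard / daily digest
    return None       # healthy

tier = classify_alert(
    {"error_rate": 0.002, "p99_latency_ms": 180,
     "single_class_fraction": 0.6, "max_feature_psi": 0.25},
    baseline_p99_ms=100,
)
# Healthy serving metrics but PSI 0.25 -> "P2"
```

Evaluating P1 conditions first ensures a serving outage is never downgraded to a drift warning.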

On-Call ML Runbook Template

Every production model should have a documented runbook that any engineer can follow during an incident.

  • Alert received: identify which alert, which model, and which metric triggered
  • Impact assessment: what business function is affected? How many users?
  • Rollback decision: if serving errors >2%, roll back immediately; investigate later
  • Root cause: check upstream data pipeline, feature drift reports, recent deployments
  • Resolution: data fix / retrain / rollback; document decision with evidence
  • Post-mortem: scheduled within 5 business days for all P1/P2 incidents

Automated Retraining Triggers

When drift is detected and confirmed, retraining should be as automated as possible to minimize the window of degraded model performance.

  • PSI >0.2 on 3+ features β†’ trigger data validation + retraining pipeline
  • Online accuracy drops below quality gate β†’ trigger retraining + evaluation
  • Scheduled weekly/monthly retraining regardless of drift metrics
  • Retraining requires review: verify new training data is of acceptable quality before replacing champion
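The first trigger can be sketched as a simple gate over per-feature PSI values, with the result feeding whatever orchestrator runs the retraining pipeline (function name and dict layout are illustrative):

```python
def should_retrain(feature_psis, psi_threshold=0.2, min_features=3):
    """Trigger retraining when enough features breach the PSI threshold."""
    drifted = sorted(name for name, value in feature_psis.items()
                     if value > psi_threshold)
    return len(drifted) >= min_features, drifted

trigger, drifted = should_retrain(
    {"amount": 0.31, "age": 0.27, "country": 0.22, "tenure": 0.05})
# Three features exceed 0.2 -> trigger is True
```

Returning the list of drifted features, not just a boolean, gives the review step the evidence it needs before replacing the champion model.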

Human-in-the-Loop for Concept Drift

Automated retraining handles covariate drift well. Concept drift, where the fundamental task has changed, requires human judgment.

  • Flag cases where model confidence is high but early labels disagree
  • Sample predictions for periodic human review queue
  • Involve domain experts when retraining on potentially concept-drifted data
  • Consider whether the model's objective function itself needs to change
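The first two bullets combine into a simple selection rule for the review queue. A sketch, assuming each logged record carries the model's confidence and, once available, an early label (field names are illustrative):

```python
import random

def build_review_queue(records, conf_threshold=0.9, sample_rate=0.05, seed=7):
    """
    Select records for human review:
    - always include high-confidence predictions contradicted by an early label
    - plus a small random sample of everything else as a routine spot check
    """
    rng = random.Random(seed)
    queue = []
    for r in records:
        disagrees = (r.get("early_label") is not None
                     and r["early_label"] != r["pred"])
        if r["confidence"] >= conf_threshold and disagrees:
            queue.append(r)   # strongest concept-drift signal
        elif rng.random() < sample_rate:
            queue.append(r)   # routine sample for periodic review
    return queue

records = [
    {"id": 1, "pred": 1, "confidence": 0.97, "early_label": 0},
    {"id": 2, "pred": 0, "confidence": 0.95, "early_label": 0},
    {"id": 3, "pred": 1, "confidence": 0.60, "early_label": None},
]
queue = build_review_queue(records)
# Record 1 (confident but contradicted) is always selected
```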

Instrument Monitoring Before You Deploy

The single most common MLOps mistake is treating monitoring as something to add "after we see a problem." By then, the problem has already affected users, silently, for days or weeks. Monitoring infrastructure (logging, dashboards, alert thresholds, runbook) should all be defined and tested before the first production deployment. If you can't monitor it, you're not ready to deploy it.

Post-Mortem Template

Section | Content
--- | ---
Incident summary | One-paragraph description: what failed, when, how long, business impact
Timeline | Chronological sequence: first alert, triage start, root cause identified, resolution applied, monitoring confirmed stable
Root cause | Technical description of the drift/failure mechanism; supporting evidence (charts, logs)
Impact | Number of affected users/requests; downstream business metric impact; SLA breach assessment
Detection gap | How long between drift onset and alert? Why wasn't it caught sooner?
Resolution actions | What was done: rollback / retrain / data fix / threshold adjustment
Prevention items | Action items with owners and due dates to prevent recurrence (monitoring improvements, pipeline changes, alerting threshold adjustments)