MLOps Series: A/B Testing & Canary
⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025
🌍 Why Test Models in Production

Staging Never Mirrors Production

No matter how carefully you construct a staging environment, it differs from production in ways that matter for model performance.

  • User demographics and behavior in staging are not representative
  • Traffic volume differences mean cache behavior, serving latency, and batching dynamics differ
  • Staging data may be anonymized or sampled in ways that change feature distributions
  • Real-time feature values (current prices, live inventory) only exist in production
  • Rare edge cases appear in production at much higher absolute volume

User Behavior Is the Ground Truth

The ultimate measure of a model's value is whether it improves user outcomes or business results. These can only be measured in production with real users making real decisions.

  • Users respond to model outputs in complex, hard-to-simulate ways
  • Recommendation model quality = did users click, engage, convert, return?
  • Search ranking quality = did users find what they were looking for quickly?
  • Only real user interaction data can measure these outcomes

The AUC-vs-Revenue Gap

Offline metrics (AUC, F1, NDCG) and online business metrics do not always move together. A model with a better offline score can perform worse in production because it optimizes a proxy for the objective you actually care about.

  • Higher AUC model may show less diverse recommendations → lower overall engagement
  • Better calibrated probabilities may reduce clickbait → fewer clicks but higher satisfaction
  • Models optimizing engagement can harm long-term retention
  • The only way to measure the true objective is production A/B testing

Iterating Safely

Production testing allows rapid iteration while bounding risk. Small traffic percentages limit the blast radius of any single experiment.

  • Test 5% of traffic → 95% of users are not affected if the model is bad
  • Collect real signal in days rather than waiting for offline proxy metrics
  • Fail fast: detect a bad model in hours, not weeks after full deployment
  • Compound experiments: run multiple model variants simultaneously across user segments

Offline Evaluation Is Necessary But Not Sufficient

Offline evaluation on a held-out test set is essential — it's your first filter for catching clearly inferior models before they reach users. But it is not sufficient on its own. The test set is a static sample of past behavior; it cannot capture how users will respond to the new model's outputs, how the model interacts with product changes made since training data was collected, or whether optimizing your proxy metric actually improves the business objective you care about. Always pair offline evaluation with online testing before full production rollout.

🔬 A/B Testing for ML Models

Experimental Design Basics

A rigorous A/B test requires careful experimental design before collecting a single data point.

  • Randomization unit: user-level (consistent experience) vs request-level (more samples); choose based on whether model affects session state
  • Treatment vs control: new model (treatment) vs current production model (control)
  • Primary metric: define ONE primary success metric before the test; secondary metrics are diagnostic only
  • Guard rails: metrics that must not degrade (latency SLA, error rate, engagement floor)
  • Test duration: at least one full week to capture day-of-week effects; ideally two weeks
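
User-level randomization is typically implemented with deterministic hashing rather than stored assignments. A minimal sketch (function and experiment names are illustrative); salting the hash with the experiment ID keeps assignments stable within an experiment but independent across concurrent experiments:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to treatment or control.

    Hashing (experiment_id, user_id) gives each user a stable bucket
    within one experiment, and independent buckets across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# A user sees the same variant on every request within an experiment
print(assign_variant("user-42", "ranker-v2-test"))
```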

Statistical Significance

Declare a winner only after achieving pre-specified statistical significance — not when the graph looks good or the numbers move in the right direction.

  • Significance level α: typically 0.05; reject H0 when p-value < α
  • Power (1-β): typically 80%; probability of detecting a real effect when it exists
  • t-test: use for approximately normal continuous metrics (revenue, session duration)
  • Mann-Whitney U: non-parametric alternative; robust to skewed distributions
  • Chi-squared: for binary outcomes (click/no-click, convert/no-convert)
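
For the binary case, the chi-squared test is a few lines with SciPy. The counts below are invented for illustration:

```python
from scipy.stats import chi2_contingency

# Rows: [converted, not converted] per group (illustrative numbers)
control   = [1200, 8800]   # 12.0% conversion
treatment = [1310, 8690]   # 13.1% conversion

chi2, p_value, dof, _ = chi2_contingency([control, treatment])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the conversion rates differ")
```

Note that `chi2_contingency` applies Yates continuity correction by default for 2×2 tables, which is slightly conservative.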

Sample Size Calculation

Determine minimum required sample size before starting the test, based on your MDE and baseline metrics.

from statsmodels.stats.power import TTestIndPower

# Parameters
alpha = 0.05       # significance level
power = 0.80       # desired power
baseline = 0.12    # current conversion rate
mde = 0.01         # minimum detectable effect (absolute)

# Standardized effect size: MDE divided by the Bernoulli standard deviation.
# For proportions, statsmodels' proportion_effectsize (Cohen's h) is the
# more standard choice; this approximation is close for small effects.
effect_size = mde / (baseline * (1 - baseline)) ** 0.5

analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,  # equal group sizes
    alternative='two-sided'
)
print(f"Required n per group: {int(n) + 1}")

Avoiding Common Pitfalls

  • Peeking: stopping early when the test "looks significant" inflates false positive rate; use sequential testing methods (SPRT, always-valid p-values) if early stopping is needed
  • Novelty effect: users interact differently with unfamiliar UI; run tests long enough for novelty to wear off (typically 1–2 weeks)
  • Survivorship bias: if users churn during the test, their absence biases results
  • Network effects: if users interact with each other, treatment and control are not independent
  • Multiple testing: testing 20 metrics with α=0.05 means ~1 false positive by chance; apply Bonferroni correction
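
Bonferroni is the simplest correction: divide α by the number of tests. A sketch with made-up p-values (statsmodels' `multipletests` offers less conservative alternatives such as Holm or Benjamini-Hochberg):

```python
# Hypothetical p-values from 6 secondary metrics in one experiment
p_values = [0.003, 0.012, 0.040, 0.210, 0.380, 0.049]
alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni-adjusted per-test level

significant = [p < threshold for p in p_values]
print(f"Per-test threshold: {threshold:.4f}")        # 0.0083
print(f"Pass after correction: {sum(significant)}")  # only the 0.003 result
```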

A/B Test Design by Metric Type

| Metric Type | Statistical Test | Minimum Runtime | Common Pitfalls |
| --- | --- | --- | --- |
| Conversion rate (binary) | Chi-squared; z-test for proportions | 2 weeks (capture weekly cycles) | Novelty effect; peeking; low baseline rates requiring large samples |
| Revenue (skewed continuous) | Mann-Whitney U; bootstrap; log-transform then t-test | 2–4 weeks (high variance) | High variance from large transactions; outlier sensitivity |
| Session duration (continuous) | t-test (approximately normal for N > 30); Mann-Whitney for small samples | 1–2 weeks | Zero-inflated (no session = 0); censoring for long sessions |
| Ranking quality (NDCG) | Paired t-test per query; Wilcoxon signed-rank | 1 week (high-volume search) | Query selection bias; position bias in click data |
| Latency (P99) | Quantile comparison; Mann-Whitney | Several days (load patterns) | Traffic distribution changes; hardware variation |

🐤 Canary Deployments

Progressive Traffic Routing

A canary deployment sends a small initial slice of traffic to the new model, then progressively increases as confidence builds.

  • Stage 1 (1–5%): initial canary; watch for serving errors and latency regression
  • Stage 2 (10–20%): expand if stage 1 is healthy; monitor quality metrics
  • Stage 3 (50%): if metrics are positive at stage 2; collect statistical power for A/B conclusion
  • Stage 4 (100%): full promotion after statistical significance confirmed
  • Automated rollback at any stage if guard rail metrics breach thresholds
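
The stage-gate decision can be reduced to a pure function over current metrics and guard-rail thresholds. A sketch; the metric names and limits are placeholders for whatever your monitoring stack exposes:

```python
STAGES = [5, 20, 50, 100]  # canary traffic percentages, per the stages above

def evaluate_canary(metrics: dict, guardrails: dict) -> str:
    """Return 'rollback', 'hold', or 'promote' for the current stage."""
    # Any guard-rail breach triggers immediate rollback
    if metrics["error_rate"] > guardrails["max_error_rate"]:
        return "rollback"
    if metrics["p99_latency_ms"] > guardrails["max_p99_latency_ms"]:
        return "rollback"
    # Hold until the stage has seen enough traffic to judge quality
    if metrics["request_count"] < guardrails["min_requests_per_stage"]:
        return "hold"
    return "promote"

guardrails = {"max_error_rate": 0.01, "max_p99_latency_ms": 250,
              "min_requests_per_stage": 10_000}
healthy  = {"error_rate": 0.002, "p99_latency_ms": 180, "request_count": 50_000}
breached = {"error_rate": 0.035, "p99_latency_ms": 180, "request_count": 2_000}
print(evaluate_canary(healthy, guardrails))   # promote
print(evaluate_canary(breached, guardrails))  # rollback
```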

Traffic Splitting Infrastructure

  • Istio (Kubernetes): VirtualService + DestinationRule define traffic weights between service versions; supports header-based routing for internal testing
  • AWS CodeDeploy: native canary and linear traffic shifting for Lambda and ECS; built-in CloudWatch alarm integration for automated rollback
  • Kubernetes ingress (NGINX/Traefik): weight-based traffic splitting via ingress annotations
  • Envoy proxy: fine-grained traffic management; used by most service meshes under the hood

Kubernetes Canary Ingress YAML

# Stable model deployment (serving ~90% of traffic via replica ratio)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: ml-model
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: ml-model
      version: stable
  template:
    metadata:
      labels:
        app: ml-model
        version: stable
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.4.2
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }

---
# Canary model deployment (receiving ~10% of traffic via replica ratio)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: ml-model
    version: canary
spec:
  replicas: 1  # 1 canary pod vs 9 stable pods = ~10% of traffic
  selector:
    matchLabels:
      app: ml-model
      version: canary
  template:
    metadata:
      labels:
        app: ml-model
        version: canary
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.5.0-rc1
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }

---
# Service routes to ALL pods (stable + canary)
# Kubernetes naturally load-balances across all matching pods
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model  # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080

---
# Istio-based explicit traffic weight control (alternative to replica-ratio)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-model-vs
spec:
  hosts:
    - ml-model-service
  http:
    - route:
        - destination:
            host: ml-model-service
            subset: stable
          weight: 95
        - destination:
            host: ml-model-service
            subset: canary
          weight: 5  # 5% to canary

---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-model-dr
spec:
  host: ml-model-service
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
👥 Shadow Mode Testing

How Shadow Mode Works

The new model receives a copy of every production request, generates predictions, and logs them — but those predictions are never returned to users. The existing model continues serving all responses normally.

  • Production proxy duplicates each request to both current model and shadow model
  • Only the current model's response is returned to the user
  • Shadow model predictions are logged alongside current model predictions
  • Both sets of predictions are compared offline to detect divergence, errors, and distribution shifts

When to Use Shadow Mode

  • High-stakes domains: healthcare, legal, financial decisions where a wrong prediction has severe consequences
  • Regulated industries: models requiring audit trails and staged validation before any user impact
  • Major architecture changes: switching model families (tree → neural network) where behavior may differ dramatically
  • Latency validation: proving new model meets SLA under real production load before it affects users
  • Cold start: validating a first-ever deployment of a model in a new serving stack

Limitations of Shadow Mode

  • No real feedback loop: predictions not shown to users → cannot measure clicks, conversions, or engagement outcomes
  • Double infrastructure cost: both models must handle full production request volume
  • Latency validation only under observed load: cannot simulate traffic spikes beyond current volume
  • No novelty effect data: cannot observe how users respond to the new model's specific outputs
  • State divergence: models that update state (recommendation session context) will diverge over time in shadow mode

Shadow Mode vs Canary vs A/B Test

| Dimension | Shadow Mode | Canary Deployment | A/B Test |
| --- | --- | --- | --- |
| Risk level | Zero — no user impact | Low — small % of users affected | Medium — 50% of users in treatment |
| Infrastructure cost | High — 2× compute for full traffic | Low-Medium — small canary fleet | Medium — equal-split infrastructure |
| Feedback speed | None from users (offline comparison only) | Fast — real metric signal within hours to days | Slower — need statistical power (days to weeks) |
| Business metric measurement | Not possible | Possible but limited (small traffic %) | Primary purpose — statistically rigorous |
| Best use case | High-stakes validation; regulated industries; first deployments | Gradual rollout with automated rollback | Definitive comparison of two model versions |
| Rollback ease | N/A — shadow never serves | Instant — remove canary route | Requires stopping experiment and re-routing |

🎰 Multi-Armed Bandits as Alternatives

Traditional A/B testing is inherently wasteful: during the experiment, you are knowingly routing traffic to an inferior model variant. Multi-armed bandits adaptively route more traffic to better-performing variants, balancing exploration (learning about models) and exploitation (maximizing outcomes).

Epsilon-Greedy

The simplest bandit algorithm. With probability ε, route to a random variant (explore). With probability 1-ε, route to the current best variant (exploit).

  • Simple to implement and understand
  • ε is typically 0.05–0.1 in production
  • ε-decay: reduce ε over time as confidence grows
  • Limitation: treats all non-best variants equally; wastes exploration on clearly bad arms
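
A minimal epsilon-greedy router for binary rewards (convert / no convert); the variant names and simulated conversion rates are illustrative:

```python
import random

class EpsilonGreedy:
    """Route to the best-so-far variant, exploring with probability epsilon."""

    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.successes = {a: 0 for a in arms}

    def _mean(self, arm):
        return self.successes[arm] / self.counts[arm] if self.counts[arm] else 0.0

    def select(self):
        if random.random() < self.epsilon:          # explore
            return random.choice(list(self.counts))
        return max(self.counts, key=self._mean)     # exploit current best

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.successes[arm] += reward

# Simulated traffic: model_b truly converts better than model_a
random.seed(0)
rates = {"model_a": 0.10, "model_b": 0.14}
bandit = EpsilonGreedy(list(rates), epsilon=0.1)
for _ in range(5000):
    arm = bandit.select()
    bandit.update(arm, int(random.random() < rates[arm]))
print(bandit.counts)  # traffic should concentrate on model_b over time
```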

Thompson Sampling

Maintain a Beta distribution posterior over each variant's true conversion rate. Sample from each posterior; route to the variant with the highest sample. Bayesian and naturally adaptive.

  • Automatically concentrates exploration on promising variants
  • Updates posteriors after each observation (fully online)
  • Mathematically optimal for Bernoulli rewards under certain conditions
  • Extends to continuous rewards via normal or log-normal priors
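
A Beta-Bernoulli Thompson sampler fits in a few lines; `random.betavariate` from the standard library is all it needs:

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over model variants."""

    def __init__(self, arms):
        self.alpha = {a: 1 for a in arms}  # Beta(1, 1) uniform prior
        self.beta = {a: 1 for a in arms}

    def select(self):
        # Draw a plausible conversion rate from each posterior and
        # route to the arm whose draw is highest
        draws = {a: random.betavariate(self.alpha[a], self.beta[a])
                 for a in self.alpha}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Conjugate update: a success bumps alpha, a failure bumps beta
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

sampler = ThompsonSampler(["model_a", "model_b"])
arm = sampler.select()
sampler.update(arm, reward=1)  # e.g. the routed request converted
```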

Upper Confidence Bound (UCB)

Select the variant with the highest upper confidence bound on its estimated reward. Exploration is driven by uncertainty — variants with less data get a confidence bonus.

  • UCB1: add bonus term √(2 ln N / n_i) to each variant's mean reward
  • Deterministic (unlike Thompson sampling) — useful for debugging and auditing
  • Strong theoretical guarantees on cumulative regret
  • LinUCB: contextual bandit variant — uses request features to personalize arm selection
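
A sketch of UCB1 as described above: each arm's mean reward plus the √(2 ln N / n_i) bonus, pulling every arm once before the formula applies:

```python
import math

class UCB1:
    """Pick the arm with the highest optimistic (mean + bonus) estimate."""

    def __init__(self, arms):
        self.counts = {a: 0 for a in arms}
        self.rewards = {a: 0.0 for a in arms}
        self.total = 0  # N: total pulls across all arms

    def select(self):
        # Pull each arm once so the bonus term is well-defined
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        def ucb(arm):
            mean = self.rewards[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[arm])
            return mean + bonus
        return max(self.counts, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward
        self.total += 1
```

Because selection is deterministic given the reward history, the same logs always reproduce the same routing decisions, which is what makes UCB1 convenient to debug and audit.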

Contextual Bandits

Standard bandits ignore request context — they give all users the same traffic allocation. Contextual bandits use features of the request (user segment, device, time of day) to route to the best arm for that specific context.

  • Different model versions may be better for different user segments
  • LinUCB and Thompson Sampling both have contextual extensions
  • Vowpal Wabbit: production-grade contextual bandit library used at major tech companies
  • Higher implementation complexity but dramatically more efficient
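
A disjoint LinUCB sketch with NumPy: one ridge-regression reward model per arm, with an uncertainty bonus driving exploration. The 2-feature context and arm names are illustrative:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: a ridge-regression reward model per arm."""

    def __init__(self, arms, n_features, alpha=1.0):
        self.alpha = alpha                                # bonus width
        self.A = {a: np.eye(n_features) for a in arms}    # I + sum of x x^T
        self.b = {a: np.zeros(n_features) for a in arms}  # sum of reward * x

    def select(self, x):
        scores = {}
        for arm in self.A:
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]                   # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # uncertainty
            scores[arm] = theta @ x + bonus
        return max(scores, key=scores.get)

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Context features might encode user segment, device, hour of day
policy = LinUCB(["model_a", "model_b"], n_features=2, alpha=0.5)
x = np.array([1.0, 0.0])          # e.g. a mobile user
arm = policy.select(x)
policy.update(arm, x, reward=1.0)
```

A production version would avoid re-inverting A on every request (rank-one updates via Sherman-Morrison), but the structure is the same.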

A/B Test vs Multi-Armed Bandit

| Dimension | A/B Test | Multi-Armed Bandit |
| --- | --- | --- |
| Exploration strategy | Fixed split (50/50); no adaptation during experiment | Adaptive; concentrates traffic on better-performing variants |
| Time to convergence | Fixed by pre-calculated sample size; typically 1–4 weeks | Faster practical convergence; auto-reduces exploration of bad arms |
| Opportunity cost | High — 50% on inferior arm throughout the experiment | Low — quickly shifts most traffic to better arm |
| Statistical guarantees | Classic frequentist guarantees; well-understood | Regret bounds; Bayesian posterior guarantees; less familiar to stakeholders |
| Implementation complexity | Low — random user assignment, standard tests | Higher — requires online reward collection, posterior updates, routing logic |
| Best for | Definitive, clean causal inference; regulatory reporting; infrequent decisions | Continuous optimization; many variants; high-velocity decisions; personalization |
| Explainability | Easy — fixed groups, clear comparison | Harder — traffic allocation changes over time; requires careful logging |

When to Choose Bandits Over Fixed A/B

Prefer multi-armed bandits when: (1) you are testing many variants simultaneously (3+) and can't afford to waste traffic on clearly bad ones, (2) the business cost of showing an inferior model is high (e.g., revenue-per-request is large), (3) you need continuous real-time optimization rather than a one-time decision, or (4) the reward signal is immediate (clicks, immediate purchases). Prefer fixed A/B tests when: you need clean causal inference for a regulatory filing, you have exactly two variants, or your stakeholders need to interpret and audit the experiment methodology.