Staging Never Mirrors Production
No matter how carefully you construct a staging environment, it differs from production in ways that matter for model performance.
- User demographics and behavior in staging are not representative
- Traffic volume differences mean cache behavior, serving latency, and batching dynamics differ
- Staging data may be anonymized or sampled in ways that change feature distributions
- Real-time feature values (current prices, live inventory) only exist in production
- Rare edge cases appear in production at much higher absolute volume
User Behavior Is the Ground Truth
The ultimate measure of a model's value is whether it improves user outcomes or business results. These can only be measured in production with real users making real decisions.
- Users respond to model outputs in complex, hard-to-simulate ways
- Recommendation model quality = did users click, engage, convert, return?
- Search ranking quality = did users find what they were looking for quickly?
- Only real user interaction data can measure these outcomes
The AUC-vs-Revenue Gap
Offline metrics (AUC, F1, NDCG) and online business metrics often diverge. A model with better AUC can perform worse in production because it optimizes a proxy objective rather than the business outcome you actually care about.
- Higher AUC model may show less diverse recommendations → lower overall engagement
- Better calibrated probabilities may reduce clickbait → fewer clicks but higher satisfaction
- Models optimizing engagement can harm long-term retention
- The only way to measure the true objective is production A/B testing
Iterating Safely
Production testing allows rapid iteration while bounding risk. Small traffic percentages limit the blast radius of any single experiment.
- Test 5% of traffic → 95% of users are not affected if the model is bad
- Collect real signal in days rather than waiting for offline proxy metrics
- Fail fast: detect a bad model in hours, not weeks after full deployment
- Compound experiments: run multiple model variants simultaneously across user segments
Offline Evaluation Is Necessary But Not Sufficient
Offline evaluation on a held-out test set is essential — it's your first filter for catching clearly inferior models before they reach users. But it is not sufficient on its own. The test set is a static sample of past behavior; it cannot capture how users will respond to the new model's outputs, how the model interacts with product changes made since training data was collected, or whether optimizing your proxy metric actually improves the business objective you care about. Always pair offline evaluation with online testing before full production rollout.
Experimental Design Basics
A rigorous A/B test requires careful experimental design before collecting a single data point.
- Randomization unit: user-level (consistent experience) vs request-level (more samples); choose based on whether model affects session state
- Treatment vs control: new model (treatment) vs current production model (control)
- Primary metric: define ONE primary success metric before the test; secondary metrics are diagnostic only
- Guard rails: metrics that must not degrade (latency SLA, error rate, engagement floor)
- Test duration: at least one full week to capture day-of-week effects; ideally two weeks
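User-level randomization is usually implemented with deterministic hashing rather than stored assignments, so a user sees a consistent experience across sessions. A minimal sketch (the salt and bucketing scheme are illustrative, not prescribed by any particular framework):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to treatment or control.

    Hashing (salt + user_id) yields a stable, approximately uniform
    bucket in [0, 100); the same user always lands in the same group
    for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Stable across calls — no assignment table needed
assert assign_variant("user-42", "ranker-v2") == assign_variant("user-42", "ranker-v2")
```

Using a per-experiment salt decorrelates assignments across experiments, so a user in the treatment of one test is not systematically in the treatment of the next.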
Statistical Significance
Declare a winner only after achieving pre-specified statistical significance — not when the graph looks good or the numbers move in the right direction.
- Significance level α: typically 0.05; reject H0 when p-value < α
- Power (1-β): typically 80%; probability of detecting a real effect when it exists
- t-test: use for approximately normal continuous metrics (revenue, session duration)
- Mann-Whitney U: non-parametric alternative; robust to skewed distributions
- Chi-squared: for binary outcomes (click/no-click, convert/no-convert)
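The three tests above map directly onto SciPy's `stats` module. A sketch with made-up data (the counts and distributions are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Binary outcome (conversion): chi-squared on a 2x2 contingency table
# rows = variant, cols = [converted, not converted]
table = np.array([[130, 870],    # control:   13.0% conversion
                  [160, 840]])   # treatment: 16.0% conversion
chi2, p_binary, dof, _ = stats.chi2_contingency(table)

# Continuous metric (session duration, seconds): two-sample t-test
control = rng.normal(loc=300, scale=60, size=1000)
treatment = rng.normal(loc=310, scale=60, size=1000)
t_stat, p_cont = stats.ttest_ind(treatment, control)

# Skewed metric (revenue): Mann-Whitney U, no normality assumption
rev_control = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
rev_treatment = rng.lognormal(mean=3.1, sigma=1.0, size=1000)
u_stat, p_skewed = stats.mannwhitneyu(rev_treatment, rev_control,
                                      alternative="two-sided")

print(f"chi2 p={p_binary:.4f}, t-test p={p_cont:.4f}, Mann-Whitney p={p_skewed:.4f}")
```

Each p-value is then compared against the pre-specified α; remember that only the primary metric's test decides the experiment.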
Sample Size Calculation
Determine the minimum required sample size before starting the test, based on your minimum detectable effect (MDE) and baseline metric.
```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

# Parameters
alpha = 0.05     # significance level
power = 0.80     # desired power
baseline = 0.12  # current conversion rate
mde = 0.01       # minimum detectable effect (absolute)

# Effect size (Cohen's d approximation for proportions)
effect_size = mde / (baseline * (1 - baseline)) ** 0.5

analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,  # equal group sizes
    alternative='two-sided',
)
print(f"Required n per group: {ceil(n)}")
```
Avoiding Common Pitfalls
- Peeking: stopping early when the test "looks significant" inflates false positive rate; use sequential testing methods (SPRT, always-valid p-values) if early stopping is needed
- Novelty effect: users interact differently with unfamiliar UI; run tests long enough for novelty to wear off (typically 1–2 weeks)
- Survivorship bias: if users churn during the test, their absence biases results
- Network effects: if users interact with each other, treatment and control are not independent
- Multiple testing: testing 20 metrics with α=0.05 means ~1 false positive by chance; apply Bonferroni correction
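The Bonferroni correction in the last bullet is a one-liner: compare each p-value against α divided by the number of tests. A sketch with illustrative p-values:

```python
# Bonferroni: to keep the family-wise error rate at alpha across m tests,
# each individual test must clear alpha / m (p-values below are invented)
alpha = 0.05
p_values = {
    "ctr": 0.012,
    "conversion": 0.041,
    "session_duration": 0.38,
    "revenue_per_user": 0.049,
}
m = len(p_values)
threshold = alpha / m  # 0.0125 for 4 metrics

significant = [name for name, p in p_values.items() if p < threshold]
print(f"Bonferroni threshold: {threshold}")
print(f"Significant after correction: {significant}")  # only 'ctr' survives
```

Note how two metrics that look significant at α = 0.05 (`conversion`, `revenue_per_user`) fail the corrected threshold; that is exactly the inflated-false-positive problem the correction guards against.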
A/B Test Design by Metric Type
| Metric Type | Statistical Test | Minimum Runtime | Common Pitfalls |
|---|---|---|---|
| Conversion rate (binary) | Chi-squared, z-test for proportions | 2 weeks (capture weekly cycles) | Novelty effect; peeking; low baseline rates requiring large samples |
| Revenue (skewed continuous) | Mann-Whitney U; bootstrap; log-transform then t-test | 2–4 weeks (high variance) | High variance from large transactions; outlier sensitivity |
| Session duration (continuous) | t-test (normal after N>30); Mann-Whitney for small samples | 1–2 weeks | Zero-inflated (no session = 0); censoring for long sessions |
| Ranking quality (NDCG) | Paired t-test per query; Wilcoxon signed-rank | 1 week (high volume search) | Query selection bias; position bias in click data |
| Latency (P99) | Quantile comparison; Mann-Whitney | Several days (load patterns) | Traffic distribution changes; hardware variation |
Progressive Traffic Routing
A canary deployment sends a small initial slice of traffic to the new model, then progressively increases as confidence builds.
- Stage 1 (1–5%): initial canary; watch for serving errors and latency regression
- Stage 2 (10–20%): expand if stage 1 is healthy; monitor quality metrics
- Stage 3 (50%): if metrics are positive at stage 2; collect statistical power for A/B conclusion
- Stage 4 (100%): full promotion after statistical significance confirmed
- Automated rollback at any stage if guard rail metrics breach thresholds
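The staged rollout above can be sketched as a simple control loop. Everything here is hypothetical scaffolding: `get_metrics` and `set_traffic_weight` stand in for whatever monitoring and routing APIs your platform exposes.

```python
# Staged-rollout controller sketch with guard-rail-driven rollback
STAGES = [5, 20, 50, 100]  # canary traffic percentages per stage
GUARD_RAILS = {"error_rate": 0.01, "p99_latency_ms": 250}  # upper limits

def guard_rails_ok(metrics: dict) -> bool:
    """Every guard-rail metric must stay at or below its limit."""
    return all(metrics.get(name, float("inf")) <= limit
               for name, limit in GUARD_RAILS.items())

def run_rollout(get_metrics, set_traffic_weight) -> str:
    for pct in STAGES:
        set_traffic_weight(pct)
        metrics = get_metrics()    # in reality, sampled over a soak period
        if not guard_rails_ok(metrics):
            set_traffic_weight(0)  # automated rollback: canary gets no traffic
            return f"rolled back at {pct}%"
    return "promoted to 100%"

# Usage with stubbed healthy metrics:
result = run_rollout(lambda: {"error_rate": 0.002, "p99_latency_ms": 180},
                     lambda pct: None)
print(result)  # promoted to 100%
```

A real controller would also soak at each stage long enough to collect statistically meaningful quality metrics, not just serving-health signals.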
Traffic Splitting Infrastructure
- Istio (Kubernetes): VirtualService + DestinationRule define traffic weights between service versions; supports header-based routing for internal testing
- AWS CodeDeploy: native canary and linear traffic shifting for Lambda and ECS; built-in CloudWatch alarm integration for automated rollback
- Kubernetes ingress (NGINX/Traefik): weight-based traffic splitting via ingress annotations
- Envoy proxy: fine-grained traffic management; used by most service meshes under the hood
Kubernetes Canary Ingress YAML
```yaml
# Stable model deployment. With a plain Kubernetes Service, traffic splits
# by replica ratio: 9 stable pods vs 1 canary pod ≈ 90% / 10%.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: ml-model
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: ml-model
      version: stable
  template:
    metadata:
      labels:
        app: ml-model
        version: stable
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.4.2
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }
---
# Canary model deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: ml-model
    version: canary
spec:
  replicas: 1  # 1 canary pod vs 9 stable pods ≈ 10% of traffic
  selector:
    matchLabels:
      app: ml-model
      version: canary
  template:
    metadata:
      labels:
        app: ml-model
        version: canary
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.5.0-rc1
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }
---
# Service routes to ALL pods (stable + canary);
# Kubernetes load-balances across every matching pod
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model  # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080
---
# Istio VirtualService: explicit traffic weights (e.g. an exact 95/5 split),
# independent of replica counts — an alternative to the replica-ratio approach
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-model-vs
spec:
  hosts:
    - ml-model-service
  http:
    - route:
        - destination:
            host: ml-model-service
            subset: stable
          weight: 95
        - destination:
            host: ml-model-service
            subset: canary
          weight: 5  # 5% to canary
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-model-dr
spec:
  host: ml-model-service
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
```
How Shadow Mode Works
The new model receives a copy of every production request, generates predictions, and logs them — but those predictions are never returned to users. The existing model continues serving all responses normally.
- Production proxy duplicates each request to both current model and shadow model
- Only the current model's response is returned to the user
- Shadow model predictions are logged alongside current model predictions
- Both sets of predictions are compared offline to detect divergence, errors, and distribution shifts
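The duplication flow above can be sketched as a serving handler. The model clients and log schema are placeholders; a production system would score the shadow model asynchronously, off the request's critical path.

```python
import json
import logging

logger = logging.getLogger("shadow")

def handle_request(request: dict, primary_model, shadow_model) -> dict:
    """Serve the primary model; score the shadow model on a copy.

    Only the primary prediction ever reaches the user. Both predictions
    are logged for offline comparison.
    """
    primary_pred = primary_model(request)

    # Shadow path must never affect the user: swallow its errors, and in a
    # real system run this call asynchronously rather than inline.
    try:
        shadow_pred = shadow_model(dict(request))  # copy of the request
        logger.info(json.dumps({"request_id": request.get("id"),
                                "primary": primary_pred,
                                "shadow": shadow_pred}))
    except Exception:
        logger.exception("shadow model failed; user unaffected")

    return primary_pred  # user only ever sees the primary response
```

The key invariant is that a crashing or slow shadow model degrades nothing visible, which is precisely what makes shadow mode zero-risk.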
When to Use Shadow Mode
- High-stakes domains: healthcare, legal, financial decisions where a wrong prediction has severe consequences
- Regulated industries: models requiring audit trails and staged validation before any user impact
- Major architecture changes: switching model families (tree → neural network) where behavior may differ dramatically
- Latency validation: proving new model meets SLA under real production load before it affects users
- Cold start: validating a first-ever deployment of a model in a new serving stack
Limitations of Shadow Mode
- No real feedback loop: predictions not shown to users → cannot measure clicks, conversions, or engagement outcomes
- Double infrastructure cost: both models must handle full production request volume
- Latency validation only under observed load: cannot simulate traffic spikes beyond current volume
- No novelty effect data: cannot observe how users respond to the new model's specific outputs
- State divergence: models that update state (recommendation session context) will diverge over time in shadow mode
Shadow Mode vs Canary vs A/B Test
| Dimension | Shadow Mode | Canary Deployment | A/B Test |
|---|---|---|---|
| Risk level | Zero — no user impact | Low — small % of users affected | Medium — 50% of users in treatment |
| Infrastructure cost | High — 2× compute for full traffic | Low-Medium — small canary fleet | Medium — equal split infrastructure |
| Feedback speed | None from users (offline comparison only) | Fast — real metric signal within hours to days | Slower — need statistical power (days to weeks) |
| Business metric measurement | Not possible | Possible but limited (small traffic %) | Primary purpose — statistically rigorous |
| Best use case | High-stakes validation; regulated industries; first deployments | Gradual rollout with automated rollback | Definitive comparison of two model versions |
| Rollback ease | N/A — shadow never serves | Instant — remove canary route | Requires stopping experiment and re-routing |
Multi-Armed Bandits
Traditional A/B testing is inherently wasteful: during the experiment, you knowingly route traffic to an inferior model variant. Multi-armed bandits adaptively route more traffic to better-performing variants, balancing exploration (learning about each variant) and exploitation (maximizing outcomes).
Epsilon-Greedy
The simplest bandit algorithm. With probability ε, route to a random variant (explore). With probability 1-ε, route to the current best variant (exploit).
- Simple to implement and understand
- ε is typically 0.05–0.1 in production
- ε-decay: reduce ε over time as confidence grows
- Limitation: treats all non-best variants equally; wastes exploration on clearly bad arms
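A minimal epsilon-greedy router, plus a toy two-arm simulation (the conversion rates are invented for illustration):

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1):
    """Return an arm index: explore uniformly with prob. epsilon, else exploit."""
    n_arms = len(trials)
    if random.random() < epsilon:
        return random.randrange(n_arms)                 # explore
    rates = [s / t if t > 0 else 0.0 for s, t in zip(successes, trials)]
    return max(range(n_arms), key=rates.__getitem__)    # exploit current best

# Toy simulation: arm 1 truly converts better (8% vs 5%)
random.seed(0)
true_rates = [0.05, 0.08]
successes, trials = [0, 0], [0, 0]
for _ in range(20000):
    arm = epsilon_greedy(successes, trials)
    trials[arm] += 1
    successes[arm] += random.random() < true_rates[arm]
print(f"pulls per arm: {trials}")
```

Note the limitation from the last bullet: the ε fraction of traffic is spread uniformly over all non-best arms, however bad they are.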
Thompson Sampling
Maintain a Beta distribution posterior over each variant's true conversion rate. Sample from each posterior; route to the variant with the highest sample. Bayesian and naturally adaptive.
- Automatically concentrates exploration on promising variants
- Updates posteriors after each observation (fully online)
- Mathematically optimal for Bernoulli rewards under certain conditions
- Extends to continuous rewards via normal or log-normal priors
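A sketch of Thompson sampling for binary conversions using Beta posteriors; the priors and true rates below are illustrative:

```python
import random

def thompson_select(alphas, betas):
    """Draw a conversion-rate sample from each arm's Beta posterior;
    route to the arm with the highest draw."""
    draws = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(draws)), key=draws.__getitem__)

def thompson_update(arm, converted, alphas, betas):
    """Conjugate update: Beta(a, b) -> Beta(a+1, b) on success, Beta(a, b+1) on failure."""
    if converted:
        alphas[arm] += 1
    else:
        betas[arm] += 1

# Toy simulation with two variants (true rates 5% vs 8%)
random.seed(0)
true_rates = [0.05, 0.08]
alphas, betas = [1, 1], [1, 1]   # uninformative Beta(1, 1) priors
pulls = [0, 0]
for _ in range(20000):
    arm = thompson_select(alphas, betas)
    pulls[arm] += 1
    thompson_update(arm, random.random() < true_rates[arm], alphas, betas)
print(f"pulls per arm: {pulls}")
```

Because posteriors tighten as data accumulates, exploration of the weaker arm decays automatically, with no ε schedule to tune.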
Upper Confidence Bound (UCB)
Select the variant with the highest upper confidence bound on its estimated reward. Exploration is driven by uncertainty — variants with less data get a confidence bonus.
- UCB1: add bonus term √(2 ln N / n_i) to each variant's mean reward, where N is the total number of pulls and n_i the pulls of arm i
- Deterministic (unlike Thompson sampling) — useful for debugging and auditing
- Strong theoretical guarantees on cumulative regret
- LinUCB: contextual bandit variant — uses request features to personalize arm selection
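A minimal UCB1 selector matching the bonus term above (N is total pulls, n_i is pulls of arm i; reward totals are illustrative):

```python
import math

def ucb1_select(rewards, pulls, total_pulls):
    """Pick the arm maximizing mean reward + sqrt(2 ln N / n_i).
    Arms never pulled are selected first (their bonus is effectively infinite)."""
    best, best_score = 0, float("-inf")
    for i, (r, n) in enumerate(zip(rewards, pulls)):
        if n == 0:
            return i  # pull every arm at least once before comparing
        score = r / n + math.sqrt(2 * math.log(total_pulls) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# With equal pull counts the bonus is identical, so the higher-mean arm wins
print(ucb1_select([5, 8], [100, 100], 200))  # 1
```

Because the selection is a pure function of the observed counts, the same logs always reproduce the same routing decisions, which is the auditability advantage noted above.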
Contextual Bandits
Standard bandits ignore request context — they give all users the same traffic allocation. Contextual bandits use features of the request (user segment, device, time of day) to route to the best arm for that specific context.
- Different model versions may be better for different user segments
- LinUCB and Thompson Sampling both have contextual extensions
- Vowpal Wabbit: production-grade contextual bandit library used at major tech companies
- Higher implementation complexity but dramatically more efficient
A/B Test vs Multi-Armed Bandit
| Dimension | A/B Test | Multi-Armed Bandit |
|---|---|---|
| Exploration strategy | Fixed split (50/50); no adaptation during experiment | Adaptive; concentrates traffic on better-performing variants |
| Time to convergence | Fixed by pre-calculated sample size; typically 1–4 weeks | Faster practical convergence; auto-reduces exploration of bad arms |
| Opportunity cost | High — 50% on inferior arm throughout the experiment | Low — quickly shifts most traffic to better arm |
| Statistical guarantees | Classic frequentist guarantees; well-understood | Regret bounds; Bayesian posterior guarantees; less familiar to stakeholders |
| Implementation complexity | Low — random user assignment, standard tests | Higher — requires online reward collection, posterior updates, routing logic |
| Best for | Definitive, clean causal inference; regulatory reporting; infrequent decisions | Continuous optimization; many variants; high-velocity decisions; personalization |
| Explainability | Easy — fixed groups, clear comparison | Harder — traffic allocation changes over time; requires careful logging |
When to Choose Bandits Over Fixed A/B
Prefer multi-armed bandits when: (1) you are testing many variants simultaneously (3+) and can't afford to waste traffic on clearly bad ones, (2) the business cost of showing an inferior model is high (e.g., revenue-per-request is large), (3) you need continuous real-time optimization rather than a one-time decision, or (4) the reward signal is immediate (clicks, immediate purchases). Prefer fixed A/B tests when: you need clean causal inference for a regulatory filing, you have exactly two variants, or your stakeholders need to interpret and audit the experiment methodology.