Staging Never Mirrors Production
No matter how carefully you construct a staging environment, it differs from production in ways that matter for model performance.
- User demographics and behavior in staging are not representative
- Traffic volume differences mean cache behavior, serving latency, and batching dynamics differ
- Staging data may be anonymized or sampled in ways that change feature distributions
- Real-time feature values (current prices, live inventory) only exist in production
- Rare edge cases appear in production at much higher absolute volume
User Behavior Is the Ground Truth
The ultimate measure of a model's value is whether it improves user outcomes or business results. These can only be measured in production with real users making real decisions.
- Users respond to model outputs in complex, hard-to-simulate ways
- Recommendation model quality = did users click, engage, convert, return?
- Search ranking quality = did users find what they were looking for quickly?
- Only real user interaction data can measure these outcomes
The AUC-vs-Revenue Gap
Offline metrics (AUC, F1, NDCG) and online business metrics often diverge. A model with better AUC can perform worse in production because it optimizes a proxy objective rather than the business outcome you actually care about.
- Higher AUC model may show less diverse recommendations → lower overall engagement
- Better calibrated probabilities may reduce clickbait → fewer clicks but higher satisfaction
- Models optimizing engagement can harm long-term retention
- The only way to measure the true objective is production A/B testing
Iterating Safely
Production testing allows rapid iteration while bounding risk. Small traffic percentages limit the blast radius of any single experiment.
- Test 5% of traffic → 95% of users are not affected if the model is bad
- Collect real signal in days rather than waiting for offline proxy metrics
- Fail fast: detect a bad model in hours, not weeks after full deployment
- Compound experiments: run multiple model variants simultaneously across user segments
Offline Evaluation Is Necessary But Not Sufficient
Offline evaluation on a held-out test set is essential — it's your first filter for catching clearly inferior models before they reach users. But it is not sufficient on its own. The test set is a static sample of past behavior; it cannot capture how users will respond to the new model's outputs, how the model interacts with product changes made since training data was collected, or whether optimizing your proxy metric actually improves the business objective you care about. Always pair offline evaluation with online testing before full production rollout.
Experimental Design Basics
A rigorous A/B test requires careful experimental design before collecting a single data point.
- Randomization unit: user-level (consistent experience) vs request-level (more samples); choose based on whether model affects session state
- Treatment vs control: new model (treatment) vs current production model (control)
- Primary metric: define ONE primary success metric before the test; secondary metrics are diagnostic only
- Guard rails: metrics that must not degrade (latency SLA, error rate, engagement floor)
- Test duration: at least one full week to capture day-of-week effects; ideally two weeks
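User-level randomization is usually implemented with deterministic hashing rather than stored assignments, so a user sees a consistent experience across sessions. A minimal sketch (the salt and bucketing scheme are illustrative, not prescribed by any particular framework):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to treatment or control.

    Hashing (salt + user_id) yields a stable, approximately uniform
    bucket in [0, 100); the same user always lands in the same group
    for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Stable across calls — no assignment table needed
assert assign_variant("user-42", "ranker-v2") == assign_variant("user-42", "ranker-v2")
```

Using a per-experiment salt decorrelates assignments across experiments, so a user in the treatment of one test is not systematically in the treatment of the next.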
Statistical Significance
Declare a winner only after achieving pre-specified statistical significance — not when the graph looks good or the numbers move in the right direction.
- Significance level α: typically 0.05; reject H0 when p-value < α
- Power (1-β): typically 80%; probability of detecting a real effect when it exists
- t-test: use for approximately normal continuous metrics (revenue, session duration)
- Mann-Whitney U: non-parametric alternative; robust to skewed distributions
- Chi-squared: for binary outcomes (click/no-click, convert/no-convert)
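The three tests above map directly onto SciPy's `stats` module. A sketch with made-up data (the counts and distributions are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Binary outcome (conversion): chi-squared on a 2x2 contingency table
# rows = variant, cols = [converted, not converted]
table = np.array([[130, 870],    # control:   13.0% conversion
                  [160, 840]])   # treatment: 16.0% conversion
chi2, p_binary, dof, _ = stats.chi2_contingency(table)

# Continuous metric (session duration, seconds): two-sample t-test
control = rng.normal(loc=300, scale=60, size=1000)
treatment = rng.normal(loc=310, scale=60, size=1000)
t_stat, p_cont = stats.ttest_ind(treatment, control)

# Skewed metric (revenue): Mann-Whitney U, no normality assumption
rev_control = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
rev_treatment = rng.lognormal(mean=3.1, sigma=1.0, size=1000)
u_stat, p_skewed = stats.mannwhitneyu(rev_treatment, rev_control,
                                      alternative="two-sided")

print(f"chi2 p={p_binary:.4f}, t-test p={p_cont:.4f}, Mann-Whitney p={p_skewed:.4f}")
```

Each p-value is then compared against the pre-specified α; remember that only the primary metric's test decides the experiment.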
Sample Size Calculation
Determine the minimum required sample size before starting the test, based on your minimum detectable effect (MDE) and baseline metric.
```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

# Parameters
alpha = 0.05     # significance level
power = 0.80     # desired power
baseline = 0.12  # current conversion rate
mde = 0.01       # minimum detectable effect (absolute)

# Effect size (Cohen's d approximation for proportions)
effect_size = mde / (baseline * (1 - baseline)) ** 0.5

analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,  # equal group sizes
    alternative='two-sided',
)
print(f"Required n per group: {ceil(n)}")
```
Avoiding Common Pitfalls
- Peeking: stopping early when the test "looks significant" inflates false positive rate; use sequential testing methods (SPRT, always-valid p-values) if early stopping is needed
- Novelty effect: users interact differently with unfamiliar UI; run tests long enough for novelty to wear off (typically 1–2 weeks)
- Survivorship bias: if users churn during the test, their absence biases results
- Network effects: if users interact with each other, treatment and control are not independent
- Multiple testing: testing 20 metrics with α=0.05 means ~1 false positive by chance; apply Bonferroni correction
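The Bonferroni correction in the last bullet is a one-liner: compare each p-value against α divided by the number of tests. A sketch with illustrative p-values:

```python
# Bonferroni: to keep the family-wise error rate at alpha across m tests,
# each individual test must clear alpha / m (p-values below are invented)
alpha = 0.05
p_values = {
    "ctr": 0.012,
    "conversion": 0.041,
    "session_duration": 0.38,
    "revenue_per_user": 0.049,
}
m = len(p_values)
threshold = alpha / m  # 0.0125 for 4 metrics

significant = [name for name, p in p_values.items() if p < threshold]
print(f"Bonferroni threshold: {threshold}")
print(f"Significant after correction: {significant}")  # only 'ctr' survives
```

Note how two metrics that look significant at α = 0.05 (`conversion`, `revenue_per_user`) fail the corrected threshold; that is exactly the inflated-false-positive problem the correction guards against.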
A/B Test Design by Metric Type
| Metric Type | Statistical Test | Minimum Runtime | Common Pitfalls |
|---|---|---|---|
| Conversion rate (binary) | Chi-squared, z-test for proportions | 2 weeks (capture weekly cycles) | Novelty effect; peeking; low baseline rates requiring large samples |
| Revenue (skewed continuous) | Mann-Whitney U; bootstrap; log-transform then t-test | 2–4 weeks (high variance) | High variance from large transactions; outlier sensitivity |
| Session duration (continuous) | t-test (normal after N>30); Mann-Whitney for small samples | 1–2 weeks | Zero-inflated (no session = 0); censoring for long sessions |
| Ranking quality (NDCG) | Paired t-test per query; Wilcoxon signed-rank | 1 week (high volume search) | Query selection bias; position bias in click data |
| Latency (P99) | Quantile comparison; Mann-Whitney | Several days (load patterns) | Traffic distribution changes; hardware variation |
Progressive Traffic Routing
A canary deployment sends a small initial slice of traffic to the new model, then progressively increases as confidence builds.
- Stage 1 (1–5%): initial canary; watch for serving errors and latency regression
- Stage 2 (10–20%): expand if stage 1 is healthy; monitor quality metrics
- Stage 3 (50%): if metrics are positive at stage 2; collect statistical power for A/B conclusion
- Stage 4 (100%): full promotion after statistical significance confirmed
- Automated rollback at any stage if guard rail metrics breach thresholds
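The staged rollout above can be sketched as a simple control loop. Everything here is hypothetical scaffolding: `get_metrics` and `set_traffic_weight` stand in for whatever monitoring and routing APIs your platform exposes.

```python
# Staged-rollout controller sketch with guard-rail-driven rollback
STAGES = [5, 20, 50, 100]  # canary traffic percentages per stage
GUARD_RAILS = {"error_rate": 0.01, "p99_latency_ms": 250}  # upper limits

def guard_rails_ok(metrics: dict) -> bool:
    """Every guard-rail metric must stay at or below its limit."""
    return all(metrics.get(name, float("inf")) <= limit
               for name, limit in GUARD_RAILS.items())

def run_rollout(get_metrics, set_traffic_weight) -> str:
    for pct in STAGES:
        set_traffic_weight(pct)
        metrics = get_metrics()    # in reality, sampled over a soak period
        if not guard_rails_ok(metrics):
            set_traffic_weight(0)  # automated rollback: canary gets no traffic
            return f"rolled back at {pct}%"
    return "promoted to 100%"

# Usage with stubbed healthy metrics:
result = run_rollout(lambda: {"error_rate": 0.002, "p99_latency_ms": 180},
                     lambda pct: None)
print(result)  # promoted to 100%
```

A real controller would also soak at each stage long enough to collect statistically meaningful quality metrics, not just serving-health signals.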
Traffic Splitting Infrastructure
- Istio (Kubernetes): VirtualService + DestinationRule define traffic weights between service versions; supports header-based routing for internal testing
- AWS CodeDeploy: native canary and linear traffic shifting for Lambda and ECS; built-in CloudWatch alarm integration for automated rollback
- Kubernetes ingress (NGINX/Traefik): weight-based traffic splitting via ingress annotations
- Envoy proxy: fine-grained traffic management; used by most service meshes under the hood
Kubernetes Canary Ingress YAML
```yaml
# Stable model deployment. With a plain Kubernetes Service, traffic splits
# by replica ratio: 9 stable pods vs 1 canary pod ≈ 90% / 10%.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: ml-model
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: ml-model
      version: stable
  template:
    metadata:
      labels:
        app: ml-model
        version: stable
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.4.2
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }
---
# Canary model deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: ml-model
    version: canary
spec:
  replicas: 1  # 1 canary pod vs 9 stable pods ≈ 10% of traffic
  selector:
    matchLabels:
      app: ml-model
      version: canary
  template:
    metadata:
      labels:
        app: ml-model
        version: canary
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml-model:v1.5.0-rc1
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: "2", memory: "4Gi" }
            limits: { cpu: "4", memory: "8Gi" }
---
# Service routes to ALL pods (stable + canary);
# Kubernetes load-balances across every matching pod
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model  # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080
---
# Istio VirtualService: explicit traffic weights (e.g. an exact 95/5 split),
# independent of replica counts — an alternative to the replica-ratio approach
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-model-vs
spec:
  hosts:
    - ml-model-service
  http:
    - route:
        - destination:
            host: ml-model-service
            subset: stable
          weight: 95
        - destination:
            host: ml-model-service
            subset: canary
          weight: 5  # 5% to canary
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-model-dr
spec:
  host: ml-model-service
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
```
How Shadow Mode Works
The new model receives a copy of every production request, generates predictions, and logs them — but those predictions are never returned to users. The existing model continues serving all responses normally.
- Production proxy duplicates each request to both current model and shadow model
- Only the current model's response is returned to the user
- Shadow model predictions are logged alongside current model predictions
- Both sets of predictions are compared offline to detect divergence, errors, and distribution shifts
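The duplication flow above can be sketched as a serving handler. The model clients and log schema are placeholders; a production system would score the shadow model asynchronously, off the request's critical path.

```python
import json
import logging

logger = logging.getLogger("shadow")

def handle_request(request: dict, primary_model, shadow_model) -> dict:
    """Serve the primary model; score the shadow model on a copy.

    Only the primary prediction ever reaches the user. Both predictions
    are logged for offline comparison.
    """
    primary_pred = primary_model(request)

    # Shadow path must never affect the user: swallow its errors, and in a
    # real system run this call asynchronously rather than inline.
    try:
        shadow_pred = shadow_model(dict(request))  # copy of the request
        logger.info(json.dumps({"request_id": request.get("id"),
                                "primary": primary_pred,
                                "shadow": shadow_pred}))
    except Exception:
        logger.exception("shadow model failed; user unaffected")

    return primary_pred  # user only ever sees the primary response
```

The key invariant is that a crashing or slow shadow model degrades nothing visible, which is precisely what makes shadow mode zero-risk.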
When to Use Shadow Mode
- High-stakes domains: healthcare, legal, financial decisions where a wrong prediction has severe consequences
- Regulated industries: models requiring audit trails and staged validation before any user impact
- Major architecture changes: switching model families (tree → neural network) where behavior may differ dramatically
- Latency validation: proving new model meets SLA under real production load before it affects users
- Cold start: validating a first-ever deployment of a model in a new serving stack
Limitations of Shadow Mode
- No real feedback loop: predictions not shown to users → cannot measure clicks, conversions, or engagement outcomes
- Double infrastructure cost: both models must handle full production request volume
- Latency validation only under observed load: cannot simulate traffic spikes beyond current volume
- No novelty effect data: cannot observe how users respond to the new model's specific outputs
- State divergence: models that update state (recommendation session context) will diverge over time in shadow mode
Shadow Mode vs Canary vs A/B Test
| Dimension | Shadow Mode | Canary Deployment | A/B Test |
|---|---|---|---|
| Risk level | Zero — no user impact | Low — small % of users affected | Medium — 50% of users in treatment |
| Infrastructure cost | High — 2× compute for full traffic | Low-Medium — small canary fleet | Medium — equal split infrastructure |
| Feedback speed | None from users (offline comparison only) | Fast — real metric signal within hours to days | Slower — need statistical power (days to weeks) |
| Business metric measurement | Not possible | Possible but limited (small traffic %) | Primary purpose — statistically rigorous |
| Best use case | High-stakes validation; regulated industries; first deployments | Gradual rollout with automated rollback | Definitive comparison of two model versions |
| Rollback ease | N/A — shadow never serves | Instant — remove canary route | Requires stopping experiment and re-routing |
Multi-Armed Bandits
Traditional A/B testing is inherently wasteful: during the experiment, you knowingly route traffic to an inferior model variant. Multi-armed bandits adaptively route more traffic to better-performing variants, balancing exploration (learning about each variant) and exploitation (maximizing outcomes).
Epsilon-Greedy
The simplest bandit algorithm. With probability ε, route to a random variant (explore). With probability 1-ε, route to the current best variant (exploit).
- Simple to implement and understand
- ε is typically 0.05–0.1 in production
- ε-decay: reduce ε over time as confidence grows
- Limitation: treats all non-best variants equally; wastes exploration on clearly bad arms
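A minimal epsilon-greedy router, plus a toy two-arm simulation (the conversion rates are invented for illustration):

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1):
    """Return an arm index: explore uniformly with prob. epsilon, else exploit."""
    n_arms = len(trials)
    if random.random() < epsilon:
        return random.randrange(n_arms)                 # explore
    rates = [s / t if t > 0 else 0.0 for s, t in zip(successes, trials)]
    return max(range(n_arms), key=rates.__getitem__)    # exploit current best

# Toy simulation: arm 1 truly converts better (8% vs 5%)
random.seed(0)
true_rates = [0.05, 0.08]
successes, trials = [0, 0], [0, 0]
for _ in range(20000):
    arm = epsilon_greedy(successes, trials)
    trials[arm] += 1
    successes[arm] += random.random() < true_rates[arm]
print(f"pulls per arm: {trials}")
```

Note the limitation from the last bullet: the ε fraction of traffic is spread uniformly over all non-best arms, however bad they are.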
Thompson Sampling
Maintain a Beta distribution posterior over each variant's true conversion rate. Sample from each posterior; route to the variant with the highest sample. Bayesian and naturally adaptive.
- Automatically concentrates exploration on promising variants
- Updates posteriors after each observation (fully online)
- Mathematically optimal for Bernoulli rewards under certain conditions
- Extends to continuous rewards via normal or log-normal priors
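A sketch of Thompson sampling for binary conversions using Beta posteriors; the priors and true rates below are illustrative:

```python
import random

def thompson_select(alphas, betas):
    """Draw a conversion-rate sample from each arm's Beta posterior;
    route to the arm with the highest draw."""
    draws = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(draws)), key=draws.__getitem__)

def thompson_update(arm, converted, alphas, betas):
    """Conjugate update: Beta(a, b) -> Beta(a+1, b) on success, Beta(a, b+1) on failure."""
    if converted:
        alphas[arm] += 1
    else:
        betas[arm] += 1

# Toy simulation with two variants (true rates 5% vs 8%)
random.seed(0)
true_rates = [0.05, 0.08]
alphas, betas = [1, 1], [1, 1]   # uninformative Beta(1, 1) priors
pulls = [0, 0]
for _ in range(20000):
    arm = thompson_select(alphas, betas)
    pulls[arm] += 1
    thompson_update(arm, random.random() < true_rates[arm], alphas, betas)
print(f"pulls per arm: {pulls}")
```

Because posteriors tighten as data accumulates, exploration of the weaker arm decays automatically, with no ε schedule to tune.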
Upper Confidence Bound (UCB)
Select the variant with the highest upper confidence bound on its estimated reward. Exploration is driven by uncertainty — variants with less data get a confidence bonus.
- UCB1: add bonus term √(2 ln N / n_i) to each variant's mean reward, where N is the total number of pulls and n_i the pulls of arm i
- Deterministic (unlike Thompson sampling) — useful for debugging and auditing
- Strong theoretical guarantees on cumulative regret
- LinUCB: contextual bandit variant — uses request features to personalize arm selection
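A minimal UCB1 selector matching the bonus term above (N is total pulls, n_i is pulls of arm i; reward totals are illustrative):

```python
import math

def ucb1_select(rewards, pulls, total_pulls):
    """Pick the arm maximizing mean reward + sqrt(2 ln N / n_i).
    Arms never pulled are selected first (their bonus is effectively infinite)."""
    best, best_score = 0, float("-inf")
    for i, (r, n) in enumerate(zip(rewards, pulls)):
        if n == 0:
            return i  # pull every arm at least once before comparing
        score = r / n + math.sqrt(2 * math.log(total_pulls) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# With equal pull counts the bonus is identical, so the higher-mean arm wins
print(ucb1_select([5, 8], [100, 100], 200))  # 1
```

Because the selection is a pure function of the observed counts, the same logs always reproduce the same routing decisions, which is the auditability advantage noted above.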
Contextual Bandits
Standard bandits ignore request context — they give all users the same traffic allocation. Contextual bandits use features of the request (user segment, device, time of day) to route to the best arm for that specific context.
- Different model versions may be better for different user segments
- LinUCB and Thompson Sampling both have contextual extensions
- Vowpal Wabbit: production-grade contextual bandit library used at major tech companies
- Higher implementation complexity but dramatically more efficient
A/B Test vs Multi-Armed Bandit
| Dimension | A/B Test | Multi-Armed Bandit |
|---|---|---|
| Exploration strategy | Fixed split (50/50); no adaptation during experiment | Adaptive; concentrates traffic on better-performing variants |
| Time to convergence | Fixed by pre-calculated sample size; typically 1–4 weeks | Faster practical convergence; auto-reduces exploration of bad arms |
| Opportunity cost | High — 50% on inferior arm throughout the experiment | Low — quickly shifts most traffic to better arm |
| Statistical guarantees | Classic frequentist guarantees; well-understood | Regret bounds; Bayesian posterior guarantees; less familiar to stakeholders |
| Implementation complexity | Low — random user assignment, standard tests | Higher — requires online reward collection, posterior updates, routing logic |
| Best for | Definitive, clean causal inference; regulatory reporting; infrequent decisions | Continuous optimization; many variants; high-velocity decisions; personalization |
| Explainability | Easy — fixed groups, clear comparison | Harder — traffic allocation changes over time; requires careful logging |
When to Choose Bandits Over Fixed A/B
Prefer multi-armed bandits when: (1) you are testing many variants simultaneously (3+) and can't afford to waste traffic on clearly bad ones, (2) the business cost of showing an inferior model is high (e.g., revenue-per-request is large), (3) you need continuous real-time optimization rather than a one-time decision, or (4) the reward signal is immediate (clicks, immediate purchases). Prefer fixed A/B tests when: you need clean causal inference for a regulatory filing, you have exactly two variants, or your stakeholders need to interpret and audit the experiment methodology.