Traditional software CI/CD is well-understood: commit code, run tests, build artifact, deploy. ML CI/CD has all of this complexity plus a second axis of change, the data, which makes automation significantly more challenging.
Two Sources of Change
In software, only code triggers a new deployment. In ML, both code and data can independently invalidate a deployed model.
- Code changes: new model architecture, updated feature engineering, changed hyperparameters
- Data changes: new training data arrives, upstream schema changes, distribution shift detected
- Scheduled retraining: time-based triggers regardless of code/data changes
- All of these triggers must eventually produce the same pipeline outcome: a validated model in production
Tests Are Fundamentally Different
Software CI/CD tests check behavioral correctness (unit tests, integration tests). ML CI/CD must also test data quality and model quality, both of which are probabilistic and threshold-based.
- Data validation: schema checks, distribution checks, no-null assertions
- Model quality gates: accuracy, F1, AUC must exceed a threshold to proceed
- Regression tests: new model must not regress on a golden evaluation set
- Bias and fairness checks: performance across demographic subgroups
- Latency tests: inference must complete within SLA constraints
Artifacts Are Model Weights, Not Binaries
Software CI/CD produces a Docker image or binary. ML CI/CD produces model weights (potentially gigabytes in size) that must be versioned, stored, and traced back to the exact data and code that produced them.
- Model artifacts include: weights file, preprocessing pipeline, feature schema, training metadata
- Must be stored in a model registry, not a code repository
- Full lineage required: which data version + code version + hyperparameters produced this model
- Rollback means re-deploying a previous model artifact, not reverting code
Rollback Complexity
In software, rollback means reverting to the previous code version. In ML, rollback means re-deploying a previous model version, which may have been trained on data that is no longer representative of current reality.
- Previous model version may be "safe" but also "wrong" for current data distribution
- Rollback should be automatic when online metrics degrade beyond threshold
- Rollback SLA (how fast can you revert?) is a key system design requirement
- Shadow mode testing reduces rollback frequency by validating before full deployment
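The rollback conditions above can be encoded as a simple decision function. This is a minimal sketch with hypothetical metric names and thresholds; a real system would read these values from a monitoring backend.

```python
# Sketch of an automated rollback decision. Metric names and thresholds are
# illustrative assumptions, not a standard API.
from dataclasses import dataclass

@dataclass
class OnlineMetrics:
    error_rate: float             # fraction of failed serving requests
    p99_latency_ms: float
    auc_delta_vs_champion: float  # online AUC minus champion baseline

def should_rollback(m: OnlineMetrics,
                    max_error_rate: float = 0.01,
                    max_p99_ms: float = 500.0,
                    max_auc_drop: float = 0.03) -> bool:
    """Return True when any online metric degrades past its threshold."""
    return (m.error_rate > max_error_rate
            or m.p99_latency_ms > max_p99_ms
            or m.auc_delta_vs_champion < -max_auc_drop)
```

In practice this check runs on a schedule against windowed metrics, and a True result triggers a registry reload of the previous model version plus an incident alert.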
ML CI/CD vs Software CI/CD Comparison
| Dimension | Software CI/CD | ML CI/CD |
|---|---|---|
| Triggers | Code commit / PR merge | Code commit, data update, scheduled retraining, drift alert |
| Tests | Unit tests, integration tests, linting | Data validation, model quality gates, regression on golden set, bias checks, latency tests |
| Primary artifact | Docker image, compiled binary | Model weights + preprocessing pipeline + metadata package |
| Artifact storage | Container registry (ECR, GCR) | Model registry (MLflow, SageMaker, Vertex AI Model Registry) |
| Rollback mechanism | Redeploy previous image tag | Reload previous model version from registry; may need A/B shadow period |
| Pipeline latency | Minutes (build + test) | Hours to days (data validation + training + evaluation) |
| Reproducibility requirement | Deterministic builds | Same data + code + seeds → same model (requires full lineage tracking) |
A production ML pipeline is a directed acyclic graph (DAG) of steps, each of which can be independently versioned, cached, and re-run. Getting the stage boundaries right is critical for efficient iteration.
Stage 1: Data Validation
- Schema check: column names, dtypes, expected ranges match specification
- Distribution check: feature statistics within expected bounds vs baseline
- Completeness: no-null assertions on required columns
- Row count: sanity check that data volume is in expected range
- Reject the batch and alert on failure; do not proceed to training
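The Stage 1 checks above can be sketched as a plain-pandas gate. This is a minimal illustration (the column specification and bounds are assumptions); a production pipeline would typically use Great Expectations or Pandera instead.

```python
# Minimal Stage 1 data-validation gate in pandas.
import pandas as pd

def validate_batch(df: pd.DataFrame,
                   required_columns: dict,  # name -> numpy dtype kind ("i", "f", "O", ...)
                   min_rows: int,
                   max_rows: int) -> list:
    """Return a list of failure messages; an empty list means the batch passes."""
    failures = []
    for col, kind in required_columns.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")          # schema check
        elif df[col].dtype.kind != kind:
            failures.append(f"{col}: expected dtype kind {kind!r}, "
                            f"got {df[col].dtype.kind!r}")     # dtype check
        elif df[col].isna().any():
            failures.append(f"{col}: contains nulls")          # completeness check
    if not (min_rows <= len(df) <= max_rows):
        failures.append(f"row count {len(df)} outside [{min_rows}, {max_rows}]")
    return failures
```

The pipeline step then rejects the batch (and alerts) whenever the returned list is non-empty.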
Stage 2: Feature Engineering
- Apply transforms (scaling, encoding, imputation) fitted on training set only
- Save the fitted preprocessing pipeline as an artifact
- Validate output feature schema before training begins
- Log feature statistics for drift comparison at serving time
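A Stage 2 sketch with scikit-learn: the key discipline is fitting transforms on the training split only, then persisting the fitted pipeline as its own artifact. The toy data and artifact path are illustrative.

```python
# Fit preprocessing on the training split only, then persist it as an artifact.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])  # toy training split
X_train_t = preprocess.fit_transform(X_train)  # fit on train ONLY -- no leakage
# X_val_t = preprocess.transform(X_val)        # validation/serving: transform only

# import joblib
# joblib.dump(preprocess, "artifacts/preprocess.joblib")  # version with the model
```

The same serialized pipeline is loaded at serving time, which guarantees training and inference apply identical transforms.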
Stage 3: Training
- Fix all random seeds for reproducibility
- Log hyperparameters, metrics per epoch, and final metrics to experiment tracker
- Checkpoint model periodically (resume from checkpoint on failure)
- Train on training split only; validation split for early stopping only
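The seed-fixing bullet typically becomes a small helper called at the top of the training script; a sketch (the PyTorch lines are shown commented because that dependency is not assumed here):

```python
# Reproducibility boilerplate: fix every seed source before training begins.
import os
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Fix all RNG sources so a rerun with the same data reproduces the model."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # If training with PyTorch, also:
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
```

Log the seed alongside hyperparameters in the experiment tracker so the run is fully reconstructable.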
Stage 4: Evaluation & Quality Gates
- Evaluate on held-out test set with all quality gate metrics
- Compare against current production model (champion/challenger)
- Run bias/fairness evaluation across demographic slices
- Run latency profiling: P50, P95, P99 inference times
- Block registration if any gate fails; notify on-call team
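The Stage 4 gates reduce to a predicate the pipeline evaluates before registration. This sketch assumes metric dictionaries keyed by illustrative names (`auc`, `f1`):

```python
# Quality-gate sketch: block registration unless the challenger clears fixed
# thresholds AND does not regress against the production champion.
def passes_gates(challenger: dict, champion: dict,
                 min_auc: float = 0.85, min_f1: float = 0.78,
                 tolerance: float = 0.0) -> bool:
    if challenger["auc"] < min_auc or challenger["f1"] < min_f1:
        return False  # absolute thresholds, defined before training began
    if challenger["auc"] < champion["auc"] - tolerance:
        return False  # champion/challenger: must match or exceed production
    return True
```

A non-zero `tolerance` allows small metric noise without blocking; set it from observed run-to-run variance.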
Stage 5: Registration
- Push model + preprocessing pipeline + metadata to model registry
- Tag with training run ID, data version, Git SHA, evaluation metrics
- Set model stage to "Staging" (requires separate promotion to Production)
- Trigger integration test suite against staging serving endpoint
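A Stage 5 sketch with the MLflow client. The model name, run ID, and tag keys are illustrative, and the client calls are shown commented because they require a reachable tracking server; note also that newer MLflow versions favor version aliases over the classic stage transitions shown here.

```python
# Build the lineage tags that make a registered version traceable to its inputs.
def lineage_tags(run_id: str, data_version: str, git_sha: str, metrics: dict) -> dict:
    """Tags answering: which data + code + run produced this model version?"""
    return {
        "run_id": run_id,
        "data_version": data_version,
        "git_sha": git_sha,
        **{f"metric_{k}": str(v) for k, v in metrics.items()},
    }

# import mlflow
# from mlflow.tracking import MlflowClient
# mv = mlflow.register_model(f"runs:/{run_id}/model", "production-classifier")
# client = MlflowClient()
# for k, v in lineage_tags(run_id, "v3.2", git_sha, {"auc": 0.91}).items():
#     client.set_model_version_tag("production-classifier", mv.version, k, v)
# client.transition_model_version_stage("production-classifier", mv.version, "Staging")
```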
Stage 6: Deployment
- Automated deployment to staging environment on registry push
- Manual approval gate before production deployment (for high-stakes models)
- Blue/green or canary rollout strategy
- Update monitoring dashboards with new model version baseline metrics
ML Pipeline Orchestration Tools
| Tool | Strengths | Hosted Option | Learning Curve |
|---|---|---|---|
| Kubeflow Pipelines | Kubernetes-native; containerized steps; strong lineage; scalable | Google Vertex AI Pipelines | High: requires Kubernetes familiarity |
| MLflow Pipelines | Tight MLflow integration; opinionated template structure; simple YAML config | Databricks (managed MLflow) | Low: familiar to MLflow users |
| Metaflow | Python-native; Netflix-proven; great for data scientists; S3/cloud step isolation | Outerbounds (managed Metaflow) | Low-Medium: pure Python decorators |
| Apache Airflow | Battle-tested; huge operator ecosystem; general-purpose DAG scheduling | Astronomer, Cloud Composer, MWAA | Medium: DAG authoring in Python |
| Prefect | Modern Python-first; excellent observability; easier than Airflow for ML teams | Prefect Cloud | Low: minimal boilerplate |
| ZenML | Framework-agnostic; stack-based deployment abstraction; stack recipes | ZenML Cloud | Low-Medium |
Data Tests with Great Expectations
Great Expectations (GX) provides a declarative framework for defining data quality rules as "expectations" that run as part of the pipeline.
- Schema expectations: column exists, column type matches, no unexpected columns
- Distribution expectations: mean between X and Y, std below Z, min/max within range
- Completeness: null percentage below threshold for required columns
- Uniqueness: primary key column has no duplicates
- Produces HTML Data Docs reports, shareable with non-technical stakeholders
Data Tests with Pandera
Pandera provides a pandas-native schema validation library with a Pythonic API, simpler than GX for straightforward schema checks.
- Define schema as a Python class (DataFrameSchema or SchemaModel)
- Column-level checks: dtype, nullable, unique, value ranges, regex patterns
- Row-level checks: cross-column validation rules
- Integrates with pandas, polars, and modin DataFrames
- Raises SchemaError with descriptive message on validation failure
Model Quality Tests
Model tests verify that a freshly trained model meets minimum performance standards before registration.
- Performance thresholds: AUC > 0.85, precision > 0.80, F1 > 0.78, defined before training begins
- Champion/challenger comparison: new model must match or exceed current production model's metrics
- Golden dataset regression: a fixed labeled set of canonical examples on which the new model must reproduce the expected predictions
- Slice evaluation: check performance on demographic subgroups, rare classes, edge cases
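The golden-set regression test above can be sketched as a small helper that compares the new model's predictions to pinned expected values; the function names are illustrative.

```python
# Golden-set regression sketch: canonical inputs with pinned expected predictions.
def golden_regression(model_predict, golden_inputs, expected, max_diffs: int = 0):
    """Fail if the new model changes more than max_diffs canonical predictions."""
    preds = [model_predict(x) for x in golden_inputs]
    diffs = [(x, p, e)
             for x, p, e in zip(golden_inputs, preds, expected) if p != e]
    assert len(diffs) <= max_diffs, \
        f"{len(diffs)} golden predictions changed: {diffs[:5]}"
    return preds
```

Setting `max_diffs > 0` permits a small, reviewed amount of churn on the golden set rather than demanding bit-for-bit stability.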
Smoke Tests for Serving
Before a model goes live, smoke tests validate that the serving infrastructure works end-to-end with a small set of known inputs.
- Send 10–20 canonical requests and verify response schema, latency, and prediction format
- Test edge cases: empty input, missing features, maximum payload size
- Verify that preprocessing pipeline is correctly applied at serving time
- Check that model version in response headers matches expected deployment
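A per-request smoke check can be factored as a pure function over the decoded response and measured latency, which keeps it testable without a live endpoint. The response field names (`prediction`, `model_version`) are assumptions about the serving contract.

```python
# Smoke-test sketch: validate one canonical serving response.
def check_response(resp: dict, elapsed_ms: float,
                   expected_version: str, max_latency_ms: float = 200.0) -> list:
    """Return failure messages for a single canonical request's response."""
    failures = []
    if "prediction" not in resp:
        failures.append("missing 'prediction' field")            # schema check
    elif not isinstance(resp["prediction"], (int, float)):
        failures.append(f"prediction has type "
                        f"{type(resp['prediction']).__name__}")  # format check
    if resp.get("model_version") != expected_version:
        failures.append(f"version mismatch: {resp.get('model_version')}")
    if elapsed_ms > max_latency_ms:
        failures.append(f"latency {elapsed_ms:.0f}ms exceeds {max_latency_ms:.0f}ms")
    return failures
```

The smoke-test script then loops this over the canonical request set and fails the deployment if any list is non-empty.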
Great Expectations Checkpoint YAML Skeleton
# great_expectations/checkpoints/training_data_checkpoint.yml
name: training_data_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "training_data_%Y%m%d_%H%M%S"
validations:
  - batch_request:
      datasource_name: training_data_source
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: features.parquet
      data_connector_query:
        index: -1  # most recent batch
    expectation_suite_name: training_features.warning
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  # Fail pipeline if ANY expectation fails
  - name: send_slack_notification_on_failure
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK_URL}
      notify_on: failure
---
# Expectation suite definition (abbreviated)
# great_expectations/expectations/training_features.warning.json
{
  "expectation_suite_name": "training_features.warning",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "user_id" }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "label", "mostly": 1.0 }
    },
    {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": { "column": "feature_score", "min_value": 0.3, "max_value": 0.7 }
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": { "min_value": 10000, "max_value": 5000000 }
    }
  ]
}
Trigger Strategies
- push / pull_request: trigger on code changes to model or pipeline files
- schedule: cron-based retraining (e.g., weekly retrain with latest data)
- workflow_dispatch: manual trigger with input parameters (data version, hyperparams)
- repository_dispatch: triggered by external system (data pipeline webhook, drift alert)
- Use path filters to only trigger on relevant file changes (avoid retraining on README edits)
Dependency Caching
- Cache pip dependencies using `actions/cache` keyed on the `requirements.txt` hash
- Cache downloaded datasets (if small enough) between runs
- Use DVC remote cache to avoid re-processing unchanged data stages
- Docker layer caching: push base image to registry and pull in subsequent runs
GPU Runners
- GitHub-hosted runners are CPU-only; GPU training requires self-hosted or third-party
- Self-hosted runners: register your GPU server as a GitHub Actions runner
- RunsOn / Ubicloud / Buildjet: managed GPU runner providers for GitHub Actions
- For large training jobs, trigger cloud training (SageMaker, Vertex AI) from the action and poll for completion
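The trigger-and-poll pattern for cloud training can be sketched as a generic loop. The status-fetch callable is injected, so the same loop works for SageMaker (where `describe_training_job` returns a `TrainingJobStatus`), Vertex AI, or a test stub; the terminal status strings below follow SageMaker's convention.

```python
# Sketch of polling a remote training job until it reaches a terminal status.
import time

def wait_for_job(get_status,
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 6 * 3600) -> str:
    """Poll until the job finishes; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()  # e.g. boto3: describe_training_job(...)["TrainingJobStatus"]
        if status == "Completed":
            return status
        if status in ("Failed", "Stopped"):
            raise RuntimeError(f"training job ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not finish in time")
```

Run this from a cheap CPU runner in the workflow; the GPU hours are billed to the cloud training service, not the CI runner.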
Secrets & Environment Management
- Store cloud credentials, API keys, and registry tokens in GitHub Secrets
- Use environment protection rules to require review before production deployments
- OIDC federation: get short-lived cloud credentials without storing long-lived secrets
- Use GitHub Environments to separate staging and production credentials cleanly
GitHub Actions ML Pipeline Workflow
# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'configs/**'
      - 'requirements.txt'
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2am UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'DVC data version tag'
        required: false
        default: 'latest'

permissions:
  id-token: write  # required for OIDC credential federation
  contents: read

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_DEFAULT_REGION: us-east-1

jobs:
  # -- Job 1: Validate data ------------------------------------------
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Pull data with DVC
        run: |
          dvc pull data/training/
          echo "Data pulled successfully"
      - name: Run Great Expectations data validation
        run: |
          python -m great_expectations checkpoint run training_data_checkpoint
        env:
          GE_HOME: ./great_expectations

  # -- Job 2: Train model --------------------------------------------
  train:
    needs: validate-data
    runs-on: [self-hosted, gpu]  # GPU self-hosted runner
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train model
        run: |
          python src/train.py \
            --config configs/train_config.yaml \
            --data-version ${{ github.event.inputs.data_version || 'latest' }} \
            --run-name "ci-${{ github.sha }}"
        env:
          MLFLOW_EXPERIMENT_NAME: production-model
      - name: Export run ID
        id: run_info
        run: echo "run_id=$(cat run_id.txt)" >> $GITHUB_OUTPUT
    outputs:
      run_id: ${{ steps.run_info.outputs.run_id }}

  # -- Job 3: Evaluate and quality gate ------------------------------
  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation with quality gates
        run: |
          python src/evaluate.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --min-auc 0.85 \
            --min-f1 0.78 \
            --compare-champion true
      - name: Run bias evaluation
        run: |
          python src/bias_eval.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --max-demographic-gap 0.05

  # -- Job 4: Register model (only on main branch merge) -------------
  register:
    needs: [train, evaluate]  # train must be listed to read its run_id output
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: staging  # requires environment approval
    steps:
      - uses: actions/checkout@v4
      - name: Register model to MLflow Registry
        run: |
          python src/register_model.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --model-name production-classifier \
            --stage Staging
      - name: Run smoke tests against staging endpoint
        run: python tests/smoke_test_serving.py --env staging
Blue/Green Deployment
Run two identical serving environments: "blue" (current) and "green" (new). When green passes all tests, flip 100% of traffic to it instantaneously.
- Zero-downtime deployment: traffic switch is atomic
- Instant rollback: switch traffic back to blue if issues emerge
- Higher infrastructure cost: two full environments running simultaneously
- Best for: high-availability services where any downtime is unacceptable
Canary Deployment
Route a small percentage of traffic (1–5%) to the new model version. Gradually increase the percentage as confidence builds.
- Limits blast radius if the new model has issues
- Automated promotion: increase canary % on positive metric trend
- Automated rollback: route back to previous model if metrics degrade
- Requires traffic splitting infrastructure (Istio, AWS ALB, Envoy)
- Best for: gradual validation with real traffic before full rollout
Shadow Mode
The new model receives all requests and generates predictions, but those predictions are logged, never served. Users always see the old model's output.
- Zero risk to users: new model has no effect on production responses
- Validate output distribution, latency, and error rates offline
- Compare new vs old predictions on the same real inputs
- Higher infrastructure cost: both models must handle full traffic load
- Best for: high-stakes regulated industries (healthcare, finance, legal)
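The shadow pattern described above can be sketched as a serving wrapper: the champion's answer is returned, the shadow's answer is only logged, and any shadow failure is swallowed so it can never affect users. The models and log sink are stubs/assumptions here.

```python
# Shadow-mode sketch: serve the champion, log the shadow, never mix the two.
import json
import time

def serve_with_shadow(request, champion, shadow, log=print):
    served = champion(request)             # user-visible prediction
    try:
        t0 = time.perf_counter()
        shadow_pred = shadow(request)      # logged only, never returned
        log(json.dumps({
            "request": request,
            "served": served,
            "shadow": shadow_pred,
            "agree": served == shadow_pred,
            "shadow_ms": (time.perf_counter() - t0) * 1000,
        }))
    except Exception as exc:               # a shadow failure must not affect users
        log(json.dumps({"request": request, "shadow_error": str(exc)}))
    return served
```

Offline jobs then aggregate these logs into agreement rates, latency profiles, and output-distribution comparisons before any promotion decision.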
Feature Flags for Model Versions
Use feature flag infrastructure to control which model version a request uses at the application layer, independent of deployment infrastructure.
- Enable new model for specific user segments (beta users, internal team)
- Toggle model versions without redeployment
- A/B test at user level rather than request level for consistent experience
- Tools: LaunchDarkly, Flagsmith, OpenFeature, in-house flag service
Shadow Mode Is the Safest Validation Path
Shadow mode is the most rigorous way to validate a new model before it affects real users. By logging but never serving new model predictions, you can build weeks of evidence about real-world behavior (output distributions, latency under load, handling of edge cases, error rates) with zero risk to users. For any high-stakes deployment (healthcare diagnoses, financial decisions, safety systems), shadow mode should be the mandatory gate before canary or blue/green promotion.
Automated Rollback Triggers
| Metric | Rollback Trigger Threshold | Response Time |
|---|---|---|
| Serving error rate | > 1% for 5 consecutive minutes | Immediate automated rollback |
| P99 inference latency | > 2× baseline for 10 minutes | Automated rollback + alert |
| Output distribution shift | PSI > 0.2 vs shadow baseline | Pager alert + manual review |
| Downstream business metric | Click-through drops > 5% vs control | Canary halt + team review |
| Model quality degradation | Online AUC drops > 3% vs champion | Automated rollback + incident |
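The PSI threshold in the table can be computed with a short NumPy function; this is one common formulation (quantile bins taken from the baseline, small-count clipping to avoid log(0)), shown as a sketch rather than a canonical definition.

```python
# Population Stability Index between a baseline and current output distribution.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI over quantile bins of the baseline; > 0.2 is a common shift alarm."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range values
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b = np.clip(b, 1e-6, None)                       # avoid log(0)
    c = np.clip(c, 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))
```

Identical distributions give a PSI of 0; the rule of thumb is < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift warranting review.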