Traditional software CI/CD is well-understood: commit code, run tests, build artifact, deploy. ML CI/CD has all of this complexity plus a second axis of change, the data, which makes automation significantly more challenging.
Two Sources of Change
In software, only code triggers a new deployment. In ML, both code and data can independently invalidate a deployed model.
- Code changes: new model architecture, updated feature engineering, changed hyperparameters
- Data changes: new training data arrives, upstream schema changes, distribution shift detected
- Scheduled retraining: time-based triggers regardless of code/data changes
- All of these triggers must eventually produce the same pipeline outcome: a validated model in production
Tests Are Fundamentally Different
Software CI/CD tests check behavioral correctness (unit tests, integration tests). ML CI/CD must also test data quality and model quality, both of which are probabilistic and threshold-based.
- Data validation: schema checks, distribution checks, no-null assertions
- Model quality gates: accuracy, F1, AUC must exceed a threshold to proceed
- Regression tests: new model must not regress on a golden evaluation set
- Bias and fairness checks: performance across demographic subgroups
- Latency tests: inference must complete within SLA constraints
Artifacts Are Model Weights, Not Binaries
Software CI/CD produces a Docker image or binary. ML CI/CD produces model weights (potentially gigabytes in size) that must be versioned, stored, and traced back to the exact data and code that produced them.
- Model artifacts include: weights file, preprocessing pipeline, feature schema, training metadata
- Must be stored in a model registry, not a code repository
- Full lineage required: which data version + code version + hyperparameters produced this model
- Rollback means re-deploying a previous model artifact, not reverting code
Rollback Complexity
In software, rollback means reverting to the previous code version. In ML, rollback means re-deploying a previous model version, which may have been trained on data that is no longer representative of current reality.
- Previous model version may be "safe" but also "wrong" for current data distribution
- Rollback should be automatic when online metrics degrade beyond threshold
- Rollback SLA (how fast can you revert?) is a key system design requirement
- Shadow mode testing reduces rollback frequency by validating before full deployment
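The rollback conditions above can be encoded as a simple decision function. This is a minimal sketch with hypothetical metric names and thresholds; a real system would read these values from a monitoring backend.

```python
# Sketch of an automated rollback decision. Metric names and thresholds are
# illustrative assumptions, not a standard API.
from dataclasses import dataclass

@dataclass
class OnlineMetrics:
    error_rate: float             # fraction of failed serving requests
    p99_latency_ms: float
    auc_delta_vs_champion: float  # online AUC minus champion baseline

def should_rollback(m: OnlineMetrics,
                    max_error_rate: float = 0.01,
                    max_p99_ms: float = 500.0,
                    max_auc_drop: float = 0.03) -> bool:
    """Return True when any online metric degrades past its threshold."""
    return (m.error_rate > max_error_rate
            or m.p99_latency_ms > max_p99_ms
            or m.auc_delta_vs_champion < -max_auc_drop)
```

In practice this check runs on a schedule against windowed metrics, and a True result triggers a registry reload of the previous model version plus an incident alert.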
ML CI/CD vs Software CI/CD Comparison
| Dimension | Software CI/CD | ML CI/CD |
|---|---|---|
| Triggers | Code commit / PR merge | Code commit, data update, scheduled retraining, drift alert |
| Tests | Unit tests, integration tests, linting | Data validation, model quality gates, regression on golden set, bias checks, latency tests |
| Primary artifact | Docker image, compiled binary | Model weights + preprocessing pipeline + metadata package |
| Artifact storage | Container registry (ECR, GCR) | Model registry (MLflow, SageMaker, Vertex AI Model Registry) |
| Rollback mechanism | Redeploy previous image tag | Reload previous model version from registry; may need A/B shadow period |
| Pipeline latency | Minutes (build + test) | Hours to days (data validation + training + evaluation) |
| Reproducibility requirement | Deterministic builds | Same data + code + seeds → same model (requires full lineage tracking) |
A production ML pipeline is a directed acyclic graph (DAG) of steps, each of which can be independently versioned, cached, and re-run. Getting the stage boundaries right is critical for efficient iteration.
Stage 1: Data Validation
- Schema check: column names, dtypes, expected ranges match specification
- Distribution check: feature statistics within expected bounds vs baseline
- Completeness: no-null assertions on required columns
- Row count: sanity check that data volume is in expected range
- Reject the batch and alert on failure; do not proceed to training
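The Stage 1 checks above can be sketched as a plain-pandas gate. This is a minimal illustration (the column specification and bounds are assumptions); a production pipeline would typically use Great Expectations or Pandera instead.

```python
# Minimal Stage 1 data-validation gate in pandas.
import pandas as pd

def validate_batch(df: pd.DataFrame,
                   required_columns: dict,  # name -> numpy dtype kind ("i", "f", "O", ...)
                   min_rows: int,
                   max_rows: int) -> list:
    """Return a list of failure messages; an empty list means the batch passes."""
    failures = []
    for col, kind in required_columns.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")          # schema check
        elif df[col].dtype.kind != kind:
            failures.append(f"{col}: expected dtype kind {kind!r}, "
                            f"got {df[col].dtype.kind!r}")     # dtype check
        elif df[col].isna().any():
            failures.append(f"{col}: contains nulls")          # completeness check
    if not (min_rows <= len(df) <= max_rows):
        failures.append(f"row count {len(df)} outside [{min_rows}, {max_rows}]")
    return failures
```

The pipeline step then rejects the batch (and alerts) whenever the returned list is non-empty.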
Stage 2: Feature Engineering
- Apply transforms (scaling, encoding, imputation) fitted on training set only
- Save the fitted preprocessing pipeline as an artifact
- Validate output feature schema before training begins
- Log feature statistics for drift comparison at serving time
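A Stage 2 sketch with scikit-learn: the key discipline is fitting transforms on the training split only, then persisting the fitted pipeline as its own artifact. The toy data and artifact path are illustrative.

```python
# Fit preprocessing on the training split only, then persist it as an artifact.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])  # toy training split
X_train_t = preprocess.fit_transform(X_train)  # fit on train ONLY -- no leakage
# X_val_t = preprocess.transform(X_val)        # validation/serving: transform only

# import joblib
# joblib.dump(preprocess, "artifacts/preprocess.joblib")  # version with the model
```

The same serialized pipeline is loaded at serving time, which guarantees training and inference apply identical transforms.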
Stage 3: Training
- Fix all random seeds for reproducibility
- Log hyperparameters, metrics per epoch, and final metrics to experiment tracker
- Checkpoint model periodically (resume from checkpoint on failure)
- Train on training split only; validation split for early stopping only
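The seed-fixing bullet typically becomes a small helper called at the top of the training script; a sketch (the PyTorch lines are shown commented because that dependency is not assumed here):

```python
# Reproducibility boilerplate: fix every seed source before training begins.
import os
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Fix all RNG sources so a rerun with the same data reproduces the model."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # If training with PyTorch, also:
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
```

Log the seed alongside hyperparameters in the experiment tracker so the run is fully reconstructable.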
Stage 4: Evaluation & Quality Gates
- Evaluate on held-out test set with all quality gate metrics
- Compare against current production model (champion/challenger)
- Run bias/fairness evaluation across demographic slices
- Run latency profiling: P50, P95, P99 inference times
- Block registration if any gate fails; notify on-call team
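The Stage 4 gates reduce to a predicate the pipeline evaluates before registration. This sketch assumes metric dictionaries keyed by illustrative names (`auc`, `f1`):

```python
# Quality-gate sketch: block registration unless the challenger clears fixed
# thresholds AND does not regress against the production champion.
def passes_gates(challenger: dict, champion: dict,
                 min_auc: float = 0.85, min_f1: float = 0.78,
                 tolerance: float = 0.0) -> bool:
    if challenger["auc"] < min_auc or challenger["f1"] < min_f1:
        return False  # absolute thresholds, defined before training began
    if challenger["auc"] < champion["auc"] - tolerance:
        return False  # champion/challenger: must match or exceed production
    return True
```

A non-zero `tolerance` allows small metric noise without blocking; set it from observed run-to-run variance.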
Stage 5: Registration
- Push model + preprocessing pipeline + metadata to model registry
- Tag with training run ID, data version, Git SHA, evaluation metrics
- Set model stage to "Staging" (requires separate promotion to Production)
- Trigger integration test suite against staging serving endpoint
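A Stage 5 sketch with the MLflow client. The model name, run ID, and tag keys are illustrative, and the client calls are shown commented because they require a reachable tracking server; note also that newer MLflow versions favor version aliases over the classic stage transitions shown here.

```python
# Build the lineage tags that make a registered version traceable to its inputs.
def lineage_tags(run_id: str, data_version: str, git_sha: str, metrics: dict) -> dict:
    """Tags answering: which data + code + run produced this model version?"""
    return {
        "run_id": run_id,
        "data_version": data_version,
        "git_sha": git_sha,
        **{f"metric_{k}": str(v) for k, v in metrics.items()},
    }

# import mlflow
# from mlflow.tracking import MlflowClient
# mv = mlflow.register_model(f"runs:/{run_id}/model", "production-classifier")
# client = MlflowClient()
# for k, v in lineage_tags(run_id, "v3.2", git_sha, {"auc": 0.91}).items():
#     client.set_model_version_tag("production-classifier", mv.version, k, v)
# client.transition_model_version_stage("production-classifier", mv.version, "Staging")
```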
Stage 6: Deployment
- Automated deployment to staging environment on registry push
- Manual approval gate before production deployment (for high-stakes models)
- Blue/green or canary rollout strategy
- Update monitoring dashboards with new model version baseline metrics
ML Pipeline Orchestration Tools
| Tool | Strengths | Hosted Option | Learning Curve |
|---|---|---|---|
| Kubeflow Pipelines | Kubernetes-native; containerized steps; strong lineage; scalable | Google Vertex AI Pipelines | High: requires Kubernetes familiarity |
| MLflow Pipelines | Tight MLflow integration; opinionated template structure; simple YAML config | Databricks (managed MLflow) | Low: familiar to MLflow users |
| Metaflow | Python-native; Netflix-proven; great for data scientists; S3/cloud step isolation | Outerbounds (managed Metaflow) | Low-Medium: pure Python decorators |
| Apache Airflow | Battle-tested; huge operator ecosystem; general-purpose DAG scheduling | Astronomer, Cloud Composer, MWAA | Medium: DAG authoring in Python |
| Prefect | Modern Python-first; excellent observability; easier than Airflow for ML teams | Prefect Cloud | Low: minimal boilerplate |
| ZenML | Framework-agnostic; stack-based deployment abstraction; stack recipes | ZenML Cloud | Low-Medium |
Data Tests with Great Expectations
Great Expectations (GX) provides a declarative framework for defining data quality rules as "expectations" that run as part of the pipeline.
- Schema expectations: column exists, column type matches, no unexpected columns
- Distribution expectations: mean between X and Y, std below Z, min/max within range
- Completeness: null percentage below threshold for required columns
- Uniqueness: primary key column has no duplicates
- Produces HTML Data Docs reports, shareable with non-technical stakeholders
Data Tests with Pandera
Pandera provides a pandas-native schema validation library with a Pythonic API, simpler than GX for straightforward schema checks.
- Define schema as a Python class (DataFrameSchema or SchemaModel)
- Column-level checks: dtype, nullable, unique, value ranges, regex patterns
- Row-level checks: cross-column validation rules
- Integrates with pandas, polars, and modin DataFrames
- Raises SchemaError with descriptive message on validation failure
Model Quality Tests
Model tests verify that a freshly trained model meets minimum performance standards before registration.
- Performance thresholds: AUC > 0.85, precision > 0.80, F1 > 0.78, defined before training begins
- Champion/challenger comparison: new model must match or exceed current production model's metrics
- Golden dataset regression: a fixed labeled set of canonical examples on which the new model must reproduce the expected predictions
- Slice evaluation: check performance on demographic subgroups, rare classes, edge cases
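The golden-set regression test above can be sketched as a small helper that compares the new model's predictions to pinned expected values; the function names are illustrative.

```python
# Golden-set regression sketch: canonical inputs with pinned expected predictions.
def golden_regression(model_predict, golden_inputs, expected, max_diffs: int = 0):
    """Fail if the new model changes more than max_diffs canonical predictions."""
    preds = [model_predict(x) for x in golden_inputs]
    diffs = [(x, p, e)
             for x, p, e in zip(golden_inputs, preds, expected) if p != e]
    assert len(diffs) <= max_diffs, \
        f"{len(diffs)} golden predictions changed: {diffs[:5]}"
    return preds
```

Setting `max_diffs > 0` permits a small, reviewed amount of churn on the golden set rather than demanding bit-for-bit stability.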
Smoke Tests for Serving
Before a model goes live, smoke tests validate that the serving infrastructure works end-to-end with a small set of known inputs.
- Send 10–20 canonical requests and verify response schema, latency, and prediction format
- Test edge cases: empty input, missing features, maximum payload size
- Verify that preprocessing pipeline is correctly applied at serving time
- Check that model version in response headers matches expected deployment
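A per-request smoke check can be factored as a pure function over the decoded response and measured latency, which keeps it testable without a live endpoint. The response field names (`prediction`, `model_version`) are assumptions about the serving contract.

```python
# Smoke-test sketch: validate one canonical serving response.
def check_response(resp: dict, elapsed_ms: float,
                   expected_version: str, max_latency_ms: float = 200.0) -> list:
    """Return failure messages for a single canonical request's response."""
    failures = []
    if "prediction" not in resp:
        failures.append("missing 'prediction' field")            # schema check
    elif not isinstance(resp["prediction"], (int, float)):
        failures.append(f"prediction has type "
                        f"{type(resp['prediction']).__name__}")  # format check
    if resp.get("model_version") != expected_version:
        failures.append(f"version mismatch: {resp.get('model_version')}")
    if elapsed_ms > max_latency_ms:
        failures.append(f"latency {elapsed_ms:.0f}ms exceeds {max_latency_ms:.0f}ms")
    return failures
```

The smoke-test script then loops this over the canonical request set and fails the deployment if any list is non-empty.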
Great Expectations Checkpoint YAML Skeleton
# great_expectations/checkpoints/training_data_checkpoint.yml
name: training_data_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "training_data_%Y%m%d_%H%M%S"
validations:
  - batch_request:
      datasource_name: training_data_source
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: features.parquet
      data_connector_query:
        index: -1  # most recent batch
    expectation_suite_name: training_features.warning
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  # Fail pipeline if ANY expectation fails
  - name: send_slack_notification_on_failure
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK_URL}
      notify_on: failure
---
# Expectation suite definition (abbreviated)
# great_expectations/expectations/training_features.warning.json
{
  "expectation_suite_name": "training_features.warning",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "user_id" }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "label", "mostly": 1.0 }
    },
    {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": { "column": "feature_score", "min_value": 0.3, "max_value": 0.7 }
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": { "min_value": 10000, "max_value": 5000000 }
    }
  ]
}
Trigger Strategies
- push / pull_request: trigger on code changes to model or pipeline files
- schedule: cron-based retraining (e.g., weekly retrain with latest data)
- workflow_dispatch: manual trigger with input parameters (data version, hyperparams)
- repository_dispatch: triggered by external system (data pipeline webhook, drift alert)
- Use path filters to only trigger on relevant file changes (avoid retraining on README edits)
Dependency Caching
- Cache pip dependencies using `actions/cache` keyed on the `requirements.txt` hash
- Cache downloaded datasets (if small enough) between runs
- Use DVC remote cache to avoid re-processing unchanged data stages
- Docker layer caching: push base image to registry and pull in subsequent runs
GPU Runners
- GitHub-hosted runners are CPU-only; GPU training requires self-hosted or third-party
- Self-hosted runners: register your GPU server as a GitHub Actions runner
- RunsOn / Ubicloud / Buildjet: managed GPU runner providers for GitHub Actions
- For large training jobs, trigger cloud training (SageMaker, Vertex AI) from the action and poll for completion
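The trigger-and-poll pattern for cloud training can be sketched as a generic loop. The status-fetch callable is injected, so the same loop works for SageMaker (where `describe_training_job` returns a `TrainingJobStatus`), Vertex AI, or a test stub; the terminal status strings below follow SageMaker's convention.

```python
# Sketch of polling a remote training job until it reaches a terminal status.
import time

def wait_for_job(get_status,
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 6 * 3600) -> str:
    """Poll until the job finishes; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()  # e.g. boto3: describe_training_job(...)["TrainingJobStatus"]
        if status == "Completed":
            return status
        if status in ("Failed", "Stopped"):
            raise RuntimeError(f"training job ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not finish in time")
```

Run this from a cheap CPU runner in the workflow; the GPU hours are billed to the cloud training service, not the CI runner.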
Secrets & Environment Management
- Store cloud credentials, API keys, and registry tokens in GitHub Secrets
- Use environment protection rules to require review before production deployments
- OIDC federation: get short-lived cloud credentials without storing long-lived secrets
- Use GitHub Environments to separate staging and production credentials cleanly
GitHub Actions ML Pipeline Workflow
# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'configs/**'
      - 'requirements.txt'
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2am UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'DVC data version tag'
        required: false
        default: 'latest'

permissions:
  id-token: write  # required for OIDC credential federation
  contents: read

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_DEFAULT_REGION: us-east-1

jobs:
  # -- Job 1: Validate data ------------------------------------------
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Pull data with DVC
        run: |
          dvc pull data/training/
          echo "Data pulled successfully"
      - name: Run Great Expectations data validation
        run: |
          python -m great_expectations checkpoint run training_data_checkpoint
        env:
          GE_HOME: ./great_expectations

  # -- Job 2: Train model --------------------------------------------
  train:
    needs: validate-data
    runs-on: [self-hosted, gpu]  # GPU self-hosted runner
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train model
        run: |
          python src/train.py \
            --config configs/train_config.yaml \
            --data-version ${{ github.event.inputs.data_version || 'latest' }} \
            --run-name "ci-${{ github.sha }}"
        env:
          MLFLOW_EXPERIMENT_NAME: production-model
      - name: Export run ID
        id: run_info
        run: echo "run_id=$(cat run_id.txt)" >> $GITHUB_OUTPUT
    outputs:
      run_id: ${{ steps.run_info.outputs.run_id }}

  # -- Job 3: Evaluate and quality gate ------------------------------
  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation with quality gates
        run: |
          python src/evaluate.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --min-auc 0.85 \
            --min-f1 0.78 \
            --compare-champion true
      - name: Run bias evaluation
        run: |
          python src/bias_eval.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --max-demographic-gap 0.05

  # -- Job 4: Register model (only on main branch merge) -------------
  register:
    needs: [train, evaluate]  # train must be listed to read its run_id output
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: staging  # requires environment approval
    steps:
      - uses: actions/checkout@v4
      - name: Register model to MLflow Registry
        run: |
          python src/register_model.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --model-name production-classifier \
            --stage Staging
      - name: Run smoke tests against staging endpoint
        run: python tests/smoke_test_serving.py --env staging
Blue/Green Deployment
Run two identical serving environments: "blue" (current) and "green" (new). When green passes all tests, flip 100% of traffic to it instantaneously.
- Zero-downtime deployment: traffic switch is atomic
- Instant rollback: switch traffic back to blue if issues emerge
- Higher infrastructure cost: two full environments running simultaneously
- Best for: high-availability services where any downtime is unacceptable
Canary Deployment
Route a small percentage of traffic (1–5%) to the new model version. Gradually increase the percentage as confidence builds.
- Limits blast radius if the new model has issues
- Automated promotion: increase canary % on positive metric trend
- Automated rollback: route back to previous model if metrics degrade
- Requires traffic splitting infrastructure (Istio, AWS ALB, Envoy)
- Best for: gradual validation with real traffic before full rollout
Shadow Mode
The new model receives all requests and generates predictions, but those predictions are logged, never served. Users always see the old model's output.
- Zero risk to users: new model has no effect on production responses
- Validate output distribution, latency, and error rates offline
- Compare new vs old predictions on the same real inputs
- Higher infrastructure cost: both models must handle full traffic load
- Best for: high-stakes regulated industries (healthcare, finance, legal)
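The shadow pattern described above can be sketched as a serving wrapper: the champion's answer is returned, the shadow's answer is only logged, and any shadow failure is swallowed so it can never affect users. The models and log sink are stubs/assumptions here.

```python
# Shadow-mode sketch: serve the champion, log the shadow, never mix the two.
import json
import time

def serve_with_shadow(request, champion, shadow, log=print):
    served = champion(request)             # user-visible prediction
    try:
        t0 = time.perf_counter()
        shadow_pred = shadow(request)      # logged only, never returned
        log(json.dumps({
            "request": request,
            "served": served,
            "shadow": shadow_pred,
            "agree": served == shadow_pred,
            "shadow_ms": (time.perf_counter() - t0) * 1000,
        }))
    except Exception as exc:               # a shadow failure must not affect users
        log(json.dumps({"request": request, "shadow_error": str(exc)}))
    return served
```

Offline jobs then aggregate these logs into agreement rates, latency profiles, and output-distribution comparisons before any promotion decision.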
Feature Flags for Model Versions
Use feature flag infrastructure to control which model version a request uses at the application layer, independent of deployment infrastructure.
- Enable new model for specific user segments (beta users, internal team)
- Toggle model versions without redeployment
- A/B test at user level rather than request level for consistent experience
- Tools: LaunchDarkly, Flagsmith, OpenFeature, in-house flag service
Shadow Mode Is the Safest Validation Path
Shadow mode is the most rigorous way to validate a new model before it affects real users. By logging but never serving new model predictions, you can build weeks of evidence about real-world behavior (output distributions, latency under load, handling of edge cases, error rates) with zero risk to users. For any high-stakes deployment (healthcare diagnoses, financial decisions, safety systems), shadow mode should be the mandatory gate before canary or blue/green promotion.
Automated Rollback Triggers
| Metric | Rollback Trigger Threshold | Response Time |
|---|---|---|
| Serving error rate | > 1% for 5 consecutive minutes | Immediate automated rollback |
| P99 inference latency | > 2× baseline for 10 minutes | Automated rollback + alert |
| Output distribution shift | PSI > 0.2 vs shadow baseline | Pager alert + manual review |
| Downstream business metric | Click-through drops > 5% vs control | Canary halt + team review |
| Model quality degradation | Online AUC drops > 3% vs champion | Automated rollback + incident |
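The PSI threshold in the table can be computed with a short NumPy function; this is one common formulation (quantile bins taken from the baseline, small-count clipping to avoid log(0)), shown as a sketch rather than a canonical definition.

```python
# Population Stability Index between a baseline and current output distribution.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI over quantile bins of the baseline; > 0.2 is a common shift alarm."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range values
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b = np.clip(b, 1e-6, None)                       # avoid log(0)
    c = np.clip(c, 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))
```

Identical distributions give a PSI of 0; the rule of thumb is < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift warranting review.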