โฑ 9 min read ๐Ÿ“Š Intermediate ๐Ÿ—“ Updated Jan 2025
โš™๏ธ How ML CI/CD Differs from Software CI/CD

Traditional software CI/CD is well understood: commit code, run tests, build an artifact, deploy. ML CI/CD has all of this complexity plus a second axis of change, the data, that makes automation significantly more challenging.

Two Sources of Change

In software, only code triggers a new deployment. In ML, both code and data can independently invalidate a deployed model.

  • Code changes: new model architecture, updated feature engineering, changed hyperparameters
  • Data changes: new training data arrives, upstream schema changes, distribution shift detected
  • Scheduled retraining: time-based triggers regardless of code/data changes
  • All of these triggers must eventually produce the same pipeline outcome: a validated model in production

Tests Are Fundamentally Different

Software CI/CD tests check behavioral correctness (unit tests, integration tests). ML CI/CD must also test data quality and model quality, both of which are probabilistic and threshold-based.

  • Data validation: schema checks, distribution checks, no-null assertions
  • Model quality gates: accuracy, F1, AUC must exceed a threshold to proceed
  • Regression tests: new model must not regress on a golden evaluation set
  • Bias and fairness checks: performance across demographic subgroups
  • Latency tests: inference must complete within SLA constraints
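To make the contrast concrete, here is a minimal sketch of a threshold-based data check in pure Python. The function name and the three-standard-error bound are illustrative choices, not a standard: the point is that the test bounds statistical surprise rather than asserting exact behavior.

```python
import math

def batch_mean_in_bounds(values, baseline_mean, baseline_std, z_max=3.0):
    """Threshold-based data check: flag a batch whose mean drifts more than
    z_max standard errors from the training baseline. Probabilistic by
    nature -- it bounds surprise, it does not prove correctness."""
    stderr = baseline_std / math.sqrt(len(values))
    z = abs(sum(values) / len(values) - baseline_mean) / stderr
    return z <= z_max

batch_mean_in_bounds([0.5] * 100, baseline_mean=0.5, baseline_std=0.1)  # passes
batch_mean_in_bounds([0.9] * 100, baseline_mean=0.5, baseline_std=0.1)  # fails the gate
```

A passing result does not guarantee the data is good; it only says the batch is statistically unsurprising relative to the baseline.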

Artifacts Are Model Weights, Not Binaries

Software CI/CD produces a Docker image or binary. ML CI/CD produces model weights (potentially gigabytes in size) that must be versioned, stored, and traced back to the exact data and code that produced them.

  • Model artifacts include: weights file, preprocessing pipeline, feature schema, training metadata
  • Must be stored in a model registry, not a code repository
  • Full lineage required: which data version + code version + hyperparameters produced this model
  • Rollback means re-deploying a previous model artifact, not reverting code
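A sketch of what such a lineage record might look like, assuming a hypothetical `build_model_metadata` helper. The field names are illustrative, not any particular registry's schema:

```python
import hashlib

def build_model_metadata(weights_path, data_version, git_sha, hyperparams, metrics):
    """Bundle the lineage needed to trace a model artifact back to the exact
    data, code, and configuration that produced it."""
    with open(weights_path, "rb") as f:
        weights_sha256 = hashlib.sha256(f.read()).hexdigest()
    return {
        "weights_sha256": weights_sha256,  # ties the record to one exact artifact
        "data_version": data_version,      # e.g. a DVC tag
        "git_sha": git_sha,                # code revision
        "hyperparams": hyperparams,
        "metrics": metrics,
    }
```

Whatever registry you use, this record (or its equivalent) is what makes "which data + code + hyperparameters produced this model?" answerable after the fact.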

Rollback Complexity

In software, rollback means reverting to the previous code version. In ML, rollback means re-deploying a previous model version, which may have been trained on data that is no longer representative of current reality.

  • Previous model version may be "safe" but also "wrong" for current data distribution
  • Rollback should be automatic when online metrics degrade beyond threshold
  • Rollback SLA (how fast can you revert?) is a key system design requirement
  • Shadow mode testing reduces rollback frequency by validating before full deployment

ML CI/CD vs Software CI/CD Comparison

| Dimension | Software CI/CD | ML CI/CD |
| --- | --- | --- |
| Triggers | Code commit / PR merge | Code commit, data update, scheduled retraining, drift alert |
| Tests | Unit tests, integration tests, linting | Data validation, model quality gates, regression on golden set, bias checks, latency tests |
| Primary artifact | Docker image, compiled binary | Model weights + preprocessing pipeline + metadata package |
| Artifact storage | Container registry (ECR, GCR) | Model registry (MLflow, SageMaker, Vertex AI Model Registry) |
| Rollback mechanism | Redeploy previous image tag | Reload previous model version from registry; may need A/B shadow period |
| Pipeline latency | Minutes (build + test) | Hours to days (data validation + training + evaluation) |
| Reproducibility requirement | Deterministic builds | Same data + code + seeds → same model (requires full lineage tracking) |
🔄 ML Pipeline Stages

A production ML pipeline is a directed acyclic graph (DAG) of steps, each of which can be independently versioned, cached, and re-run. Getting the stage boundaries right is critical for efficient iteration.

Stage 1: Data Validation

  • Schema check: column names, dtypes, expected ranges match specification
  • Distribution check: feature statistics within expected bounds vs baseline
  • Completeness: no-null assertions on required columns
  • Row count: sanity check that data volume is in expected range
  • Reject the batch and alert on failure; do not proceed to training

Stage 2: Feature Engineering

  • Apply transforms (scaling, encoding, imputation) fitted on training set only
  • Save the fitted preprocessing pipeline as an artifact
  • Validate output feature schema before training begins
  • Log feature statistics for drift comparison at serving time
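The fit-on-training-only rule can be sketched with a minimal hand-rolled scaler. In practice you would use a library such as scikit-learn; this stand-in just shows where the statistics come from and where they are reused:

```python
import json

class FittedScaler:
    """Minimal stand-in for a preprocessing step: statistics are computed on
    the TRAINING split only, then reused unchanged everywhere else."""

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        self.std = (sum((v - self.mean) ** 2 for v in values) / n) ** 0.5 or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def save(self, path):
        # Persist fitted statistics as an artifact next to the model weights.
        with open(path, "w") as f:
            json.dump({"mean": self.mean, "std": self.std}, f)

scaler = FittedScaler().fit([1.0, 2.0, 3.0])  # fit on the training split only
scaled = scaler.transform([4.0])              # same statistics at serving time
```

Saving the fitted object alongside the model is what guarantees that serving applies exactly the transforms training saw.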

Stage 3: Training

  • Fix all random seeds for reproducibility
  • Log hyperparameters, metrics per epoch, and final metrics to experiment tracker
  • Checkpoint model periodically (resume from checkpoint on failure)
  • Train on training split only; validation split for early stopping only
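Seed pinning is the first line of reproducibility. A stdlib-only sketch (extend with `np.random.seed` and `torch.manual_seed` if those libraries are in use; they are not shown here):

```python
import os
import random

def set_global_seeds(seed: int = 42) -> None:
    """Pin stdlib sources of randomness for a reproducible training run."""
    random.seed(seed)
    # Note: setting PYTHONHASHSEED at runtime only affects subprocesses,
    # not the current interpreter's hashing.
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seeds(42)
first = [random.random() for _ in range(3)]
set_global_seeds(42)
second = [random.random() for _ in range(3)]
assert first == second  # identical seeds -> identical draws
```

Seeds alone are not sufficient for bit-identical models (GPU nondeterminism, data ordering), but without them you cannot even start.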

Stage 4: Evaluation & Quality Gates

  • Evaluate on held-out test set with all quality gate metrics
  • Compare against current production model (champion/challenger)
  • Run bias/fairness evaluation across demographic slices
  • Run latency profiling: P50, P95, P99 inference times
  • Block registration if any gate fails; notify on-call team
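The champion/challenger comparison can be reduced to a small sketch. Metric names, values, and the tolerance are illustrative:

```python
def challenger_passes(challenger, champion, tolerance=0.0):
    """Challenger must match or exceed the champion on every gated metric;
    a small tolerance can absorb statistically insignificant dips."""
    return all(challenger[m] >= champion[m] - tolerance for m in champion)

champion   = {"auc": 0.88, "f1": 0.80}
challenger = {"auc": 0.90, "f1": 0.79}
challenger_passes(challenger, champion)                  # False: f1 regressed
challenger_passes(challenger, champion, tolerance=0.02)  # True within tolerance
```

Whether a tolerance is acceptable is a policy decision; for high-stakes models, a strict zero-tolerance gate plus a manual review path is the safer default.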

Stage 5: Registration

  • Push model + preprocessing pipeline + metadata to model registry
  • Tag with training run ID, data version, Git SHA, evaluation metrics
  • Set model stage to "Staging" (requires separate promotion to Production)
  • Trigger integration test suite against staging serving endpoint

Stage 6: Deployment

  • Automated deployment to staging environment on registry push
  • Manual approval gate before production deployment (for high-stakes models)
  • Blue/green or canary rollout strategy
  • Update monitoring dashboards with new model version baseline metrics

ML Pipeline Orchestration Tools

| Tool | Strengths | Hosted Option | Learning Curve |
| --- | --- | --- | --- |
| Kubeflow Pipelines | Kubernetes-native; containerized steps; strong lineage; scalable | Google Vertex AI Pipelines | High (requires Kubernetes familiarity) |
| MLflow Pipelines | Tight MLflow integration; opinionated template structure; simple YAML config | Databricks (managed MLflow) | Low (familiar to MLflow users) |
| Metaflow | Python-native; Netflix-proven; great for data scientists; S3/cloud step isolation | Outerbounds (managed Metaflow) | Low-Medium (pure Python decorators) |
| Apache Airflow | Battle-tested; huge operator ecosystem; general-purpose DAG scheduling | Astronomer, Cloud Composer, MWAA | Medium (DAG authoring in Python) |
| Prefect | Modern Python-first; excellent observability; easier than Airflow for ML teams | Prefect Cloud | Low (minimal boilerplate) |
| ZenML | Framework-agnostic; stack-based deployment abstraction; stack recipes | ZenML Cloud | Low-Medium |
🧪 Testing in ML Pipelines

Data Tests with Great Expectations

Great Expectations (GX) provides a declarative framework for defining data quality rules as "expectations" that run as part of the pipeline.

  • Schema expectations: column exists, column type matches, no unexpected columns
  • Distribution expectations: mean between X and Y, std below Z, min/max within range
  • Completeness: null percentage below threshold for required columns
  • Uniqueness: primary key column has no duplicates
  • Produces HTML Data Docs reports, shareable with non-technical stakeholders

Data Tests with Pandera

Pandera provides a pandas-native schema validation library with a Pythonic API that is simpler than GX for straightforward schema checks.

  • Define schema as a Python class (DataFrameSchema or SchemaModel)
  • Column-level checks: dtype, nullable, unique, value ranges, regex patterns
  • Row-level checks: cross-column validation rules
  • Integrates with pandas, polars, and modin DataFrames
  • Raises SchemaError with descriptive message on validation failure

Model Quality Tests

Model tests verify that a freshly trained model meets minimum performance standards before registration.

  • Performance thresholds: AUC > 0.85, precision > 0.80, F1 > 0.78, defined before training begins
  • Champion/challenger comparison: new model must match or exceed current production model's metrics
  • Golden dataset regression: a fixed labeled set; the new model must produce the same predictions on canonical examples
  • Slice evaluation: check performance on demographic subgroups, rare classes, edge cases
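A golden-set regression check is just a diff against recorded predictions. Sketch, with a hypothetical stand-in model:

```python
def golden_set_mismatches(predict, golden_inputs, golden_expected):
    """Return the inputs where the new model disagrees with the recorded
    canonical predictions; any mismatch blocks promotion."""
    return [x for x, want in zip(golden_inputs, golden_expected)
            if predict(x) != want]

# Hypothetical stand-in model: threshold on a single score.
new_model = lambda score: int(score > 0.5)
mismatches = golden_set_mismatches(new_model, [0.1, 0.6, 0.9], [0, 1, 1])
# mismatches == [] -> no regression on the golden set
```

The golden set must stay fixed across model versions; if it changes, you are no longer measuring regression.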

Smoke Tests for Serving

Before a model goes live, smoke tests validate that the serving infrastructure works end-to-end with a small set of known inputs.

  • Send 10–20 canonical requests and verify response schema, latency, and prediction format
  • Test edge cases: empty input, missing features, maximum payload size
  • Verify that preprocessing pipeline is correctly applied at serving time
  • Check that model version in response headers matches expected deployment
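The core of a serving smoke test fits in a few lines. Here `predict_fn` stands in for the HTTP client call to the staging endpoint, and the response field names are illustrative:

```python
import time

def smoke_test(predict_fn, canonical_inputs, max_latency_s=1.0):
    """Run known-good inputs through the serving path and collect failures
    in response schema and latency."""
    failures = []
    for inputs in canonical_inputs:
        start = time.monotonic()
        response = predict_fn(inputs)
        latency = time.monotonic() - start
        if "prediction" not in response or "model_version" not in response:
            failures.append(("schema", inputs))
        if latency > max_latency_s:
            failures.append(("latency", inputs))
    return failures

# Stand-in client for a staging endpoint (hypothetical).
fake_client = lambda x: {"prediction": 0.7, "model_version": "v12"}
failures = smoke_test(fake_client, [{"feature_score": 0.4}] * 3)
# failures == [] -> safe to proceed
```

In a real pipeline `predict_fn` would wrap a request to the staging URL, and the version check would compare `model_version` against the deployment you expect.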

Great Expectations Checkpoint YAML Skeleton

# great_expectations/checkpoints/training_data_checkpoint.yml
name: training_data_checkpoint
config_version: 1.0

class_name: Checkpoint  # base Checkpoint class, since a custom action_list is supplied below
run_name_template: "training_data_%Y%m%d_%H%M%S"

validations:
  - batch_request:
      datasource_name: training_data_source
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: features.parquet
      data_connector_query:
        index: -1  # most recent batch

    expectation_suite_name: training_features.warning

    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: store_evaluation_params
        action:
          class_name: StoreEvaluationParametersAction
      - name: update_data_docs
        action:
          class_name: UpdateDataDocsAction
      # Alert the team on Slack if ANY expectation fails
      - name: send_slack_notification_on_failure
        action:
          class_name: SlackNotificationAction
          slack_webhook: ${SLACK_WEBHOOK_URL}
          notify_on: failure

---
# Expectation suite definition (abbreviated)
# great_expectations/expectations/training_features.warning.json
{
  "expectation_suite_name": "training_features.warning",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "user_id" }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "label", "mostly": 1.0 }
    },
    {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": { "column": "feature_score", "min_value": 0.3, "max_value": 0.7 }
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": { "min_value": 10000, "max_value": 5000000 }
    }
  ]
}
๐Ÿ™ GitHub Actions for ML

Trigger Strategies

  • push / pull_request: trigger on code changes to model or pipeline files
  • schedule: cron-based retraining (e.g., weekly retrain with latest data)
  • workflow_dispatch: manual trigger with input parameters (data version, hyperparams)
  • repository_dispatch: triggered by external system (data pipeline webhook, drift alert)
  • Use path filters to only trigger on relevant file changes (avoid retraining on README edits)

Dependency Caching

  • Cache pip dependencies using actions/cache keyed on requirements.txt hash
  • Cache downloaded datasets (if small enough) between runs
  • Use DVC remote cache to avoid re-processing unchanged data stages
  • Docker layer caching: push base image to registry and pull in subsequent runs

GPU Runners

  • GitHub-hosted runners are CPU-only; GPU training requires self-hosted or third-party
  • Self-hosted runners: register your GPU server as a GitHub Actions runner
  • RunsOn / Ubicloud / Buildjet: managed GPU runner providers for GitHub Actions
  • For large training jobs, trigger cloud training (SageMaker, Vertex AI) from the action and poll for completion

Secrets & Environment Management

  • Store cloud credentials, API keys, and registry tokens in GitHub Secrets
  • Use environment protection rules to require review before production deployments
  • OIDC federation: get short-lived cloud credentials without storing long-lived secrets
  • Use GitHub Environments to separate staging and production credentials cleanly

GitHub Actions ML Pipeline Workflow

# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'configs/**'
      - 'requirements.txt'
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2am UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'DVC data version tag'
        required: false
        default: 'latest'

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_DEFAULT_REGION: us-east-1

jobs:
  # ── Job 1: Validate data ────────────────────────────────────────
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Pull data with DVC
        run: |
          dvc pull data/training/
          echo "Data pulled successfully"

      - name: Run Great Expectations data validation
        run: |
          great_expectations checkpoint run training_data_checkpoint
        env:
          GE_HOME: ./great_expectations

  # ── Job 2: Train model ──────────────────────────────────────────
  train:
    needs: validate-data
    runs-on: [self-hosted, gpu]  # GPU self-hosted runner
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Train model
        run: |
          python src/train.py \
            --config configs/train_config.yaml \
            --data-version ${{ github.event.inputs.data_version || 'latest' }} \
            --run-name "ci-${{ github.sha }}"
        env:
          MLFLOW_EXPERIMENT_NAME: production-model

      - name: Export run ID
        id: run_info
        run: echo "run_id=$(cat run_id.txt)" >> $GITHUB_OUTPUT

    outputs:
      run_id: ${{ steps.run_info.outputs.run_id }}

  # ── Job 3: Evaluate and quality gate ────────────────────────────
  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evaluation with quality gates
        run: |
          python src/evaluate.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --min-auc 0.85 \
            --min-f1 0.78 \
            --compare-champion true

      - name: Run bias evaluation
        run: |
          python src/bias_eval.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --max-demographic-gap 0.05

  # ── Job 4: Register model (only on main branch merge) ───────────
  register:
    needs: [train, evaluate]  # train must be listed so its run_id output is accessible
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: staging  # requires environment approval
    steps:
      - uses: actions/checkout@v4

      - name: Register model to MLflow Registry
        run: |
          python src/register_model.py \
            --run-id ${{ needs.train.outputs.run_id }} \
            --model-name production-classifier \
            --stage Staging

      - name: Run smoke tests against staging endpoint
        run: python tests/smoke_test_serving.py --env staging
🚀 Deployment Strategies in ML

Blue/Green Deployment

Run two identical serving environments: "blue" (current) and "green" (new). When green passes all tests, flip 100% of traffic to it instantly.

  • Zero-downtime deployment: traffic switch is atomic
  • Instant rollback: switch traffic back to blue if issues emerge
  • Higher infrastructure cost: two full environments running simultaneously
  • Best for: high-availability services where any downtime is unacceptable

Canary Deployment

Route a small percentage of traffic (1–5%) to the new model version. Gradually increase the percentage as confidence builds.

  • Limits blast radius if the new model has issues
  • Automated promotion: increase canary % on positive metric trend
  • Automated rollback: route back to previous model if metrics degrade
  • Requires traffic splitting infrastructure (Istio, AWS ALB, Envoy)
  • Best for: gradual validation with real traffic before full rollout

Shadow Mode

The new model receives all requests and generates predictions, but those predictions are only logged, never served. Users always see the old model's output.

  • Zero risk to users: new model has no effect on production responses
  • Validate output distribution, latency, and error rates offline
  • Compare new vs old predictions on the same real inputs
  • Higher infrastructure cost: both models must handle full traffic load
  • Best for: high-stakes regulated industries (healthcare, finance, legal)

Feature Flags for Model Versions

Use feature flag infrastructure to control which model version a request uses at the application layer, independent of deployment infrastructure.

  • Enable new model for specific user segments (beta users, internal team)
  • Toggle model versions without redeployment
  • A/B test at user level rather than request level for consistent experience
  • Tools: LaunchDarkly, Flagsmith, OpenFeature, in-house flag service
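User-level (rather than request-level) bucketing is what keeps the experience consistent. A minimal sketch of deterministic hash-based bucketing; the rollout percentage and version names are illustrative, and a real flag service adds targeting rules and kill switches on top:

```python
import hashlib

def model_version_for_user(user_id, rollout_percent, new="v2", old="v1"):
    """Deterministic user-level bucketing: the same user always lands in
    the same bucket, so they always see the same model version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new if bucket < rollout_percent else old

model_version_for_user("user-42", rollout_percent=10)  # stable for this user
```

Because the bucket is a pure function of the user ID, ramping from 10% to 20% keeps every user who already had the new model on it.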

Shadow Mode Is the Safest Validation Path

Shadow mode is the most rigorous way to validate a new model before it affects real users. By logging but never serving the new model's predictions, you can build weeks of evidence about real-world behavior (output distributions, latency under load, handling of edge cases, error rates) with zero risk to users. For any high-stakes deployment (healthcare diagnoses, financial decisions, safety systems), shadow mode should be a mandatory gate before canary or blue/green promotion.

Automated Rollback Triggers

| Metric | Rollback Trigger Threshold | Response |
| --- | --- | --- |
| Serving error rate | > 1% for 5 consecutive minutes | Immediate automated rollback |
| P99 inference latency | > 2× baseline for 10 minutes | Automated rollback + alert |
| Output distribution shift | PSI > 0.2 vs shadow baseline | Pager alert + manual review |
| Downstream business metric | Click-through drops > 5% vs control | Canary halt + team review |
| Model quality degradation | Online AUC drops > 3% vs champion | Automated rollback + incident |
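The PSI threshold in the table is easy to compute. A sketch over two binned distributions (the epsilon guard for empty bins is an implementation choice):

```python
import math

def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two binned distributions
    (bin fractions summing to ~1). Common rule of thumb: < 0.1 stable,
    0.1-0.2 moderate shift, > 0.2 significant shift."""
    total = 0.0
    for b, c in zip(baseline, current):
        b, c = max(b, eps), max(c, eps)  # guard against empty bins
        total += (c - b) * math.log(c / b)
    return total

psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])  # 0.0: identical
psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40])  # crosses the 0.2 threshold
```

In a monitoring loop, the baseline bins come from the shadow-mode or training distribution, and the current bins from a sliding window of production outputs.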