⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025

🗄️ What Is a Model Registry?

A model registry is the system of record for trained models. It sits between your experiment tracking system and your deployment infrastructure, answering the question: "Of all the model versions that exist, which one should be running in production right now — and who decided that?" A registry is not a file server. It enforces a lifecycle, tracks lineage, and gates promotion through defined stages.

Lifecycle Stages

Every model version in a registry occupies exactly one lifecycle stage at any point in time. Transitions are logged with timestamps and the identity of the approver.

  • None / Candidate — registered from an experiment run, not yet evaluated for production
  • Staging — passing automated quality gates, undergoing human review or shadow testing
  • Production — the canonical serving version; may have multiple versions in some systems
  • Archived — retired from service; kept for auditing and potential rollback
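The stage semantics above can be sketched as a small state machine. This is a toy illustration, not any registry's actual API; the transition policy shown is an assumed example:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed transition policy: which stage moves are permitted.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Staging"},  # re-evaluate a retired version before any rollback
}

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "None"
    audit_log: list = field(default_factory=list)

    def transition(self, new_stage: str, approver: str) -> None:
        """Move to a new stage, recording timestamp and approver identity."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not permitted")
        self.audit_log.append({
            "from": self.stage,
            "to": new_stage,
            "approver": approver,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.stage = new_stage  # exactly one stage at any point in time

mv = ModelVersion("fraud-detector", version=3)
mv.transition("Staging", approver="akumar")
mv.transition("Production", approver="akumar")
print(mv.stage, len(mv.audit_log))  # Production 2
```

Note that a None → Production jump is rejected by the assumed policy: every version must pass through Staging.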

Lineage Tracking

A registry entry is a pointer to a complete provenance chain, not just a weights file. Full lineage answers regulators, debuggers, and incident responders:

  • Which training data version (DVC hash / S3 path) produced this model?
  • Which git commit of training code was used?
  • Which experiment run (MLflow run ID) generated these weights?
  • Who approved the promotion to Production and when?
  • What evaluation metrics did this version achieve?
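In practice, each lineage question maps to a field captured at registration time. A minimal stdlib sketch with hypothetical helper names (`lineage_record`, `file_sha256` are not a real API); in MLflow these fields would typically be attached with `client.set_model_version_tag`:

```python
import hashlib
import subprocess
from datetime import datetime, timezone

def current_git_commit() -> str:
    """Best-effort capture of the training code's git commit."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except Exception:
        return "unknown"

def file_sha256(path: str) -> str:
    """Content hash of a training data file (DVC uses MD5; SHA-256 here)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def lineage_record(run_id: str, data_path: str, metrics: dict) -> dict:
    """Provenance bundle for a registry entry: one field per lineage question."""
    return {
        "experiment_run_id": run_id,          # which run produced the weights
        "git_commit": current_git_commit(),   # which code version
        "training_data_sha256": file_sha256(data_path),  # which data version
        "metrics": metrics,                   # what it achieved
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
```

The approver identity would be appended later, at promotion time, rather than at registration.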

Approval Workflows

Production model updates should not be automatic — they require human sign-off, especially in regulated domains. A registry formalises this:

  • Automated gates: performance thresholds, bias checks, latency benchmarks
  • Human review: ML engineer approves Staging → Production transition
  • Notification hooks: Slack alerts on stage transitions
  • Audit log: immutable record of who approved what and when
  • Rollback path: one-click revert to previous Production version
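An automated gate reduces to a pure function over evaluation metrics. A minimal sketch; the metric names and thresholds in `GATES` are assumed policy, not a real API:

```python
# Assumed gate policy: metric name -> (direction, threshold).
GATES = {
    "roc_auc": ("min", 0.95),                 # performance threshold
    "p99_latency_ms": ("max", 50.0),          # latency benchmark
    "demographic_parity_gap": ("max", 0.05),  # bias check
}

def check_gates(metrics: dict) -> list:
    """Return gate failures; an empty list means promotion may proceed
    to human review."""
    failures = []
    for name, (direction, threshold) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value} below {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value} above {threshold}")
    return failures

ok = check_gates({"roc_auc": 0.972, "p99_latency_ms": 38.0,
                  "demographic_parity_gap": 0.02})
print("gates passed" if not ok else ok)  # gates passed
```

CI would run this check before the human Staging → Production sign-off; any non-empty result blocks the transition.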

Without Registry vs With Registry

| Concern | Without a Model Registry | With a Model Registry |
|---|---|---|
| Deployment tracking | Shared spreadsheet or institutional memory | Queryable database: which version is in prod right now |
| Rollback | Re-run training script, hope it reproduces | Instant: transition previous version back to Production |
| Compliance audit | Manual investigation across experiment logs | Complete lineage trace in seconds via API |
| Multi-team coordination | Email / Slack "hey I updated the model" | Notification webhooks, stage-change events |
| Model quality gate | Depends on individual discipline | Automated threshold checks block bad models |
| Artifact location | Hard-coded S3 paths in deploy scripts | Registry returns canonical URI for current Production version |

🔧 MLflow Model Registry

MLflow's Model Registry is the most widely deployed open-source registry. It is tightly integrated with the MLflow Tracking system — you register a model directly from a run's artifact, and the registry maintains the full link back to the run's parameters, metrics, and metadata. In MLflow 2.x, mutable aliases (such as @champion and @challenger) supersede the older fixed-stage workflow, which is deprecated.

Model Versions & Aliases

Every registration creates a new immutable version number. Aliases are mutable pointers that deployment code can follow without hardcoding version numbers:

  • Version — immutable integer (1, 2, 3…); links back to the originating run
  • @champion — alias pointing to the currently best model
  • @challenger — alias for the new candidate being A/B tested
  • Serving code loads by alias: models:/fraud-detector@champion
  • Aliases can be reassigned atomically with no serving downtime
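The alias mechanics reduce to mutable pointers over immutable versions. A toy sketch (the storage paths are made up) showing why promotion is a single atomic pointer update rather than a redeploy:

```python
# Toy registry: versions are immutable, aliases are mutable pointers.
# All names and paths here are illustrative.
registry = {
    "versions": {
        1: "s3://models/fraud-detector/1",
        2: "s3://models/fraud-detector/2",
        3: "s3://models/fraud-detector/3",
    },
    "aliases": {"champion": 2, "challenger": 3},
}

def resolve(alias: str) -> str:
    """Serving code follows an alias; it never hardcodes a version number."""
    return registry["versions"][registry["aliases"][alias]]

print(resolve("champion"))           # s3://models/fraud-detector/2
registry["aliases"]["champion"] = 3  # promotion is one atomic pointer update
print(resolve("champion"))           # s3://models/fraud-detector/3
```

Rollback is the same operation in reverse: point the alias back at the previous version.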

Serving from the Registry

MLflow can serve a registered model directly, or you can load it programmatically in your own serving layer:

  • CLI: mlflow models serve -m "models:/fraud-detector@champion"
  • Python: mlflow.sklearn.load_model("models:/fraud-detector/3")
  • Docker: mlflow models build-docker
  • Spark UDF for batch inference on DataFrames
  • Databricks Model Serving for production-grade autoscaling

MLflow Registry Architecture

The registry requires a backend store (SQL DB) and an artifact store (object storage). In production:

  • Backend store: PostgreSQL or MySQL (not SQLite — it's single-writer)
  • Artifact store: S3, GCS, Azure Blob, or NFS
  • Auth: Databricks-managed or self-hosted with reverse proxy + OIDC
  • Webhooks: trigger Jenkins/GitHub Actions on stage transitions
  • REST API: every action is available programmatically for automation

Register and Transition a Model in Python

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud-detector"

# ── Option A: Register during training (from log_model) ────────────
with mlflow.start_run() as run:
    # ... training code ...
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name=MODEL_NAME,   # auto-creates or adds version
    )
    run_id = run.info.run_id

# ── Option B: Register an existing run's artifact ─────────────────
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=MODEL_NAME,
)
version = result.version
print(f"Registered as version {version}")

# ── Wait for registration to complete ─────────────────────────────
import time
for _ in range(10):
    mv = client.get_model_version(MODEL_NAME, version)
    if mv.status == "READY":
        break
    time.sleep(2)

# ── Add descriptive metadata ──────────────────────────────────────
client.update_model_version(
    name=MODEL_NAME,
    version=version,
    description="XGBoost v3; trained on 2024-11 data; AUC=0.972, F1=0.891",
)
client.set_model_version_tag(MODEL_NAME, version, "dataset_version", "v2024-11")
client.set_model_version_tag(MODEL_NAME, version, "approved_by", "akumar")

# ── Assign Champion alias (MLflow 2.x preferred approach) ─────────
client.set_registered_model_alias(MODEL_NAME, "challenger", version)

# After shadow / A/B testing passes — promote challenger to champion
client.set_registered_model_alias(MODEL_NAME, "champion", version)

# ── Load by alias in serving code ─────────────────────────────────
champion_model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}@champion")

# ── Searching registered models ───────────────────────────────────
for mv in client.search_model_versions(f"name='{MODEL_NAME}'"):
    # description may be None for versions registered without one
    print(f"  v{mv.version}: {mv.current_stage} | {(mv.description or '')[:50]}")

# ── Legacy stage transitions (MLflow 1.x / 2.x compat) ───────────
client.transition_model_version_stage(
    name=MODEL_NAME,
    version=version,
    stage="Production",
    archive_existing_versions=True,   # archives previous Production version
)

# ── Get current Production version URI (legacy stage API) ─────────
prod_versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
prod_uri = f"models:/{MODEL_NAME}/{prod_versions[0].version}"
print(f"Production model URI: {prod_uri}")

☁️ Cloud Registries

All major cloud providers offer managed model registries deeply integrated with their ML platforms. For teams already committed to a cloud provider, these offer better security, audit logging, and integration with managed training and serving infrastructure than self-hosted MLflow.

| Platform | Registry Product | Key Integration | Notes |
|---|---|---|---|
| AWS | SageMaker Model Registry | SageMaker Pipelines, Endpoints, CodePipeline CI/CD | Model groups with approval workflows; integrates with IAM for fine-grained access control; supports multi-account deployment |
| Google Cloud | Vertex AI Model Registry | Vertex AI Pipelines, Endpoints, Feature Store | Container-first — stores Docker image URIs; evaluation metrics tracked per version; direct integration with Vertex Explainability |
| Microsoft Azure | Azure ML Model Registry | Azure ML Pipelines, Online/Batch Endpoints, Responsible AI dashboard | First-class model card support; deep integration with Azure DevOps; supports MLflow models natively |
| HuggingFace Hub | HuggingFace Model Hub | Transformers, Diffusers, PEFT, Inference API | Best for open model sharing; version via git-based repo; model cards are first-class; private repos require paid plan; Spaces for demos |
| Databricks | Unity Catalog (Models) | MLflow 2.x, Delta Lake, Feature Store, Model Serving | Unified governance across data + models in one catalogue; fine-grained ACLs; lifecycle policies for automated archival |

Choosing Between Self-Hosted MLflow and Cloud Registry

If your team is cloud-native and uses SageMaker/Vertex/Azure ML for training and serving, the cloud registry is the natural choice — the integration is seamless. If you're multi-cloud, on-premises, or want to avoid vendor lock-in, self-hosted MLflow on Kubernetes is mature, free, and widely understood. The MLflow Python client works with all registries that expose the MLflow REST API, including Databricks Unity Catalog.

📁 Artifact Management

An artifact is any file output of a training run that needs to be preserved. Artifact management is the discipline of storing these files reliably, making them addressable by content (not just path), and ensuring that the right set of artifacts is always co-located when a model is deployed.

What Counts as an Artifact

The common mistake is to store only the model weights and assume everything else can be reconstructed. This breaks deployment:

  • Model weights — pickle, ONNX, TorchScript, SavedModel, safetensors
  • Preprocessing pipeline — StandardScaler, OrdinalEncoder, imputer fitted on training data
  • Tokenizer — for NLP models; vocabulary and special tokens are model-specific
  • Feature configuration — which columns, in what order, what dtype
  • Threshold configuration — classification cutoff, confidence threshold
  • Evaluation reports — confusion matrix, PR curve, per-slice metrics as CSV/HTML
  • Model card — documentation as a YAML or Markdown artifact

Artifact Stores

An artifact store is the backing storage where actual file bytes live. The registry stores pointers; the artifact store holds the content:

  • Amazon S3 — most common; use versioning + lifecycle policies
  • Google Cloud Storage — tight Vertex AI integration
  • Azure Blob Storage — pairs with Azure ML Registry
  • MLflow managed artifacts — can back onto any of the above
  • Local NFS — on-prem option; use a distributed filesystem for HA

Content Addressing vs Path Addressing

Path addressing (s3://bucket/models/v3/model.pkl) is fragile — the file can be overwritten silently. Content addressing stores files by their hash:

  • DVC uses MD5 content hashes: same content = same storage key
  • MLflow artifact URIs include the run UUID, preventing collisions
  • S3 Object Lock prevents overwrite for compliance
  • Content-addressed storage enables deduplication across model versions
  • Enables integrity verification: re-hash at load time to detect corruption
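The deduplication and integrity properties follow directly from hashing. A minimal content-addressed store, stdlib only and purely illustrative:

```python
import hashlib
import tempfile
from pathlib import Path

class ContentStore:
    """Minimal content-addressed artifact store: files are keyed by the
    SHA-256 of their bytes, so identical content maps to one storage key."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        path = self.root / key
        if not path.exists():        # same content already stored: dedup for free
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        data = (self.root / key).read_bytes()
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError(f"artifact {key} failed integrity check")
        return data                  # re-hashed on load to detect corruption

with tempfile.TemporaryDirectory() as root:
    store = ContentStore(root)
    key = store.put(b"model weights v3")
    assert store.put(b"model weights v3") == key  # dedup: same key returned
    assert store.get(key) == b"model weights v3"  # content verified on read
```

Overwriting is impossible by construction: changed bytes produce a different key, never a silent replacement at the old one.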

Always Store the Preprocessing Pipeline With the Model

The most common source of training-serving skew is a mismatch between the preprocessing applied at training time and the preprocessing applied at inference time. If your scaler was fit on training data with mean=42.3 and std=8.1, but the serving code creates a new scaler or hardcodes different values, your model's inputs will be out of distribution and performance will silently degrade. The solution: serialise the fitted preprocessor as part of the same model artifact bundle, and always load them together.
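One way to enforce this is to make the fitted preprocessor and the model fields of a single serialised object, so serving code cannot load one without the other. A stdlib sketch with toy stand-ins for the scaler and model (the class names and weights are hypothetical; the scaler statistics reuse the example values above):

```python
import pickle

class FittedScaler:
    """Toy stand-in for a fitted StandardScaler; its statistics travel
    inside the bundle rather than being hardcoded in serving code."""
    def __init__(self, mean: float, std: float):
        self.mean, self.std = mean, std
    def transform(self, x: float) -> float:
        return (x - self.mean) / self.std

class LinearModel:
    """Toy model; real weights would come from training."""
    def __init__(self, weight: float, bias: float):
        self.weight, self.bias = weight, bias
    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

class ModelBundle:
    """One artifact: fitted preprocessor, model, and decision threshold.
    Serving can never load the model without its matching preprocessor."""
    def __init__(self, scaler, model, threshold: float):
        self.scaler, self.model, self.threshold = scaler, model, threshold
    def score(self, raw_x: float) -> bool:
        return self.model.predict(self.scaler.transform(raw_x)) >= self.threshold

bundle = ModelBundle(FittedScaler(mean=42.3, std=8.1), LinearModel(0.9, 0.1), 0.5)
blob = pickle.dumps(bundle)      # serialised together as a single artifact
restored = pickle.loads(blob)    # loaded together: no training-serving skew
```

MLflow achieves the same effect when preprocessing is part of the logged model, for example a scikit-learn Pipeline passed to `log_model`.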

Artifact Lifecycle

| Stage | Retention Policy | Storage Tier | Reason |
|---|---|---|---|
| Active experiment runs (last 30 days) | Retain all | S3 Standard | Frequent comparison and iteration |
| Staged models | Retain while in Staging + 90 days after archival | S3 Standard | Active evaluation and shadow testing |
| Production models (current + 2 prior) | Retain indefinitely | S3 Standard | Rollback capability and compliance |
| Archived models (older than 1 year) | Retain 3 years (or per regulation) | S3 Glacier Instant | Compliance audit at low cost |
| Failed / abandoned runs | Delete after 60 days | S3 Standard-IA | Cost management |

📋 Model Cards & Documentation

A Model Card (Mitchell et al., 2019) is a short structured document attached to a model that describes its intended uses, performance characteristics, and limitations. Originally a research proposal, model cards are now a regulatory expectation in financial services (EU AI Act, SR 11-7), healthcare, and hiring — and a best practice everywhere else. A model without a card is an undocumented system in production.

Model Card Sections (Mitchell et al.)

  • Model details — name, version, type, training date, contact
  • Intended use — primary use case, intended users, out-of-scope uses
  • Factors — demographic groups, environmental conditions considered
  • Metrics — performance measures used and why they were chosen
  • Evaluation data — datasets used for evaluation and their properties
  • Training data — summary of training data (privacy permitting)
  • Quantitative analyses — per-group performance disaggregation
  • Ethical considerations — risks, mitigations, limitations
  • Caveats and recommendations — edge cases, update triggers

Why Regulators Are Requiring Them

The EU AI Act (whose obligations for high-risk systems apply from August 2026) mandates technical documentation for high-risk AI systems that must be kept updated throughout the model's lifecycle.

  • EU AI Act Article 11 — technical documentation including training data description and performance metrics
  • US NIST AI RMF — model cards as a governance artefact in the "Govern" function
  • SR 11-7 (banking) — model validation documentation requirements effectively mandate model cards
  • GDPR Article 22 — right to explanation for automated decisions
  • FDA guidance — software as a medical device requires performance characterisation per subgroup

HuggingFace Model Card Format

HuggingFace established a widely adopted YAML-frontmatter format for model cards that tooling can parse. The metadata section drives model discoverability in the Hub.

  • YAML frontmatter: language, license, datasets, metrics, tags
  • Markdown body: free-form sections following Mitchell et al.
  • huggingface_hub.ModelCard Python class for programmatic creation
  • Automatic render in Hub UI with metric tables and badges
  • Training metadata auto-populated by Trainer callback

Model Card YAML Template

# model_card.yaml
# Store this as an artifact in the model registry alongside the weights

model_details:
  name: "fraud-transaction-detector"
  version: "3.2.1"
  type: "XGBoost binary classifier"
  description: >
    Detects fraudulent credit card transactions in real time.
    Trained on anonymised transaction logs from 2022-2024.
  license: "Proprietary — internal use only"
  contact: "[email protected]"
  training_date: "2024-11-15"
  framework: "XGBoost 2.0.3 / scikit-learn 1.4.0"
  registry_uri: "models:/fraud-detector@champion"

intended_use:
  primary_use: "Real-time fraud scoring at payment authorisation time"
  intended_users:
    - "Fraud operations analysts"
    - "Payment processing pipeline (automated)"
  out_of_scope:
    - "Account takeover detection (different model)"
    - "Transaction amounts > $50,000 (insufficient training data)"
    - "Cryptocurrency transactions"

training_data:
  description: "Anonymised Visa/Mastercard transactions, US market, 2022-01 to 2024-10"
  size: "142M transactions (0.18% positive class)"
  preprocessing: "SMOTE oversampling; StandardScaler on amount; frequency encoding on merchant_id"
  note: "Full schema cannot be disclosed for privacy reasons"

evaluation_data:
  description: "Held-out 20% split, stratified by fraud rate, 2024-09 to 2024-10"
  size: "28.4M transactions"

metrics:
  overall:
    roc_auc: 0.9724
    precision_at_0.5: 0.8312
    recall_at_0.5: 0.7644
    f1_at_0.5: 0.7964
    false_positive_rate: 0.0021

  per_group:
    card_type:
      visa:   { roc_auc: 0.9731, f1: 0.7998 }
      master:  { roc_auc: 0.9708, f1: 0.7901 }
    transaction_type:
      card_present:     { roc_auc: 0.9688, f1: 0.7812 }
      card_not_present: { roc_auc: 0.9751, f1: 0.8043 }

limitations:
  - "Performance degrades for merchants with < 100 historical transactions"
  - "Not calibrated for transaction amounts > $10,000"
  - "Assumes feature pipeline version >= 2.4.0; older pipelines produce different feature values"
  - "May underperform during major retail events (Black Friday) — consider threshold adjustment"

ethical_considerations:
  - concern: "Geographic disparity"
    mitigation: "Per-region performance monitoring; alert if regional FPR diverges by > 0.5%"
  - concern: "Fraud pattern evolution"
    mitigation: "Monthly PSI checks on input features; retrain trigger if PSI > 0.2"

caveats:
  - "Retrain when monthly fraud rate changes by > 15% relative to training baseline"
  - "Model card must be updated before any Production transition"
  - "Threshold may need recalibration after any retraining event"

model_card_version: "1.0"
last_updated: "2024-11-15"
approved_by: "akumar"

Automating Model Card Generation

Manually written model cards become stale. The better approach: generate the metrics section programmatically from the evaluation pipeline, then populate it into the YAML template before registering the model. This way the card is always accurate — a human writes the context sections once, and the pipeline fills in the numbers. Store the card as an MLflow artifact alongside the weights so it travels with the model wherever it goes.