🗄️ What Is a Model Registry
A model registry is the system of record for trained models. It sits between your experiment tracking system and your deployment infrastructure, answering the question: "Of all the model versions that exist, which one should be running in production right now — and who decided that?" A registry is not a file server. It enforces a lifecycle, tracks lineage, and gates promotion through defined stages.
Lifecycle Stages
Every model version in a registry occupies exactly one lifecycle stage at any point in time. Transitions are logged with timestamps and the identity of the approver.
- None / Candidate — registered from an experiment run, not yet evaluated for production
- Staging — passing automated quality gates, undergoing human review or shadow testing
- Production — the canonical serving version; some registries allow more than one version in Production simultaneously
- Archived — retired from service; kept for auditing and potential rollback
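The stage rules above amount to a small state machine. A minimal sketch of how a registry might enforce them (the `transition` function and `audit_log` are illustrative names, not a real registry API):

```python
from datetime import datetime, timezone

# Allowed stage transitions; anything else is rejected.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Staging"},  # re-activation path for rollback
}

audit_log = []

def transition(current: str, target: str, approver: str) -> str:
    """Validate a stage transition and record who approved it, and when."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    audit_log.append({
        "from": current,
        "to": target,
        "approver": approver,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return target

stage = transition("None", "Staging", approver="akumar")
stage = transition(stage, "Production", approver="akumar")
print(stage)           # Production
print(len(audit_log))  # 2 logged transitions
```

The key property is that every successful transition leaves an audit entry, and an illegal one (e.g. Production straight back to Staging) raises instead of silently succeeding.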
Lineage Tracking
A registry entry is a pointer to a complete provenance chain, not just a weights file. Full lineage answers the questions regulators, debuggers, and incident responders ask:
- Which training data version (DVC hash / S3 path) produced this model?
- Which git commit of training code was used?
- Which experiment run (MLflow run ID) generated these weights?
- Who approved the promotion to Production and when?
- What evaluation metrics did this version achieve?
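The questions above define the fields a registry entry must resolve to. An illustrative sketch of that record (field values are placeholders, not real data):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelLineage:
    data_version: str   # DVC hash or S3 path of the training data
    git_commit: str     # commit of the training code
    run_id: str         # experiment-tracking run that produced the weights
    approved_by: str    # who promoted this version
    approved_at: str    # when the promotion happened
    metrics: dict       # evaluation metrics achieved

# Placeholder values for illustration only
lineage = ModelLineage(
    data_version="s3://data/fraud/v2024-11",
    git_commit="deadbeef",
    run_id="run-123",
    approved_by="akumar",
    approved_at="2024-11-15T10:00:00Z",
    metrics={"roc_auc": 0.972},
)
print(asdict(lineage)["data_version"])
```

Because the record is frozen, lineage is immutable once written; a new model means a new record, never an edit to an old one.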
Approval Workflows
Production model updates should not be automatic — they require human sign-off, especially in regulated domains. A registry formalises this:
- Automated gates: performance thresholds, bias checks, latency benchmarks
- Human review: ML engineer approves Staging → Production transition
- Notification hooks: Slack alerts on stage transitions
- Audit log: immutable record of who approved what and when
- Rollback path: one-click revert to previous Production version
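The automated-gates bullet can be made concrete: before any human sees a promotion request, a pipeline step checks the candidate's metrics against hard thresholds. A sketch, with example thresholds (the gate names and values are illustrative):

```python
# Each gate is (direction, threshold): "min" means the metric must be at
# least the threshold, "max" means it must not exceed it.
GATES = {
    "roc_auc": ("min", 0.95),
    "f1": ("min", 0.75),
    "p99_latency_ms": ("max", 20.0),
}

def check_gates(metrics: dict) -> list[str]:
    """Return the list of gate failures; an empty list means promotion may proceed."""
    failures = []
    for name, (direction, threshold) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing metric")
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value} < {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value} > {threshold}")
    return failures

candidate = {"roc_auc": 0.972, "f1": 0.796, "p99_latency_ms": 14.2}
print(check_gates(candidate))  # [] — all gates pass
```

Note that a missing metric fails the gate rather than passing by default: an evaluation pipeline that silently stopped reporting a metric should block promotion, not wave it through.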
Without Registry vs With Registry
| Concern | Without a Model Registry | With a Model Registry |
|---|---|---|
| Deployment tracking | Shared spreadsheet or institutional memory | Queryable database: which version is in prod right now |
| Rollback | Re-run training script, hope it reproduces | Instant: transition previous version back to Production |
| Compliance audit | Manual investigation across experiment logs | Complete lineage trace in seconds via API |
| Multi-team coordination | Email / Slack "hey I updated the model" | Notification webhooks, stage-change events |
| Model quality gate | Depends on individual discipline | Automated threshold checks block bad models |
| Artifact location | Hard-coded S3 paths in deploy scripts | Registry returns canonical URI for current Production version |
🔧 MLflow Model Registry
MLflow's Model Registry is the most widely deployed open-source registry. It is tightly integrated with the MLflow Tracking system — you register a model directly from a run's artifact, and the registry maintains the full link back to the run's parameters, metrics, and metadata. As of MLflow 2.x, aliases (Champion/Challenger) replace the older stage terminology.
Model Versions & Aliases
Every registration creates a new immutable version number. Aliases are mutable pointers that deployment code can follow without hardcoding version numbers:
- Version — immutable integer (1, 2, 3…); links back to the originating run
- @champion — alias pointing to the currently best model
- @challenger — alias for the new candidate being A/B tested
- Serving code loads by alias: `models:/fraud-detector@champion`
- Aliases can be reassigned atomically with no serving downtime
Serving from the Registry
MLflow can serve a registered model directly, or you can load it programmatically in your own serving layer:
- CLI: `mlflow models serve -m "models:/fraud-detector@champion"`
- Python: `mlflow.sklearn.load_model("models:/fraud-detector/3")`
- Docker: `mlflow models build-docker`
- Spark UDF for batch inference on DataFrames
- Databricks Model Serving for production-grade autoscaling
MLflow Registry Architecture
The registry requires a backend store (SQL DB) and an artifact store (object storage). In production:
- Backend store: PostgreSQL or MySQL (not SQLite — it's single-writer)
- Artifact store: S3, GCS, Azure Blob, or NFS
- Auth: Databricks-managed or self-hosted with reverse proxy + OIDC
- Webhooks: trigger Jenkins/GitHub Actions on stage transitions
- REST API: every action is available programmatically for automation
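As a sketch of the "every action is programmatic" point: registry reads map to plain HTTP GETs against the MLflow 2.x REST API. Below only the request URL is constructed (the endpoint path is taken from the MLflow REST API docs; verify it against your server version, and add auth headers before issuing the request):

```python
from urllib.parse import urlencode

BASE = "http://mlflow.internal:5000/api/2.0/mlflow"

def get_registered_model_url(name: str) -> str:
    """Build the URL for fetching a registered model's metadata."""
    return f"{BASE}/registered-models/get?{urlencode({'name': name})}"

url = get_registered_model_url("fraud-detector")
print(url)
# An automation job (CI gate, deploy script) would GET this URL and parse
# the JSON response instead of shelling out to the MLflow CLI.
```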
Register and Transition a Model in Python
```python
import time

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud-detector"

# ── Option A: Register during training (from log_model) ────────────
with mlflow.start_run() as run:
    # ... training code ...
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name=MODEL_NAME,  # auto-creates or adds version
    )
    run_id = run.info.run_id

# ── Option B: Register an existing run's artifact ─────────────────
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=MODEL_NAME,
)
version = result.version
print(f"Registered as version {version}")

# ── Wait for registration to complete ─────────────────────────────
for _ in range(10):
    mv = client.get_model_version(MODEL_NAME, version)
    if mv.status == "READY":
        break
    time.sleep(2)

# ── Add descriptive metadata ──────────────────────────────────────
client.update_model_version(
    name=MODEL_NAME,
    version=version,
    description="XGBoost v3; trained on 2024-11 data; AUC=0.972, F1=0.891",
)
client.set_model_version_tag(MODEL_NAME, version, "dataset_version", "v2024-11")
client.set_model_version_tag(MODEL_NAME, version, "approved_by", "akumar")

# ── Assign Challenger alias (MLflow 2.x preferred approach) ───────
client.set_registered_model_alias(MODEL_NAME, "challenger", version)
# After shadow / A/B testing passes — promote challenger to champion
client.set_registered_model_alias(MODEL_NAME, "champion", version)

# ── Load by alias in serving code ─────────────────────────────────
champion_model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}@champion")

# ── Search registered model versions ──────────────────────────────
for mv in client.search_model_versions(f"name='{MODEL_NAME}'"):
    print(f"  v{mv.version}: {mv.current_stage} | {(mv.description or '')[:50]}")

# ── Legacy stage transitions (MLflow 1.x / 2.x compat) ────────────
client.transition_model_version_stage(
    name=MODEL_NAME,
    version=version,
    stage="Production",
    archive_existing_versions=True,  # archives previous Production version
)

# ── Get current Production version URI for serving ────────────────
prod_versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
prod_uri = f"models:/{MODEL_NAME}/{prod_versions[0].version}"
print(f"Production model URI: {prod_uri}")
```
☁️ Cloud Registries
All major cloud providers offer managed model registries deeply integrated with their ML platforms. For teams already committed to a cloud provider, these offer better security, audit logging, and integration with managed training and serving infrastructure than self-hosted MLflow.
| Platform | Registry Product | Key Integration | Notes |
|---|---|---|---|
| AWS | SageMaker Model Registry | SageMaker Pipelines, Endpoints, CodePipeline CI/CD | Model groups with approval workflows; integrates with IAM for fine-grained access control; supports multi-account deployment |
| Google Cloud | Vertex AI Model Registry | Vertex AI Pipelines, Endpoints, Feature Store | Container-first — stores Docker image URIs; evaluation metrics tracked per version; direct integration with Vertex Explainability |
| Microsoft Azure | Azure ML Model Registry | Azure ML Pipelines, Online/Batch Endpoints, Responsible AI dashboard | First-class model card support; deep integration with Azure DevOps; supports MLflow models natively |
| HuggingFace Hub | HuggingFace Model Hub | Transformers, Diffusers, PEFT, Inference API | Best for open model sharing; version via git-based repo; model cards are first-class; private repos supported (limits vary by plan); Spaces for demos |
| Databricks | Unity Catalog (Models) | MLflow 2.x, Delta Lake, Feature Store, Model Serving | Unified governance across data + models in one catalogue; fine-grained ACLs; lifecycle policies for automated archival |
Choosing Between Self-Hosted MLflow and Cloud Registry
If your team is cloud-native and uses SageMaker/Vertex/Azure ML for training and serving, the cloud registry is the natural choice — the integration is seamless. If you're multi-cloud, on-premises, or want to avoid vendor lock-in, self-hosted MLflow on Kubernetes is mature, free, and widely understood. The MLflow Python client works with all registries that expose the MLflow REST API, including Databricks Unity Catalog.
📁 Artifact Management
An artifact is any file output of a training run that needs to be preserved. Artifact management is the discipline of storing these files reliably, making them addressable by content (not just path), and ensuring that the right set of artifacts is always co-located when a model is deployed.
What Counts as an Artifact
The common mistake is to store only the model weights and assume everything else can be reconstructed. This breaks deployment:
- Model weights — pickle, ONNX, TorchScript, SavedModel, safetensors
- Preprocessing pipeline — StandardScaler, OrdinalEncoder, imputer fitted on training data
- Tokenizer — for NLP models; vocabulary and special tokens are model-specific
- Feature configuration — which columns, in what order, what dtype
- Threshold configuration — classification cutoff, confidence threshold
- Evaluation reports — confusion matrix, PR curve, per-slice metrics as CSV/HTML
- Model card — documentation as a YAML or Markdown artifact
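One way to enforce the checklist above is a pre-deployment completeness check: refuse to ship a bundle that is missing any required artifact. A minimal sketch (the file names are illustrative, not a standard layout):

```python
from pathlib import Path
import tempfile

# Everything that must travel with the weights (names are examples)
REQUIRED = [
    "model.pkl",            # weights
    "preprocessor.pkl",     # fitted scaler / encoder pipeline
    "feature_config.json",  # column order and dtypes
    "thresholds.json",      # classification cutoff
    "model_card.md",        # documentation
]

def missing_artifacts(bundle_dir: str) -> list[str]:
    """Return the names of required artifacts absent from the bundle directory."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

# Demo: a complete bundle passes the check
with tempfile.TemporaryDirectory() as d:
    for name in REQUIRED:
        (Path(d) / name).touch()
    print(missing_artifacts(d))  # []
```

Run as a CI step before registration, this turns "we forgot the tokenizer" from a production incident into a failed pipeline.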
Artifact Stores
An artifact store is the backing storage where actual file bytes live. The registry stores pointers; the artifact store holds the content:
- Amazon S3 — most common; use versioning + lifecycle policies
- Google Cloud Storage — tight Vertex AI integration
- Azure Blob Storage — pairs with Azure ML Registry
- MLflow managed artifacts — can back onto any of the above
- Local NFS — on-prem option; use a distributed filesystem for HA
Content Addressing vs Path Addressing
Path addressing (s3://bucket/models/v3/model.pkl) is fragile — the file can be overwritten silently. Content addressing stores files by their hash:
- DVC uses MD5 content hashes: same content = same storage key
- MLflow artifact URIs include the run UUID, preventing collisions
- S3 Object Lock prevents overwrite for compliance
- Content-addressed storage enables deduplication across model versions
- Enables integrity verification: re-hash at load time to detect corruption
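The mechanics of content addressing fit in a few lines. A toy sketch using an in-memory dict as a stand-in for the object store (real systems key into S3/GCS the same way):

```python
import hashlib

_store: dict[str, bytes] = {}  # stand-in for an object store

def put(content: bytes) -> str:
    """Store bytes under the SHA-256 of their content and return the key."""
    key = hashlib.sha256(content).hexdigest()
    _store[key] = content  # same content -> same key -> free deduplication
    return key

def get(key: str) -> bytes:
    """Fetch bytes and verify integrity by re-hashing at load time."""
    content = _store[key]
    if hashlib.sha256(content).hexdigest() != key:
        raise IOError(f"artifact {key[:12]} is corrupted")
    return content

key = put(b"model-weights-bytes")
assert put(b"model-weights-bytes") == key  # identical content dedupes
print(get(key) == b"model-weights-bytes")  # True
```

Because the key is derived from the bytes, an overwritten or bit-rotted artifact can never masquerade as the original: the load-time re-hash fails instead.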
Always Store the Preprocessing Pipeline With the Model
The most common source of training-serving skew is a mismatch between the preprocessing applied at training time and the preprocessing applied at inference time. If your scaler was fit on training data with mean=42.3 and std=8.1, but the serving code creates a new scaler or hardcodes different values, your model's inputs will be out of distribution and performance will silently degrade. The solution: serialise the fitted preprocessor as part of the same model artifact bundle, and always load them together.
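The bundling idea can be sketched with stand-in objects (in practice the bundle is a fitted sklearn Pipeline or an MLflow model directory, not a hand-rolled dict):

```python
import pickle

# Fitted preprocessor parameters and model parameters travel as ONE artifact.
bundle = {
    "scaler": {"mean": 42.3, "std": 8.1},  # learned from the training data
    "threshold": 0.5,
    "weights": [0.7, -1.2],                # stand-in model parameters
}
blob = pickle.dumps(bundle)                # serialised together, loaded together

# Serving side: never re-fit and never hardcode; read the fitted values
# from the same bundle the model came from.
loaded = pickle.loads(blob)
scaler = loaded["scaler"]
x_scaled = (50.4 - scaler["mean"]) / scaler["std"]
print(round(x_scaled, 3))  # 1.0 — one std above the training mean
```

If the scaler and model are versioned separately, nothing stops a deploy from pairing v3 weights with v2 scaling; a single artifact makes that mismatch structurally impossible.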
Artifact Lifecycle
| Stage | Retention Policy | Storage Tier | Reason |
|---|---|---|---|
| Active experiment runs (last 30 days) | Retain all | S3 Standard | Frequent comparison and iteration |
| Staged models | Retain while in Staging + 90 days after archival | S3 Standard | Active evaluation and shadow testing |
| Production models (current + 2 prior) | Retain indefinitely | S3 Standard | Rollback capability and compliance |
| Archived models (older than 1 year) | Retain 3 years (or per regulation) | S3 Glacier Instant | Compliance audit at low cost |
| Failed / abandoned runs | Delete after 60 days | S3 Standard-IA | Cost management |
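The retention table above can live in code rather than a wiki page, so a cleanup job applies it mechanically. A sketch (stage keys and the `should_delete` helper are illustrative):

```python
# Encoding of the retention table; retain_days=None means keep indefinitely.
POLICY = {
    "active_run": {"retain_days": 30,      "tier": "S3 Standard"},
    "staging":    {"retain_days": 90,      "tier": "S3 Standard"},   # after archival
    "production": {"retain_days": None,    "tier": "S3 Standard"},   # indefinitely
    "archived":   {"retain_days": 3 * 365, "tier": "S3 Glacier Instant"},
    "failed_run": {"retain_days": 60,      "tier": "S3 Standard-IA"},
}

def should_delete(stage: str, age_days: int) -> bool:
    """Return True if an artifact in this stage has outlived its retention window."""
    retain = POLICY[stage]["retain_days"]
    return retain is not None and age_days > retain

print(should_delete("failed_run", 90))   # True — past the 60-day window
print(should_delete("production", 900))  # False — retained indefinitely
```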
📋 Model Cards & Documentation
A Model Card (Mitchell et al., 2019) is a short structured document attached to a model that describes its intended uses, performance characteristics, and limitations. Originally a research proposal, model cards are now a regulatory expectation in financial services (EU AI Act, SR 11-7), healthcare, and hiring — and a best practice everywhere else. A model without a card is an undocumented system in production.
Model Card Sections (Mitchell et al.)
- Model details — name, version, type, training date, contact
- Intended use — primary use case, intended users, out-of-scope uses
- Factors — demographic groups, environmental conditions considered
- Metrics — performance measures used and why they were chosen
- Evaluation data — datasets used for evaluation and their properties
- Training data — summary of training data (privacy permitting)
- Quantitative analyses — per-group performance disaggregation
- Ethical considerations — risks, mitigations, limitations
- Caveats and recommendations — edge cases, update triggers
Why Regulators Are Requiring Them
The EU AI Act (whose obligations for high-risk systems apply from August 2026) mandates technical documentation for high-risk AI systems that must be kept updated throughout the model's lifecycle.
- EU AI Act Article 11 — technical documentation including training data description and performance metrics
- US NIST AI RMF — model cards as a governance artefact in the "Govern" function
- SR 11-7 (banking) — model validation documentation requirements effectively mandate model cards
- GDPR Article 22 — right to explanation for automated decisions
- FDA guidance — software as a medical device requires performance characterisation per subgroup
HuggingFace Model Card Format
HuggingFace established a widely adopted YAML-frontmatter format for model cards that tooling can parse. The metadata section drives model discoverability in the Hub.
- YAML frontmatter: language, license, datasets, metrics, tags
- Markdown body: free-form sections following Mitchell et al.
- `huggingface_hub.ModelCard` Python class for programmatic creation
- Automatic rendering in Hub UI with metric tables and badges
- Training metadata auto-populated by Trainer callback
Model Card YAML Template
```yaml
# model_card.yaml
# Store this as an artifact in the model registry alongside the weights
model_details:
  name: "fraud-transaction-detector"
  version: "3.2.1"
  type: "XGBoost binary classifier"
  description: >
    Detects fraudulent credit card transactions in real time.
    Trained on anonymised transaction logs from 2022-2024.
  license: "Proprietary — internal use only"
  contact: "[email protected]"
  training_date: "2024-11-15"
  framework: "XGBoost 2.0.3 / scikit-learn 1.4.0"
  registry_uri: "models:/fraud-detector@champion"

intended_use:
  primary_use: "Real-time fraud scoring at payment authorisation time"
  intended_users:
    - "Fraud operations analysts"
    - "Payment processing pipeline (automated)"
  out_of_scope:
    - "Account takeover detection (different model)"
    - "Transaction amounts > $50,000 (insufficient training data)"
    - "Cryptocurrency transactions"

training_data:
  description: "Anonymised Visa/Mastercard transactions, US market, 2022-01 to 2024-10"
  size: "142M transactions (0.18% positive class)"
  preprocessing: "SMOTE oversampling; StandardScaler on amount; frequency encoding on merchant_id"
  note: "Full schema cannot be disclosed for privacy reasons"

evaluation_data:
  description: "Held-out 20% split, stratified by fraud rate, 2024-09 to 2024-10"
  size: "28.4M transactions"

metrics:
  overall:
    roc_auc: 0.9724
    precision_at_0.5: 0.8312
    recall_at_0.5: 0.7644
    f1_at_0.5: 0.7964
    false_positive_rate: 0.0021
  per_group:
    card_type:
      visa: { roc_auc: 0.9731, f1: 0.7998 }
      master: { roc_auc: 0.9708, f1: 0.7901 }
    transaction_type:
      card_present: { roc_auc: 0.9688, f1: 0.7812 }
      card_not_present: { roc_auc: 0.9751, f1: 0.8043 }

limitations:
  - "Performance degrades for merchants with < 100 historical transactions"
  - "Not calibrated for transaction amounts > $10,000"
  - "Assumes feature pipeline version >= 2.4.0; older pipelines produce different feature values"
  - "May underperform during major retail events (Black Friday) — consider threshold adjustment"

ethical_considerations:
  - concern: "Geographic disparity"
    mitigation: "Per-region performance monitoring; alert if regional FPR diverges by > 0.5%"
  - concern: "Fraud pattern evolution"
    mitigation: "Monthly PSI checks on input features; retrain trigger if PSI > 0.2"

caveats:
  - "Retrain when monthly fraud rate changes by > 15% relative to training baseline"
  - "Model card must be updated before any Production transition"
  - "Threshold may need recalibration after any retraining event"

model_card_version: "1.0"
last_updated: "2024-11-15"
approved_by: "akumar"
```
Automating Model Card Generation
Manually written model cards become stale. The better approach: generate the metrics section programmatically from the evaluation pipeline, then populate it into the YAML template before registering the model. This way the card is always accurate — a human writes the context sections once, and the pipeline fills in the numbers. Store the card as an MLflow artifact alongside the weights so it travels with the model wherever it goes.
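A minimal sketch of that pipeline step, using stdlib templating (the template keys and `eval_results` are illustrative; a real pipeline would render the full card, e.g. with PyYAML, and log it with `mlflow.log_text(card, "model_card.yaml")`):

```python
from string import Template

# Only the machine-generated section is templated; the human-written
# context sections (intended use, limitations) are authored once and reused.
CARD_TEMPLATE = Template("""\
metrics:
  overall:
    roc_auc: $roc_auc
    f1_at_0.5: $f1
model_card_version: "$card_version"
last_updated: "$date"
""")

# In a real pipeline these values come from the evaluation step's output
eval_results = {"roc_auc": 0.9724, "f1": 0.7964}

card_metrics = CARD_TEMPLATE.substitute(
    roc_auc=eval_results["roc_auc"],
    f1=eval_results["f1"],
    card_version="1.0",
    date="2024-11-15",
)
print(card_metrics)
```

Because the numbers are injected from the same evaluation artifacts the registry stores, the card cannot quietly disagree with the metrics that gated the promotion.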