⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025

🗄️ What Is a Model Registry?

A model registry is the system of record for trained models. It sits between your experiment tracking system and your deployment infrastructure, answering the question: "Of all the model versions that exist, which one should be running in production right now — and who decided that?" A registry is not a file server. It enforces a lifecycle, tracks lineage, and gates promotion through defined stages.

Lifecycle Stages

Every model version in a registry occupies exactly one lifecycle stage at any point in time. Transitions are logged with timestamps and the identity of the approver.

  • None / Candidate — registered from an experiment run, not yet evaluated for production
  • Staging — passing automated quality gates, undergoing human review or shadow testing
  • Production — the canonical serving version; may have multiple versions in some systems
  • Archived — retired from service; kept for auditing and potential rollback
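The stage semantics above can be sketched as a small state machine. This is a toy illustration, not any registry's actual API; the transition policy shown is an assumed example:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed transition policy: which stage moves are permitted.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Staging"},  # re-evaluate a retired version before any rollback
}

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "None"
    audit_log: list = field(default_factory=list)

    def transition(self, new_stage: str, approver: str) -> None:
        """Move to a new stage, recording timestamp and approver identity."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not permitted")
        self.audit_log.append({
            "from": self.stage,
            "to": new_stage,
            "approver": approver,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.stage = new_stage  # exactly one stage at any point in time

mv = ModelVersion("fraud-detector", version=3)
mv.transition("Staging", approver="akumar")
mv.transition("Production", approver="akumar")
print(mv.stage, len(mv.audit_log))  # Production 2
```

Note that a None → Production jump is rejected by the assumed policy: every version must pass through Staging.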

Lineage Tracking

A registry entry is a pointer to a complete provenance chain, not just a weights file. Full lineage answers regulators, debuggers, and incident responders:

  • Which training data version (DVC hash / S3 path) produced this model?
  • Which git commit of training code was used?
  • Which experiment run (MLflow run ID) generated these weights?
  • Who approved the promotion to Production and when?
  • What evaluation metrics did this version achieve?
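In practice, each lineage question maps to a field captured at registration time. A minimal stdlib sketch with hypothetical helper names (`lineage_record`, `file_sha256` are not a real API); in MLflow these fields would typically be attached with `client.set_model_version_tag`:

```python
import hashlib
import subprocess
from datetime import datetime, timezone

def current_git_commit() -> str:
    """Best-effort capture of the training code's git commit."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except Exception:
        return "unknown"

def file_sha256(path: str) -> str:
    """Content hash of a training data file (DVC uses MD5; SHA-256 here)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def lineage_record(run_id: str, data_path: str, metrics: dict) -> dict:
    """Provenance bundle for a registry entry: one field per lineage question."""
    return {
        "experiment_run_id": run_id,          # which run produced the weights
        "git_commit": current_git_commit(),   # which code version
        "training_data_sha256": file_sha256(data_path),  # which data version
        "metrics": metrics,                   # what it achieved
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
```

The approver identity would be appended later, at promotion time, rather than at registration.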

Approval Workflows

Production model updates should not be automatic — they require human sign-off, especially in regulated domains. A registry formalises this:

  • Automated gates: performance thresholds, bias checks, latency benchmarks
  • Human review: ML engineer approves Staging → Production transition
  • Notification hooks: Slack alerts on stage transitions
  • Audit log: immutable record of who approved what and when
  • Rollback path: one-click revert to previous Production version
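An automated gate reduces to a pure function over evaluation metrics. A minimal sketch; the metric names and thresholds in `GATES` are assumed policy, not a real API:

```python
# Assumed gate policy: metric name -> (direction, threshold).
GATES = {
    "roc_auc": ("min", 0.95),                 # performance threshold
    "p99_latency_ms": ("max", 50.0),          # latency benchmark
    "demographic_parity_gap": ("max", 0.05),  # bias check
}

def check_gates(metrics: dict) -> list:
    """Return gate failures; an empty list means promotion may proceed
    to human review."""
    failures = []
    for name, (direction, threshold) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value} below {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value} above {threshold}")
    return failures

ok = check_gates({"roc_auc": 0.972, "p99_latency_ms": 38.0,
                  "demographic_parity_gap": 0.02})
print("gates passed" if not ok else ok)  # gates passed
```

CI would run this check before the human Staging → Production sign-off; any non-empty result blocks the transition.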

Without Registry vs With Registry

| Concern | Without a Model Registry | With a Model Registry |
|---|---|---|
| Deployment tracking | Shared spreadsheet or institutional memory | Queryable database: which version is in prod right now |
| Rollback | Re-run training script, hope it reproduces | Instant: transition previous version back to Production |
| Compliance audit | Manual investigation across experiment logs | Complete lineage trace in seconds via API |
| Multi-team coordination | Email / Slack "hey I updated the model" | Notification webhooks, stage-change events |
| Model quality gate | Depends on individual discipline | Automated threshold checks block bad models |
| Artifact location | Hard-coded S3 paths in deploy scripts | Registry returns canonical URI for current Production version |

🔧 MLflow Model Registry

MLflow's Model Registry is the most widely deployed open-source registry. It is tightly integrated with the MLflow Tracking system — you register a model directly from a run's artifact, and the registry maintains the full link back to the run's parameters, metrics, and metadata. In MLflow 2.x, mutable aliases (such as @champion and @challenger) supersede the older fixed-stage workflow, which is deprecated.

Model Versions & Aliases

Every registration creates a new immutable version number. Aliases are mutable pointers that deployment code can follow without hardcoding version numbers:

  • Version — immutable integer (1, 2, 3…); links back to the originating run
  • @champion — alias pointing to the currently best model
  • @challenger — alias for the new candidate being A/B tested
  • Serving code loads by alias: models:/fraud-detector@champion
  • Aliases can be reassigned atomically with no serving downtime
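The alias mechanics reduce to mutable pointers over immutable versions. A toy sketch (the storage paths are made up) showing why promotion is a single atomic pointer update rather than a redeploy:

```python
# Toy registry: versions are immutable, aliases are mutable pointers.
# All names and paths here are illustrative.
registry = {
    "versions": {
        1: "s3://models/fraud-detector/1",
        2: "s3://models/fraud-detector/2",
        3: "s3://models/fraud-detector/3",
    },
    "aliases": {"champion": 2, "challenger": 3},
}

def resolve(alias: str) -> str:
    """Serving code follows an alias; it never hardcodes a version number."""
    return registry["versions"][registry["aliases"][alias]]

print(resolve("champion"))           # s3://models/fraud-detector/2
registry["aliases"]["champion"] = 3  # promotion is one atomic pointer update
print(resolve("champion"))           # s3://models/fraud-detector/3
```

Rollback is the same operation in reverse: point the alias back at the previous version.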

Serving from the Registry

MLflow can serve a registered model directly, or you can load it programmatically in your own serving layer:

  • CLI: mlflow models serve -m "models:/fraud-detector@champion"
  • Python: mlflow.sklearn.load_model("models:/fraud-detector/3")
  • Docker: mlflow models build-docker
  • Spark UDF for batch inference on DataFrames
  • Databricks Model Serving for production-grade autoscaling

MLflow Registry Architecture

The registry requires a backend store (SQL DB) and an artifact store (object storage). In production:

  • Backend store: PostgreSQL or MySQL (not SQLite — it's single-writer)
  • Artifact store: S3, GCS, Azure Blob, or NFS
  • Auth: Databricks-managed or self-hosted with reverse proxy + OIDC
  • Webhooks: trigger Jenkins/GitHub Actions on stage transitions
  • REST API: every action is available programmatically for automation

Register and Transition a Model in Python

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud-detector"

# ── Option A: Register during training (from log_model) ────────────
with mlflow.start_run() as run:
    # ... training code ...
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name=MODEL_NAME,   # auto-creates or adds version
    )
    run_id = run.info.run_id

# ── Option B: Register an existing run's artifact ─────────────────
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=MODEL_NAME,
)
version = result.version
print(f"Registered as version {version}")

# ── Wait for registration to complete ─────────────────────────────
import time
for _ in range(10):
    mv = client.get_model_version(MODEL_NAME, version)
    if mv.status == "READY":
        break
    time.sleep(2)

# ── Add descriptive metadata ──────────────────────────────────────
client.update_model_version(
    name=MODEL_NAME,
    version=version,
    description="XGBoost v3; trained on 2024-11 data; AUC=0.972, F1=0.891",
)
client.set_model_version_tag(MODEL_NAME, version, "dataset_version", "v2024-11")
client.set_model_version_tag(MODEL_NAME, version, "approved_by", "akumar")

# ── Assign Champion alias (MLflow 2.x preferred approach) ─────────
client.set_registered_model_alias(MODEL_NAME, "challenger", version)

# After shadow / A/B testing passes — promote challenger to champion
client.set_registered_model_alias(MODEL_NAME, "champion", version)

# ── Load by alias in serving code ─────────────────────────────────
champion_model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}@champion")

# ── Searching registered models ───────────────────────────────────
for mv in client.search_model_versions(f"name='{MODEL_NAME}'"):
    # description may be None for versions registered without one
    print(f"  v{mv.version}: {mv.current_stage} | {(mv.description or '')[:50]}")

# ── Legacy stage transitions (MLflow 1.x / 2.x compat) ───────────
client.transition_model_version_stage(
    name=MODEL_NAME,
    version=version,
    stage="Production",
    archive_existing_versions=True,   # archives previous Production version
)

# ── Get current Production version URI (legacy stage API) ─────────
prod_versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
prod_uri = f"models:/{MODEL_NAME}/{prod_versions[0].version}"
print(f"Production model URI: {prod_uri}")

☁️ Cloud Registries

All major cloud providers offer managed model registries deeply integrated with their ML platforms. For teams already committed to a cloud provider, these offer better security, audit logging, and integration with managed training and serving infrastructure than self-hosted MLflow.

| Platform | Registry Product | Key Integration | Notes |
|---|---|---|---|
| AWS | SageMaker Model Registry | SageMaker Pipelines, Endpoints, CodePipeline CI/CD | Model groups with approval workflows; integrates with IAM for fine-grained access control; supports multi-account deployment |
| Google Cloud | Vertex AI Model Registry | Vertex AI Pipelines, Endpoints, Feature Store | Container-first — stores Docker image URIs; evaluation metrics tracked per version; direct integration with Vertex Explainability |
| Microsoft Azure | Azure ML Model Registry | Azure ML Pipelines, Online/Batch Endpoints, Responsible AI dashboard | First-class model card support; deep integration with Azure DevOps; supports MLflow models natively |
| HuggingFace Hub | HuggingFace Model Hub | Transformers, Diffusers, PEFT, Inference API | Best for open model sharing; version via git-based repo; model cards are first-class; private repos require paid plan; Spaces for demos |
| Databricks | Unity Catalog (Models) | MLflow 2.x, Delta Lake, Feature Store, Model Serving | Unified governance across data + models in one catalogue; fine-grained ACLs; lifecycle policies for automated archival |

Choosing Between Self-Hosted MLflow and Cloud Registry

If your team is cloud-native and uses SageMaker/Vertex/Azure ML for training and serving, the cloud registry is the natural choice — the integration is seamless. If you're multi-cloud, on-premises, or want to avoid vendor lock-in, self-hosted MLflow on Kubernetes is mature, free, and widely understood. The MLflow Python client works with all registries that expose the MLflow REST API, including Databricks Unity Catalog.

📁 Artifact Management

An artifact is any file output of a training run that needs to be preserved. Artifact management is the discipline of storing these files reliably, making them addressable by content (not just path), and ensuring that the right set of artifacts is always co-located when a model is deployed.

What Counts as an Artifact

The common mistake is to store only the model weights and assume everything else can be reconstructed. This breaks deployment:

  • Model weights — pickle, ONNX, TorchScript, SavedModel, safetensors
  • Preprocessing pipeline — StandardScaler, OrdinalEncoder, imputer fitted on training data
  • Tokenizer — for NLP models; vocabulary and special tokens are model-specific
  • Feature configuration — which columns, in what order, what dtype
  • Threshold configuration — classification cutoff, confidence threshold
  • Evaluation reports — confusion matrix, PR curve, per-slice metrics as CSV/HTML
  • Model card — documentation as a YAML or Markdown artifact

Artifact Stores

An artifact store is the backing storage where actual file bytes live. The registry stores pointers; the artifact store holds the content:

  • Amazon S3 — most common; use versioning + lifecycle policies
  • Google Cloud Storage — tight Vertex AI integration
  • Azure Blob Storage — pairs with Azure ML Registry
  • MLflow managed artifacts — can back onto any of the above
  • Local NFS — on-prem option; use a distributed filesystem for HA

Content Addressing vs Path Addressing

Path addressing (s3://bucket/models/v3/model.pkl) is fragile — the file can be overwritten silently. Content addressing stores files by their hash:

  • DVC uses MD5 content hashes: same content = same storage key
  • MLflow artifact URIs include the run UUID, preventing collisions
  • S3 Object Lock prevents overwrite for compliance
  • Content-addressed storage enables deduplication across model versions
  • Enables integrity verification: re-hash at load time to detect corruption
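The deduplication and integrity properties follow directly from hashing. A minimal content-addressed store, stdlib only and purely illustrative:

```python
import hashlib
import tempfile
from pathlib import Path

class ContentStore:
    """Minimal content-addressed artifact store: files are keyed by the
    SHA-256 of their bytes, so identical content maps to one storage key."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        path = self.root / key
        if not path.exists():        # same content already stored: dedup for free
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        data = (self.root / key).read_bytes()
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError(f"artifact {key} failed integrity check")
        return data                  # re-hashed on load to detect corruption

with tempfile.TemporaryDirectory() as root:
    store = ContentStore(root)
    key = store.put(b"model weights v3")
    assert store.put(b"model weights v3") == key  # dedup: same key returned
    assert store.get(key) == b"model weights v3"  # content verified on read
```

Overwriting is impossible by construction: changed bytes produce a different key, never a silent replacement at the old one.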

Always Store the Preprocessing Pipeline With the Model

The most common source of training-serving skew is a mismatch between the preprocessing applied at training time and the preprocessing applied at inference time. If your scaler was fit on training data with mean=42.3 and std=8.1, but the serving code creates a new scaler or hardcodes different values, your model's inputs will be out of distribution and performance will silently degrade. The solution: serialise the fitted preprocessor as part of the same model artifact bundle, and always load them together.
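One way to enforce this is to make the fitted preprocessor and the model fields of a single serialised object, so serving code cannot load one without the other. A stdlib sketch with toy stand-ins for the scaler and model (the class names and weights are hypothetical; the scaler statistics reuse the example values above):

```python
import pickle

class FittedScaler:
    """Toy stand-in for a fitted StandardScaler; its statistics travel
    inside the bundle rather than being hardcoded in serving code."""
    def __init__(self, mean: float, std: float):
        self.mean, self.std = mean, std
    def transform(self, x: float) -> float:
        return (x - self.mean) / self.std

class LinearModel:
    """Toy model; real weights would come from training."""
    def __init__(self, weight: float, bias: float):
        self.weight, self.bias = weight, bias
    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

class ModelBundle:
    """One artifact: fitted preprocessor, model, and decision threshold.
    Serving can never load the model without its matching preprocessor."""
    def __init__(self, scaler, model, threshold: float):
        self.scaler, self.model, self.threshold = scaler, model, threshold
    def score(self, raw_x: float) -> bool:
        return self.model.predict(self.scaler.transform(raw_x)) >= self.threshold

bundle = ModelBundle(FittedScaler(mean=42.3, std=8.1), LinearModel(0.9, 0.1), 0.5)
blob = pickle.dumps(bundle)      # serialised together as a single artifact
restored = pickle.loads(blob)    # loaded together: no training-serving skew
```

MLflow achieves the same effect when preprocessing is part of the logged model, for example a scikit-learn Pipeline passed to `log_model`.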

Artifact Lifecycle

| Stage | Retention Policy | Storage Tier | Reason |
|---|---|---|---|
| Active experiment runs (last 30 days) | Retain all | S3 Standard | Frequent comparison and iteration |
| Staged models | Retain while in Staging + 90 days after archival | S3 Standard | Active evaluation and shadow testing |
| Production models (current + 2 prior) | Retain indefinitely | S3 Standard | Rollback capability and compliance |
| Archived models (older than 1 year) | Retain 3 years (or per regulation) | S3 Glacier Instant | Compliance audit at low cost |
| Failed / abandoned runs | Delete after 60 days | S3 Standard-IA | Cost management |

📋 Model Cards & Documentation

A Model Card (Mitchell et al., 2019) is a short structured document attached to a model that describes its intended uses, performance characteristics, and limitations. Originally a research proposal, model cards are now a regulatory expectation in financial services (EU AI Act, SR 11-7), healthcare, and hiring — and a best practice everywhere else. A model without a card is an undocumented system in production.

Model Card Sections (Mitchell et al.)

  • Model details — name, version, type, training date, contact
  • Intended use — primary use case, intended users, out-of-scope uses
  • Factors — demographic groups, environmental conditions considered
  • Metrics — performance measures used and why they were chosen
  • Evaluation data — datasets used for evaluation and their properties
  • Training data — summary of training data (privacy permitting)
  • Quantitative analyses — per-group performance disaggregation
  • Ethical considerations — risks, mitigations, limitations
  • Caveats and recommendations — edge cases, update triggers

Why Regulators Are Requiring Them

The EU AI Act (whose obligations for high-risk systems apply from August 2026) mandates technical documentation for high-risk AI systems that must be kept updated throughout the model's lifecycle.

  • EU AI Act Article 11 — technical documentation including training data description and performance metrics
  • US NIST AI RMF — model cards as a governance artefact in the "Govern" function
  • SR 11-7 (banking) — model validation documentation requirements effectively mandate model cards
  • GDPR Article 22 — right to explanation for automated decisions
  • FDA guidance — software as a medical device requires performance characterisation per subgroup

HuggingFace Model Card Format

HuggingFace established a widely adopted YAML-frontmatter format for model cards that tooling can parse. The metadata section drives model discoverability in the Hub.

  • YAML frontmatter: language, license, datasets, metrics, tags
  • Markdown body: free-form sections following Mitchell et al.
  • huggingface_hub.ModelCard Python class for programmatic creation
  • Automatic render in Hub UI with metric tables and badges
  • Training metadata auto-populated by Trainer callback

Model Card YAML Template

# model_card.yaml
# Store this as an artifact in the model registry alongside the weights

model_details:
  name: "fraud-transaction-detector"
  version: "3.2.1"
  type: "XGBoost binary classifier"
  description: >
    Detects fraudulent credit card transactions in real time.
    Trained on anonymised transaction logs from 2022-2024.
  license: "Proprietary — internal use only"
  contact: "[email protected]"
  training_date: "2024-11-15"
  framework: "XGBoost 2.0.3 / scikit-learn 1.4.0"
  registry_uri: "models:/fraud-detector@champion"

intended_use:
  primary_use: "Real-time fraud scoring at payment authorisation time"
  intended_users:
    - "Fraud operations analysts"
    - "Payment processing pipeline (automated)"
  out_of_scope:
    - "Account takeover detection (different model)"
    - "Transaction amounts > $50,000 (insufficient training data)"
    - "Cryptocurrency transactions"

training_data:
  description: "Anonymised Visa/Mastercard transactions, US market, 2022-01 to 2024-10"
  size: "142M transactions (0.18% positive class)"
  preprocessing: "SMOTE oversampling; StandardScaler on amount; frequency encoding on merchant_id"
  note: "Full schema cannot be disclosed for privacy reasons"

evaluation_data:
  description: "Held-out 20% split, stratified by fraud rate, 2024-09 to 2024-10"
  size: "28.4M transactions"

metrics:
  overall:
    roc_auc: 0.9724
    precision_at_0.5: 0.8312
    recall_at_0.5: 0.7644
    f1_at_0.5: 0.7964
    false_positive_rate: 0.0021

  per_group:
    card_type:
      visa:   { roc_auc: 0.9731, f1: 0.7998 }
      master:  { roc_auc: 0.9708, f1: 0.7901 }
    transaction_type:
      card_present:     { roc_auc: 0.9688, f1: 0.7812 }
      card_not_present: { roc_auc: 0.9751, f1: 0.8043 }

limitations:
  - "Performance degrades for merchants with < 100 historical transactions"
  - "Not calibrated for transaction amounts > $10,000"
  - "Assumes feature pipeline version >= 2.4.0; older pipelines produce different feature values"
  - "May underperform during major retail events (Black Friday) — consider threshold adjustment"

ethical_considerations:
  - concern: "Geographic disparity"
    mitigation: "Per-region performance monitoring; alert if regional FPR diverges by > 0.5%"
  - concern: "Fraud pattern evolution"
    mitigation: "Monthly PSI checks on input features; retrain trigger if PSI > 0.2"

caveats:
  - "Retrain when monthly fraud rate changes by > 15% relative to training baseline"
  - "Model card must be updated before any Production transition"
  - "Threshold may need recalibration after any retraining event"

model_card_version: "1.0"
last_updated: "2024-11-15"
approved_by: "akumar"

Automating Model Card Generation

Manually written model cards become stale. The better approach: generate the metrics section programmatically from the evaluation pipeline, then populate it into the YAML template before registering the model. This way the card is always accurate — a human writes the context sections once, and the pipeline fills in the numbers. Store the card as an MLflow artifact alongside the weights so it travels with the model wherever it goes.