📦 Why Version ML Models
Machine learning has a reproducibility problem. A model trained last Tuesday may produce different results today — even from identical source code — because of silent changes in data, library updates, or hardware differences. Versioning is the discipline of recording every input that determined a model's behaviour so you can reproduce, compare, audit, and roll back with confidence.
What Changes Between Runs
A model is the product of far more than its weights file. Six distinct dimensions can differ between any two training runs:
- Training data — rows added, labels corrected, splits reshuffled
- Feature engineering code — normalisation logic, missing-value treatment
- Hyperparameters — learning rate, depth, regularisation strength
- Library versions — PyTorch 2.1 vs 2.3 can yield different numerical results
- Random seeds — weight initialisation, data shuffle, dropout masks
- Hardware / CUDA version — GPU non-determinism in floating-point operations
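Most of these dimensions can be recorded mechanically at the start of every run. A standard-library sketch — the field names are illustrative, and the tracking tools covered later automate much of this:

```python
import hashlib
import json
import platform
import sys

# Tiny stand-in dataset so the sketch is self-contained
with open("train.csv", "w") as f:
    f.write("x,y\n1,0\n2,1\n")

def run_fingerprint(data_path: str, seed: int) -> dict:
    """Capture the inputs that can silently differ between training runs."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "data_sha256": data_hash,          # training data content
        "python": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),   # OS / hardware hint
        "seed": seed,                      # random seed used everywhere
        # library versions can be added via importlib.metadata.version(...)
    }

fp = run_fingerprint("train.csv", seed=42)
print(json.dumps(fp, indent=2))
```

Two runs with identical fingerprints should be comparable; any difference in results then points at the remaining non-determinism (hardware, unseeded ops).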
A Model Is Not Just Weights
When you save model.pkl, you have captured the trained estimator — but not the context that produced it. A production-ready model version must include:
- The preprocessing pipeline (scaler, encoder, imputer)
- The exact training dataset fingerprint (hash or DVC pointer)
- The git commit SHA of the training code
- The conda/pip environment specification
- Performance metrics on a held-out evaluation set
- The hyperparameter configuration used
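One lightweight way to keep this context together is a manifest file saved next to the weights. A sketch — the field names and values below are examples, not a standard:

```python
import json

# Illustrative manifest: everything a production model version should pin
# besides the weights file itself. All values are placeholders.
manifest = {
    "weights_file": "model.pkl",
    "preprocessing_pipeline": "pipeline.pkl",         # scaler / encoder / imputer
    "dataset_pointer": "data/train.csv.dvc",          # DVC pointer or content hash
    "git_commit": "a3f9c12",                          # training-code commit SHA
    "environment_spec": "environment.yml",            # conda/pip specification
    "metrics": {"roc_auc": 0.962, "f1_macro": 0.91},  # held-out evaluation
    "hyperparameters": {"n_estimators": 200, "max_depth": 8},
}

# Ship the manifest alongside the weights so the context travels with them
with open("model_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```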
Why Versioning Matters in Practice
Beyond academic reproducibility, versioning solves real operational problems that arise as soon as models reach production:
- Rollback — new version degrades KPIs; instantly revert to the last known-good version
- Compliance & auditing — regulators may ask which model made a credit decision in Q2 2024
- Debugging — compare two model versions to isolate what changed and why performance shifted
- Collaboration — multiple researchers experiment in parallel without overwriting each other's results
- Cost attribution — understand which experiment consumed how much compute
The Reproducibility Crisis in ML
A 2019 survey of NeurIPS and ICML papers found that fewer than 15% of published results could be independently reproduced. In industry settings, teams regularly cannot reproduce their own models three months later. Without disciplined versioning, every "improvement" is built on a foundation you cannot verify.
📊 Experiment Tracking
Experiment tracking answers the question: "Of all the runs I've done, which one should I promote to production, and why?" A tracking system logs every configurable aspect of a run and provides a UI or API for comparing runs at scale.
What to Log in Every Run
Parameters
Inputs you controlled before training began. These are scalar or small config values — not the data itself.
- Learning rate, batch size, number of epochs
- Model architecture (layers, hidden units, dropout rate)
- Regularisation coefficients (L1/L2 lambda)
- Feature selection decisions, preprocessing flags
- Random seed values (global and per-library)
Metrics
Outputs you measure during or after training. Good tracking records metrics at each step so you can plot learning curves, not just final values.
- Training loss per batch/epoch
- Validation loss and primary task metric (AUC, F1, RMSE)
- Evaluation on held-out test set after training ends
- Business proxy metrics (precision@K, revenue proxy)
- Training time, GPU memory usage, carbon footprint
Artifacts & Environment
The heavyweight or non-scalar outputs that must be stored by reference rather than value.
- Model weights (pickle, ONNX, SavedModel)
- Preprocessing pipeline object
- Confusion matrix, PR curve plots
- Git commit SHA of training script
- pip freeze / conda env YAML
- DVC pointer to training dataset version
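Stripped of its UI, the core comparison query a tracker answers is a sort over logged records. A toy sketch with made-up run data:

```python
# Each run's logged params and metrics, shaped as a tracking backend
# would return them (run IDs and values are invented for illustration).
runs = [
    {"run_id": "r1", "params": {"lr": 0.1},   "metrics": {"val_auc": 0.93}},
    {"run_id": "r2", "params": {"lr": 0.01},  "metrics": {"val_auc": 0.96}},
    {"run_id": "r3", "params": {"lr": 0.001}, "metrics": {"val_auc": 0.95}},
]

# "Which run should I promote?" is a max over the validation metric
best = max(runs, key=lambda r: r["metrics"]["val_auc"])
print(best["run_id"], best["metrics"]["val_auc"])  # → r2 0.96
```

Real trackers add the crucial extras: the metric history per step, the artifacts, and the environment metadata needed to actually reproduce `r2`.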
Tool Comparison
| Tool | Hosting | Key Features | Free Tier |
|---|---|---|---|
| MLflow | Self-hosted (open source); Databricks managed | Runs, experiments, model registry, serving, autolog, plugin ecosystem | Fully open source; Databricks Community free |
| Weights & Biases | SaaS (wandb.ai); on-prem available | Real-time charts, hyperparameter sweeps, artifact versioning, collaboration, reports | Free for individuals; 100 GB storage |
| Neptune.ai | SaaS; self-hosted Enterprise | Metadata store, comparison tables, integrations with 25+ frameworks | Free tier: 200 hours compute monitoring |
| Comet ML | SaaS; on-prem | Code diff between runs, data panels, model production monitoring | Free for open-source and academic |
| ClearML | Self-hosted or SaaS | Full MLOps platform: experiment tracking, pipelines, data management, orchestration | Open source community edition |
Weights & Biases Sweeps
W&B Sweeps automate hyperparameter search. You define a sweep config (search space + strategy: grid, random, or Bayesian), and W&B agents distributed across machines pull the next config to try. All runs are automatically grouped, and W&B surfaces the best configuration with importance analysis showing which hyperparameters mattered most.
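A sweep is declared in a small YAML config. A sketch — the program name, parameter names, and ranges below are illustrative:

```yaml
# sweep.yaml — illustrative W&B sweep configuration (Bayesian search)
program: train.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
  dropout:
    values: [0.2, 0.3, 0.5]
```

Register it with `wandb sweep sweep.yaml`, then start `wandb agent <sweep-id>` on any number of machines; each agent pulls its next configuration from the server.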
🔬 MLflow Deep Dive
MLflow is the most widely deployed open-source experiment tracking system. It consists of four components: Tracking (logging API + UI), Projects (reproducible code packaging), Models (generic model format), and Model Registry (lifecycle management covered in the next page). The Tracking component is where most practitioners spend 90% of their time.
Core Tracking Concepts
- Experiment — a named group of runs, typically one per project or task
- Run — a single execution of your training script with specific params
- Run ID — a UUID that uniquely identifies a run for reproducibility
- mlruns/ — local directory where tracking data is stored by default
- Tracking Server — optional remote server backed by a DB + artifact store
Autolog — Zero-Code Instrumentation
MLflow's autolog intercepts framework callbacks to log params, metrics, and models with a single line. Supported frameworks include:
- scikit-learn — params, CV results, feature importance
- PyTorch Lightning — epoch metrics, model checkpoints
- TensorFlow/Keras — epoch metrics, model summary
- XGBoost / LightGBM — tree counts, feature importance
- Transformers (HuggingFace) — trainer metrics, tokenizer
MLflow UI Overview
Launch with mlflow ui (default port 5000). The UI provides:
- Experiment list with run counts and last-modified timestamps
- Run comparison table — sort by any metric
- Parallel coordinates chart for hyperparameter exploration
- Metric history charts (per-step curves)
- Artifact browser — view plots, model files, configs inline
- Model Registry tab for lifecycle management
Full MLflow Experiment Logging Example
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.datasets import load_breast_cancer

# ── Configuration ──────────────────────────────────────────────────
EXPERIMENT_NAME = "breast-cancer-classification"
RANDOM_SEED = 42

params = {
    "n_estimators": 200,
    "max_depth": 8,
    "min_samples_split": 5,
    "max_features": "sqrt",
    "random_state": RANDOM_SEED,
}

# ── Data ───────────────────────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

# ── MLflow Setup ───────────────────────────────────────────────────
# Point at a remote tracking server instead of local ./mlruns:
# mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run(run_name="rf-baseline-v1") as run:
    # Log every hyperparameter before training
    mlflow.log_params(params)

    # Log environment metadata as tags (searchable, not plotted)
    mlflow.set_tags({
        "author": "akumar",
        "dataset_version": "v2024-11",
        "git_commit": "a3f9c12",  # subprocess.check_output(["git", "rev-parse", "HEAD"])
        "framework": "scikit-learn",
    })

    # ── Training ───────────────────────────────────────────────────
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)

    # ── Evaluation ─────────────────────────────────────────────────
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }
    mlflow.log_metrics(metrics)

    # Log feature importances as an artifact (CSV)
    fi = pd.DataFrame({
        "feature": load_breast_cancer().feature_names,
        "importance": clf.feature_importances_,
    }).sort_values("importance", ascending=False)
    fi.to_csv("/tmp/feature_importance.csv", index=False)
    mlflow.log_artifact("/tmp/feature_importance.csv", artifact_path="reports")

    # Log the model — also records an input/output schema for serving
    signature = mlflow.models.infer_signature(X_train, clf.predict(X_train))
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="breast-cancer-rf",  # auto-registers in Registry
    )

    print(f"Run ID: {run.info.run_id}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")

# ── Querying runs programmatically ─────────────────────────────────
runs = mlflow.search_runs(
    experiment_names=[EXPERIMENT_NAME],
    filter_string="metrics.roc_auc > 0.95",
    order_by=["metrics.roc_auc DESC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.roc_auc"]].head())
Production Tracking Server Setup
For team use, run MLflow with a PostgreSQL backend store and an S3 artifact store: mlflow server --backend-store-uri postgresql://user:pass@host/mlflow --default-artifact-root s3://my-bucket/mlflow. This separates metadata (fast DB queries) from large artifacts (object storage), which is the right architecture for teams running thousands of experiments.
🗂️ Git for ML — DVC
Git was designed for text files. A 50 GB training dataset or a 7 GB model checkpoint will overwhelm any git repository. DVC (Data Version Control) extends git's paradigm to large files and ML pipelines: you store small pointer files in git while DVC manages the actual bytes in remote storage (S3, GCS, Azure Blob, SSH, etc.). The result is that checkout, branch, and merge semantics work the same way for data as they do for code.
DVC Core Concepts
- .dvc files — tiny pointer files committed to git; contain an MD5 hash and path
- Remote storage — configured per-repo; stores actual file bytes by content hash
- dvc add — tracks a large file, writes the .dvc pointer, gitignores the original
- dvc push/pull — sync files between local cache and remote storage
- dvc repro — re-run only the pipeline stages that are out of date
- dvc diff — compare datasets between git commits
DVC Pipelines
DVC pipelines (dvc.yaml) define a DAG of stages: each stage has deps (inputs), outs (outputs), and a command to run. DVC tracks file hashes to determine if a stage needs to be re-run, similar to Makefiles but with data awareness.
- Stages: prepare → featurize → train → evaluate
- Params tracked from a YAML config file
- Metrics and plots are first-class outputs
- dvc dag renders the pipeline graph
- Integrates with CI/CD via GitHub Actions
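The tracked params live in a plain params.yaml committed to git. A fragment with illustrative values — changing a tracked value marks the dependent stage stale on the next dvc repro:

```yaml
# params.yaml — values read by the train stage of dvc.yaml
train:
  lr: 0.01
  max_depth: 8
```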
Tagging Model Releases in Git
Complement DVC with git tags to mark production model releases:
- Tag the commit whose .dvc file points to the production weights: git tag -a v1.2.0-model -m "AUC 0.962, trained on 2024-11 data"
- Use annotated tags (not lightweight) for release notes
- Push tags to origin: git push origin --tags
- Rollback: git checkout v1.1.0-model && dvc pull
DVC Workflow: Dataset + Model Versioning
# ── Initial DVC setup ──────────────────────────────────────────────
git init my-ml-project && cd my-ml-project
dvc init # creates .dvc/ config directory
git add .dvc && git commit -m "init dvc"
# Configure remote storage (S3 example)
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc remote modify myremote region us-east-1
git add .dvc/config && git commit -m "add s3 remote"
# ── Tracking a dataset ─────────────────────────────────────────────
# Download / prepare your raw data
mkdir data && cp /mnt/nas/dataset_v1.csv data/
dvc add data/dataset_v1.csv # creates data/dataset_v1.csv.dvc
git add data/dataset_v1.csv.dvc data/.gitignore
git commit -m "track dataset v1 with DVC"
dvc push # uploads bytes to S3
# ── Defining a pipeline (dvc.yaml) ────────────────────────────────
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare.py --input data/dataset_v1.csv --output data/prepared/
    deps:
      - src/prepare.py
      - data/dataset_v1.csv
    outs:
      - data/prepared/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    params:
      - params.yaml:
          - train.lr
          - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/eval.json:
          cache: false
EOF
# ── Running the pipeline ───────────────────────────────────────────
dvc repro # runs only stale stages
git add dvc.lock dvc.yaml metrics/eval.json
git commit -m "train: AUC 0.951, lr=0.01"
dvc push # push model artifact to S3
# ── Updating the dataset (v2) ──────────────────────────────────────
cp /mnt/nas/dataset_v2.csv data/dataset_v1.csv # in-place update
dvc add data/dataset_v1.csv # updates the .dvc file hash
git add data/dataset_v1.csv.dvc
git commit -m "track dataset v2 (Nov 2024 batch)"
dvc repro # automatically re-trains with new data
# ── Rollback to v1 model ───────────────────────────────────────────
git checkout HEAD~2 -- data/dataset_v1.csv.dvc dvc.lock  # pipeline outputs are tracked in dvc.lock, not .dvc files
dvc pull # fetches the v1 bytes from S3
# Your working directory now has the v1 model and dataset
Git-LFS vs DVC
Git LFS (Large File Storage) is simpler to set up but treats large files as opaque blobs — it doesn't understand ML pipelines, can't diff datasets semantically, and has no concept of stages or experiments. DVC is the right choice for ML projects because it understands the relationship between data, code, params, and model output as a first-class concept.
✅ Reproducibility Best Practices
Reproducibility is not binary — it exists on a spectrum from "roughly similar results" to "bit-for-bit identical outputs." For most ML teams, the practical goal is statistically reproducible: re-running a training script on the same data and code produces results within the expected random variation of the training process. The following practices close the gap between "it worked once" and "it always works."
Pin All Dependency Versions
Unpinned ("floating") dependencies are the silent killer of reproducibility: a library author pushes a bug fix that changes numerical behaviour, and your model silently changes with it.
- Generate pip freeze > requirements.txt after every successful experiment
- Pin exact versions: torch==2.3.1, not torch>=2.0
- Use conda env export > environment.yml for CUDA/cuDNN pinning
- Consider Poetry or uv for deterministic lock files
- Log the resolved environment in every MLflow run as an artifact
Set Random Seeds Everywhere
Many sources of randomness must be seeded independently — setting just Python's random seed is not enough:
- random.seed(42) — Python stdlib
- np.random.seed(42) — NumPy operations
- torch.manual_seed(42) — PyTorch CPU ops
- torch.cuda.manual_seed_all(42) — GPU ops
- torch.backends.cudnn.deterministic = True
- Set the PYTHONHASHSEED=42 environment variable
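The seeds above can be set in one helper called at the top of every training script. A sketch, with the NumPy and PyTorch calls guarded so it also runs where those libraries are absent:

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness in one place."""
    # Note: only affects subprocesses; set PYTHONHASHSEED before launch
    # if this process's own hash randomisation must be fixed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)                          # Python stdlib
    try:
        import numpy as np
        np.random.seed(seed)                   # NumPy
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)                # PyTorch CPU ops
        torch.cuda.manual_seed_all(seed)       # all GPUs
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass

seed_everything(42)
first_draw = random.random()
seed_everything(42)
assert random.random() == first_draw  # re-seeding reproduces the draw
```

Log the seed value itself with every run; a seed you cannot recover is as bad as no seed.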
Docker for Environment Capture
A Docker image is the ultimate reproducibility artefact — it captures the OS, system libraries, CUDA toolkit, Python version, and all packages at the byte level.
- Build a training image tagged with the git commit SHA
- Push image to a registry (ECR, GCR) so it's retrievable years later
- Store the image digest in MLflow run tags
- Use --platform linux/amd64 for cross-platform reproducibility
- Multi-stage builds: separate train image from lean serving image
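A minimal sketch of a training image — the base image, paths, and tag scheme are illustrative:

```dockerfile
# Build tagged with the training-code commit, e.g.:
#   docker build -t train:$(git rev-parse --short HEAD) .
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt  # pinned versions only
COPY src/ src/
ENTRYPOINT ["python", "src/train.py"]
```

Pushing this image to a registry and recording its digest in the run's tags means the exact byte-level environment is retrievable long after the build host is gone.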
Log Hardware Specs
GPU architecture and driver version affect floating-point results at the bit level. Log hardware context as run metadata:
- nvidia-smi --query-gpu=name,driver_version --format=csv for GPU name and driver
- CUDA version: torch.version.cuda
- cuDNN version: torch.backends.cudnn.version()
- CPU model and core count for non-GPU training
- Available RAM — affects shuffle buffer sizes and caching
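These facts can be gathered in code and attached as run metadata. A standard-library sketch, with the torch lookups guarded since the library may be absent:

```python
import os
import platform

def hardware_context() -> dict:
    """Collect the hardware facts worth logging with every run."""
    ctx = {
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),
        "os": platform.platform(),
    }
    try:
        import torch
        ctx["cuda"] = torch.version.cuda                 # None on CPU builds
        ctx["cudnn"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            ctx["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return ctx

print(hardware_context())
```

The resulting dict can be passed to, for example, mlflow.set_tags() so the hardware context is searchable alongside the run.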
A Result You Cannot Reproduce Is a Result You Cannot Trust
If you cannot re-run an experiment and get results within the expected variance of the stochastic training process, that result has no scientific or engineering value. Before deploying any model to production, a second engineer should be able to check out the tagged commit, run dvc pull && python train.py, and arrive at a model within 0.5% of the reported metric. If they cannot, the model is not production-ready regardless of its headline accuracy.
Reproducibility Checklist
| Category | Check | Tool | Priority |
|---|---|---|---|
| Data | Dataset version tracked with content hash | DVC | Critical |
| Data | Train/val/test splits deterministic (seeded) | scikit-learn, DVC | Critical |
| Code | Git commit SHA logged with every run | MLflow tags, W&B | Critical |
| Params | All hyperparameters logged (not hardcoded) | MLflow, config YAML | Critical |
| Environment | Exact pip/conda dependency versions pinned | pip freeze, Poetry | Critical |
| Randomness | All random seeds set and logged | Manual + MLflow | High |
| Environment | Docker image built and pushed with git SHA tag | Docker, ECR/GCR | High |
| Hardware | GPU/CUDA version logged as run metadata | MLflow tags | Medium |
| Artifacts | Preprocessing pipeline stored alongside weights | MLflow artifacts | Critical |