Model Deployment & Inference

Taking a trained model from notebook to production

Tags: MLOps · Serving · Docker · Monitoring · Model Drift
ML Fundamentals Series · 9 min read · Intermediate · Updated Jan 2025

Model Serialization

Before a model can be deployed, it must be serialized — saved from its in-memory representation to a persistent format on disk or in object storage. The choice of serialization format affects portability, security, performance, and cross-language compatibility.

  • Pickle (.pkl / .pickle): Python's native serialization. Works for any Python object. Quick to use but Python- and version-specific. Widely used for scikit-learn models.
  • Joblib (.joblib): Scikit-learn's preferred format. Efficient for large NumPy arrays. Supports memory-mapped loading for large models. Python-only.
  • ONNX (.onnx): Open Neural Network Exchange. Cross-framework, cross-language. Export from PyTorch/TF, run in C++, Java, JavaScript, or any ONNX Runtime. Optimized for inference.
  • SavedModel (directory): TensorFlow/Keras native format. Saves weights, graph, and signatures. Deployable via TF Serving. Includes preprocessing functions in the saved artifact.
  • TorchScript (.pt): PyTorch's production format. Compiles the model to an intermediate representation. Runs without Python via libtorch in C++ or mobile runtimes.
  • PMML (.pmml): XML-based format for classical ML models. Supported by many enterprise platforms. Human-readable. Good for regulatory/auditability use cases.
  • SafeTensors (HuggingFace): Modern format for LLMs and transformers. Safer than pickle, supports memory-mapping. Becoming the standard for sharing large language models.
  • Core ML (.mlmodel): Apple's on-device ML format. Optimized for the iOS/macOS Neural Engine. Convert from PyTorch or scikit-learn with Core ML Tools.

Security Warning: Never Load Untrusted Pickle Files

Python pickle files can execute arbitrary code when deserialized. Loading a pickle file from an untrusted source is equivalent to running a shell script from that source. Never unpickle model files downloaded from public repositories without verification. For sharing models publicly, use ONNX, SafeTensors, or PMML — formats that cannot execute arbitrary code during loading.
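The danger is easy to demonstrate with the standard library alone: any object can define `__reduce__`, and whatever callable it returns is executed during `pickle.loads`. A minimal sketch (the shell command here is a harmless `echo`):

```python
import os
import pickle

# A "model file" that is actually an exploit: pickle stores instructions,
# not just data, and __reduce__ lets any object inject a call of its choosing.
class NotAModel:
    def __reduce__(self):
        # Runs when the file is UNpickled -- could just as easily be
        # `rm -rf ~` or a reverse shell instead of a harmless echo.
        return (os.system, ("echo pwned",))

payload = pickle.dumps(NotAModel())

# The victim just "loads a model" -- and the shell command runs:
result = pickle.loads(payload)
```

Note that no `NotAModel` instance is even returned; `pickle.loads` hands back whatever the injected call produced. Formats like ONNX, SafeTensors, and PMML avoid this by storing pure data with no execution hook.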

Format Selection Guide

| Scenario | Recommended Format | Key Reason |
| --- | --- | --- |
| Internal scikit-learn model | Joblib | Efficient for NumPy arrays; simple; Python ecosystem |
| Cross-language / production deployment | ONNX | Language-agnostic; hardware-optimized inference runtimes |
| TensorFlow production serving | SavedModel + TF Serving | Native format; supports batching, versioning, warm-up |
| PyTorch production (C++ backend) | TorchScript | Runs without Python interpreter |
| Sharing LLMs publicly | SafeTensors | Security; fast loading; HuggingFace ecosystem |
| Mobile / edge deployment | TFLite / Core ML / ONNX Mobile | Optimized for limited compute and memory |

Serving Patterns

How you serve predictions depends on latency requirements, throughput needs, compute budget, and where the request originates. The four primary patterns span a spectrum from real-time to deferred to embedded inference.

REST API (Online Serving)

Expose the model as an HTTP endpoint. Clients send a JSON payload with input features; the server responds with predictions synchronously. This is the most common production pattern for interactive applications requiring low-latency responses.

Frameworks: FastAPI (Python, async, auto OpenAPI docs), Flask (simple, synchronous), TF Serving (TensorFlow-native, gRPC + HTTP), Triton Inference Server (NVIDIA, multi-framework, batching), Ray Serve (Python, scalable), BentoML (packaging + serving).
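A minimal sketch of the pattern using only the standard library (a production service would typically use FastAPI or another framework above; the weights and feature layout here are invented for illustration):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a loaded model: a toy linear scorer with invented weights.
WEIGHTS = [0.5, -0.2, 1.1]
BIAS = 0.1

def predict(features):
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            score = predict(payload["features"])
        except (KeyError, ValueError, TypeError):
            # Reject malformed payloads instead of crashing the worker.
            self.send_error(400, 'expected JSON like {"features": [1.0, 2.0, 3.0]}')
            return
        body = json.dumps({"prediction": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

# To serve: HTTPServer(("", 8000), PredictHandler).serve_forever()
```

The client sends `POST /predict` with a JSON body and gets a synchronous JSON prediction back, which is the essential contract regardless of framework.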

Batch Inference

Process large volumes of data offline on a schedule (hourly, daily, weekly). Predictions are written to a database or file store where downstream systems consume them. No real-time latency requirement.

Ideal when predictions can be precomputed: e-commerce product scoring, content recommendation pre-ranking, overnight risk scoring, fraud review queues.
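A batch job is often little more than read, score, write. A toy sketch assuming a CSV input with invented column names and a stand-in linear scorer:

```python
import csv

# Toy model standing in for model.predict; columns and weights are invented.
WEIGHTS = {"recency": -0.3, "frequency": 0.8, "monetary": 0.5}

def score(row):
    return sum(w * float(row[col]) for col, w in WEIGHTS.items())

def run_batch(in_path, out_path):
    # Score every input row and write predictions for downstream consumers
    # (a real job would read from and write to a warehouse or object store).
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=["customer_id", "score"])
        writer.writeheader()
        for row in reader:
            writer.writerow({"customer_id": row["customer_id"],
                             "score": f"{score(row):.4f}"})
```

Scheduling is handled outside the job itself, typically by cron, Airflow, or a similar orchestrator.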

Streaming Inference

Consume predictions as events flow through a data pipeline in near-real-time. Integrated with message queues (Kafka, Kinesis, Pub/Sub). The model is applied to each event as it arrives, with results published downstream.

Used for: fraud detection on payment events, real-time content moderation, IoT anomaly detection, live ad bidding pipelines.
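The shape of a streaming consumer can be sketched with a plain Python iterator standing in for the Kafka/Kinesis client; a rolling-window z-score stands in for the model, and the thresholds are illustrative:

```python
from collections import deque

def is_anomalous(value, history, k=3.0):
    # Flag the event if it sits more than k standard deviations from the
    # rolling mean; wait for a minimum history before flagging anything.
    if len(history) < 10:
        return False
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    return abs(value - mean) > k * max(var ** 0.5, 1e-9)

def stream_inference(events, window=100):
    # `events` stands in for a message-queue consumer; in production the
    # (value, flag) results would be published to an output topic instead
    # of being yielded to the caller.
    history = deque(maxlen=window)
    for value in events:
        yield value, is_anomalous(value, history)
        history.append(value)
```

The key property is that each event is scored as it arrives, with bounded state, rather than accumulated into a batch.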

Edge / On-Device Deployment

Run the model entirely on the user's device (phone, browser, embedded system). No network round-trip; predictions are private; works offline. Requires aggressive model compression.

Used for: keyboard next-word prediction, on-device wake word detection (Alexa/Siri), autonomous vehicle perception, medical device diagnostics.

Infrastructure Options

Choosing where to run your model involves balancing cost, operational burden, latency, scalability, and team expertise. Options range from fully managed platforms to self-managed containers.

| Option | Description | Ops Burden | Cost | Best For |
| --- | --- | --- | --- | --- |
| Docker + VM | Package model + runtime in a container, run on a single VM | High (you manage everything) | Low | Small teams, prototypes, self-hosted environments |
| Kubernetes (K8s) | Orchestrate model containers across a cluster with auto-scaling, rolling updates | High (cluster management) | Moderate | Large-scale production, multi-model serving, internal platforms |
| AWS SageMaker | Fully managed ML platform: train, tune, deploy, monitor in one service | Low | High | AWS shops wanting an end-to-end managed ML lifecycle |
| Google Vertex AI | GCP's managed ML platform with AutoML, custom training, online/batch prediction | Low | High | GCP shops, BigQuery data, integration with the GCP data stack |
| Azure ML | Microsoft's managed ML platform with designer, pipelines, and model registry | Low | High | Azure shops, enterprise environments with existing Microsoft contracts |
| Serverless Functions | AWS Lambda, Google Cloud Functions: spin up on demand, no always-on server | Very low | Pay-per-request | Low-frequency requests; lightweight models; cost-sensitive workloads |
| Managed Inference APIs | Replicate, Modal, Banana, Runpod: deploy a Docker image, get an endpoint | Minimal | Per-second GPU billing | GPU models without K8s overhead; fast prototyping |

Docker Best Practices for ML Serving

Pin exact versions of all packages (Python, CUDA, framework) in your Dockerfile. Use multi-stage builds to separate the training environment from the lean serving image. Include a health-check endpoint. Set memory limits to prevent OOM crashes from taking down the host. Copy model weights into the image rather than fetching at runtime to make deployments deterministic.
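A multi-stage Dockerfile following these practices might look like the sketch below (image tags, file paths, and the uvicorn entrypoint are illustrative, not prescriptive):

```dockerfile
# Build stage: full toolchain for resolving and compiling dependencies.
# Pin the exact base image tag; requirements.txt pins every package version.
FROM python:3.11.9-slim AS build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Serving stage: lean runtime image with only what inference needs.
FROM python:3.11.9-slim
COPY --from=build /install /usr/local
# Bake the model weights into the image so deployments are deterministic.
COPY model/model.joblib /app/model/model.joblib
COPY app/ /app/
WORKDIR /app
# Health check against a /health endpoint the app is assumed to expose.
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Memory limits are set at run time (`docker run --memory=2g ...`) or in the orchestrator's resource spec rather than in the Dockerfile itself.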

Right-Sizing Your Infrastructure

Start simple. A FastAPI app in a single Docker container handles thousands of requests per second for small models. Don't add Kubernetes until you have real scaling needs that Docker Compose or a single VM cannot handle. Premature orchestration complexity kills productivity. Scale when the load demands it, not when the architecture anticipates it.

Monitoring in Production

Deploying a model is not the end of the work — it's the beginning of a maintenance responsibility. Real-world data evolves, user behavior changes, and the distribution your model was trained on diverges from what it sees in production. Without monitoring, you won't know when your model has silently degraded.

Data Drift

Definition: The statistical distribution of input features changes over time, even if the relationship between features and target remains the same.

Example: A demand forecasting model trained on pre-pandemic shopping patterns sees completely different seasonality patterns after the pandemic. The input features (day of week, promotions) are the same but their distributions shifted.

Detection: Track feature distribution statistics (mean, std, quantiles) over time. Use statistical tests like Kolmogorov-Smirnov or Population Stability Index (PSI) to flag significant shifts.
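PSI is straightforward to compute by hand. A sketch that bins the live sample against bin edges derived from the baseline (the conventional reading is PSI < 0.1 stable, 0.1–0.2 moderate shift, > 0.2 significant shift):

```python
import math

def psi(expected, actual, bins=10):
    # Population Stability Index between a baseline (training/reference)
    # sample and a live production sample of one feature.
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Index = number of edges at or below x; last bin catches overflow.
            counts[sum(e <= x for e in edges)] += 1
        # Floor at a tiny value so empty bins don't produce log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p, q))
```

Identical distributions give PSI of exactly zero; a large location shift empties baseline bins and drives the index well past the 0.2 alert threshold.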

Concept Drift

Definition: The relationship between input features and the target variable changes over time. The model's learned mapping is no longer correct even on correctly distributed data.

Example: A fraud detection model trained before a new fraud technique emerged performs well on old patterns but misses the new attack vector — the meaning of "fraudulent" transaction has changed.

Detection: Monitor prediction accuracy on labeled data as labels become available (delayed ground truth). Track prediction score distributions for unexpected shifts.
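A sliding-window accuracy monitor over delayed labels can be sketched in a few lines (window size and margin are illustrative):

```python
from collections import deque

class AccuracyMonitor:
    # Tracks accuracy over a sliding window of delayed ground-truth labels
    # and flags possible concept drift when it falls below the deployment
    # baseline minus a tolerance margin.
    def __init__(self, baseline, window=500, margin=0.05):
        self.baseline = baseline
        self.margin = margin
        self.hits = deque(maxlen=window)

    def record(self, prediction, true_label):
        self.hits.append(prediction == true_label)

    @property
    def accuracy(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def drifted(self, min_samples=100):
        # Require a minimum sample before alerting to avoid noisy flags.
        acc = self.accuracy
        return (acc is not None and len(self.hits) >= min_samples
                and acc < self.baseline - self.margin)
```

In practice `record` would be fed by a join between the prediction log and labels arriving days or weeks later.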

Production Monitoring Metrics & Strategies

| Metric Category | What to Track | Tools | Alert When |
| --- | --- | --- | --- |
| System Metrics | Latency (p50/p95/p99), throughput (req/s), error rate, CPU/GPU/memory utilization | Prometheus, Grafana, Datadog, CloudWatch | p95 latency > 2× baseline; error rate > 1% |
| Prediction Metrics | Output score distribution, class balance of predictions, null/invalid output rate | Evidently AI, Arize, Fiddler, custom logging | Score distribution PSI > 0.2; null rate increases |
| Input Data Quality | Feature nulls, out-of-range values, schema violations, cardinality explosions | Great Expectations, Evidently, custom validators | Null rate increases by > 5% on any feature |
| Business Metrics | Click-through rate, conversion rate, revenue impact, downstream KPIs | BI dashboards, A/B testing platform | KPI degrades significantly from baseline |
| Ground Truth Accuracy | Model accuracy as true labels become available (may be days/weeks delayed) | Custom pipelines, Arize, WhyLabs | Accuracy drops > 5% from deployment baseline |

MLOps Best Practices

MLOps applies DevOps principles to the ML lifecycle — bringing automation, reproducibility, and reliable delivery to a domain that has historically been artisanal and notebook-driven. Mature MLOps practices dramatically reduce the time from model development to reliable production deployment.

Model Versioning & Registry

Every model artifact should be versioned with full provenance: which training data, which code commit, which hyperparameters produced it. A model registry (MLflow, Weights & Biases, SageMaker Registry, Vertex Model Registry) stores these artifacts with metadata.

  • Tag models with training date, dataset version, and git commit hash
  • Maintain three stages: Staging, Production, Archived
  • Never overwrite a production model — create a new version and promote it
  • Store evaluation metrics alongside the artifact for comparison
  • Enable one-click rollback to any previous production version
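In lieu of a full registry product, the provenance idea can be sketched as a JSON sidecar per version (the field names and registry layout are invented for illustration; MLflow and the cloud registries store the same kind of record):

```python
import hashlib
import json
from pathlib import Path

def register_model(model_path, dataset_version, git_commit, metrics,
                   registry_dir="registry"):
    # Record full provenance next to a content hash of the artifact, so any
    # deployed model can be traced back to the data and code that built it.
    reg = Path(registry_dir)
    reg.mkdir(parents=True, exist_ok=True)
    record = {
        "artifact_sha256": hashlib.sha256(Path(model_path).read_bytes()).hexdigest(),
        "dataset_version": dataset_version,
        "git_commit": git_commit,
        "metrics": metrics,
        "stage": "Staging",   # promoted to Production, never overwritten
    }
    # New versions are appended; existing versions are immutable.
    version = len(list(reg.glob("v*.json"))) + 1
    (reg / f"v{version}.json").write_text(json.dumps(record, indent=2))
    return version, record
```

Because every version is immutable and hashed, rollback is just redeploying an older artifact whose record still matches its bytes.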

CI/CD for Machine Learning

Automated pipelines that run on every commit to validate, evaluate, and (optionally) deploy model changes. Catches regressions before they reach production.

  • Run unit tests on data preprocessing and feature engineering code
  • Gate on model performance: new version must meet or beat the current production model's metrics
  • Automate retraining on fresh data using tools like Airflow, Kubeflow, or GitHub Actions
  • Build and push a new Docker image on every training run
  • Deploy to staging automatically; require manual approval for production promotion
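The performance gate above reduces to a comparison against the production model's recorded metrics. A sketch with invented metric names:

```python
def passes_gate(candidate, production,
                higher_is_better=("auc", "accuracy"),
                lower_is_better=("latency_ms",)):
    # The candidate must meet or beat production on every tracked metric;
    # a single regression blocks promotion.
    for m in higher_is_better:
        if candidate[m] < production[m]:
            return False
    for m in lower_is_better:
        if candidate[m] > production[m]:
            return False
    return True
```

A CI pipeline would pull `production` from the model registry and fail the build when `passes_gate` returns False.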

A/B Testing & Shadow Mode

Before fully switching to a new model, validate it against the current production model on live traffic.

  • Shadow mode — New model runs in parallel, receiving the same traffic, but its predictions are logged, not served to users. Compare performance without any user exposure risk.
  • Canary release — Route 5–10% of live traffic to the new model. Monitor metrics before widening rollout.
  • A/B test — Randomize users between old and new model. Measure business metrics with statistical significance before fully switching.
  • Always define success criteria before starting an experiment
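Canary and A/B assignment both need a deterministic, stable split so a given user always sees the same variant. A common sketch is hashing the user ID into buckets (the salt ties the assignment to one specific experiment):

```python
import hashlib

def assign_variant(user_id, canary_pct=10, salt="model-v2-canary"):
    # Deterministic: the same user always lands in the same bucket, keeping
    # the user experience stable and the experiment's units independent.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_pct else "production"
```

Widening the rollout is just raising `canary_pct`; because the hash is stable, users already on the canary stay on it.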

Rollback & Incident Response

Production models will occasionally fail. Having a fast, tested rollback procedure is as important as a good deployment procedure.

  • Maintain a "last known good" model in the registry — always deployable in <5 minutes
  • Implement feature flags or traffic controls to cut over instantly
  • Define model "circuit breaker" thresholds — auto-rollback if error rate exceeds threshold
  • Document a runbook for each model: how to rollback, who to notify, what to check first
  • Practice rollback drills to ensure the procedure actually works when under pressure
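The circuit-breaker idea can be sketched as a rolling error-rate check (thresholds are illustrative; in practice tripping would trigger the traffic cutover described above):

```python
from collections import deque

class CircuitBreaker:
    # Trips, signalling auto-rollback, when the error rate over the last
    # `window` requests exceeds `threshold`.
    def __init__(self, threshold=0.05, window=200, min_requests=50):
        self.threshold = threshold
        self.min_requests = min_requests
        self.outcomes = deque(maxlen=window)
        self.tripped = False

    def record(self, ok):
        self.outcomes.append(bool(ok))
        if len(self.outcomes) >= self.min_requests:
            error_rate = 1 - sum(self.outcomes) / len(self.outcomes)
            if error_rate > self.threshold:
                self.tripped = True   # stays tripped until manually reset
        return self.tripped
```

Requiring `min_requests` before evaluating prevents a single early failure from tripping the breaker on a cold start.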

The Deployment Reality Check

The most sophisticated ML system is worthless if it can't be maintained in production. Prioritize: (1) correctness and safety of predictions, (2) observability so you know when something is wrong, (3) fast rollback capability, and (4) reproducible retraining pipelines. Fancy serving infrastructure is secondary to these fundamentals. Many successful production ML systems run as simple batch jobs or single-server REST APIs that are well-monitored and easily debugged.

Local LLM Inference Tools

Running large language models locally — on your own hardware — removes cloud costs, network latency, and data-privacy concerns. Three tools dominate the self-hosted LLM ecosystem. Each targets a different use case: Ollama optimizes for ease of use, llama.cpp for CPU/resource-constrained environments, and vLLM for maximum GPU throughput in production.

Ollama

The easiest way to run open-source LLMs. A single CLI downloads, manages, and serves quantized models with zero configuration. Exposes an OpenAI-compatible REST API on port 11434. Supports NVIDIA CUDA, Apple Metal, and AMD ROCm acceleration out of the box.

Best for: Developers, home labs, and anyone who wants a model running in under five minutes. Powers tools like Open WebUI, Dify, and LangChain local integrations.
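Calling a local Ollama server needs nothing beyond the Python standard library. A sketch against the native `/api/generate` endpoint (the model name is illustrative; use whatever you have pulled locally):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(model, prompt):
    # stream=False asks Ollama to return a single JSON object instead of
    # a stream of partial chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("llama3.2", "Why is the sky blue?")
# With an Ollama server running locally:
# reply = json.loads(urllib.request.urlopen(req).read())["response"]
```

Because Ollama also exposes an OpenAI-compatible API, existing OpenAI client code can usually be pointed at `localhost:11434` unchanged.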

llama.cpp

A pure C/C++ inference engine that runs quantized LLMs with no Python dependency. Originally a LLaMA port, it now supports 50+ architectures. Defined the GGUF model format and supports partial GPU offloading — load as many layers as fit in VRAM, run the rest on CPU.

Best for: CPU-first deployments, embedded systems, servers without a discrete GPU, or scenarios where Python runtimes are unavailable.

vLLM

A production inference engine from UC Berkeley built around PagedAttention — a KV-cache management technique that treats GPU memory like OS virtual memory, eliminating padding waste. Combined with continuous batching, vLLM achieves 10–24× higher throughput than naive HuggingFace serving.

Best for: Production deployments with sustained load, multi-user APIs, and workloads where GPU throughput matters. Requires NVIDIA GPU (≥16 GB VRAM recommended).

Choosing the Right Tool

| Criterion | Ollama | llama.cpp | vLLM |
| --- | --- | --- | --- |
| Setup difficulty | ⭐ Easiest | Moderate (compile) | Moderate (GPU needed) |
| GPU required? | No (accelerates if present) | No (offloads if present) | Yes (NVIDIA recommended) |
| macOS support | ✅ Full (Metal) | ✅ Full (Metal) | ⚠️ CPU only |
| Throughput (GPU) | Good | Good | Excellent (PagedAttention) |
| CPU inference | Supported | Excellent (AVX2) | Limited |
| Model format | GGUF (auto-managed) | GGUF (manual) | HuggingFace (safetensors) |
| OpenAI API compat | ✅ Yes | ✅ Yes | ✅ Yes |
| Best for | Dev, home lab, desktop | CPU servers, edge, embedded | Production, high-load APIs |