Taking a trained model from notebook to production
Before a model can be deployed, it must be serialized — saved from its in-memory representation to a persistent format on disk or in object storage. The choice of serialization format affects portability, security, performance, and cross-language compatibility.
Python pickle files can execute arbitrary code when deserialized. Loading a pickle file from an untrusted source is equivalent to running a shell script from that source. Never unpickle model files downloaded from public repositories without verification. For sharing models publicly, use ONNX, SafeTensors, or PMML — formats that cannot execute arbitrary code during loading.
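The danger is easy to demonstrate with the standard library alone. The `Malicious` class below is a minimal illustration (the class name and flag are made up for this sketch): its `__reduce__` hook makes `pickle.loads` execute an arbitrary callable at load time. Here it only sets a flag; a real payload could run any shell command.

```python
import builtins
import pickle

class Malicious:
    """Any object can smuggle code into a pickle via __reduce__."""
    def __reduce__(self):
        # The returned callable runs during pickle.loads -- a real attack
        # would call os.system(...); this just sets a flag we can observe.
        return (exec, ("import builtins; builtins.PWNED = True",))

payload = pickle.dumps(Malicious())   # looks like an ordinary model blob
pickle.loads(payload)                 # silently executes the embedded code
print(getattr(builtins, "PWNED", False))  # -> True
```

This is exactly why formats like ONNX and SafeTensors, which store only tensors and graph metadata, are preferred for public distribution.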
| Scenario | Recommended Format | Key Reason |
|---|---|---|
| Internal scikit-learn model | Joblib | Efficient for numpy arrays; simple; Python ecosystem |
| Cross-language / production deployment | ONNX | Language-agnostic; hardware-optimized inference runtimes |
| TensorFlow production serving | SavedModel + TF Serving | Native format; supports batching, versioning, warm-up |
| PyTorch production (C++ backend) | TorchScript | Runs without Python interpreter |
| Sharing LLMs publicly | SafeTensors | Security; fast loading; HuggingFace ecosystem |
| Mobile / edge deployment | TFLite / Core ML / ONNX Mobile | Optimized for limited compute and memory |
How you serve predictions depends on latency requirements, throughput needs, compute budget, and where the request originates. The four primary patterns span a spectrum from real-time to deferred to embedded inference.
The first pattern is the REST API: expose the model as an HTTP endpoint. Clients send a JSON payload with input features; the server responds with predictions synchronously. This is the most common production pattern for interactive applications requiring low-latency responses.
Frameworks: FastAPI (Python, async, auto OpenAPI docs), Flask (simple, synchronous), TF Serving (TensorFlow-native, gRPC + HTTP), Triton Inference Server (NVIDIA, multi-framework, batching), Ray Serve (Python, scalable), BentoML (packaging + serving).
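The request/response contract can be sketched framework-free with only the standard library; the frameworks above wrap this same shape with routing, validation, and async I/O. `predict` here is a stand-in for a real model, and the route name is illustrative.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    # Stand-in for a real model's predict() on one feature vector.
    return {"score": sum(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        payload = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())
print(result)  # {'score': 6.0}
server.shutdown()
```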
The second pattern is batch prediction: process large volumes of data offline on a schedule (hourly, daily, weekly). Predictions are written to a database or file store where downstream systems consume them. There is no real-time latency requirement.
Ideal when predictions can be precomputed: e-commerce product scoring, content recommendation pre-ranking, overnight risk scoring, fraud review queues.
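A minimal sketch of the pattern, with a toy `predict` function and in-memory CSV buffers standing in for real storage; a production job would read from and write to object storage or a warehouse table on a scheduler like Airflow or cron.

```python
import csv
import io

def predict(row):
    # Stand-in for model.predict on one record.
    return round(float(row["amount"]) * 0.1, 2)

# Simulated nightly job: read yesterday's records, score them, and
# write predictions for downstream systems to consume.
source = io.StringIO("id,amount\n1,100\n2,250\n")
sink = io.StringIO()

reader = csv.DictReader(source)
writer = csv.writer(sink)
writer.writerow(["id", "score"])
for row in reader:
    writer.writerow([row["id"], predict(row)])

print(sink.getvalue())
```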
The third pattern is streaming inference: compute predictions as events flow through a data pipeline in near-real-time, integrated with message queues (Kafka, Kinesis, Pub/Sub). The model is applied to each event as it arrives, with results published downstream.
Used for: fraud detection on payment events, real-time content moderation, IoT anomaly detection, live ad bidding pipelines.
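The consumer loop can be sketched with a standard-library queue standing in for a Kafka topic; `predict` and the event fields are illustrative.

```python
import queue
import threading

events = queue.Queue()   # stands in for a Kafka/Kinesis topic
results = []             # stands in for the downstream results topic

def predict(event):
    # Stand-in for a fraud model scoring one payment event.
    return {"event_id": event["id"], "fraud_score": event["amount"] / 1000}

def consumer():
    """Apply the model to each event as it arrives."""
    while True:
        event = events.get()
        if event is None:        # sentinel: shut down cleanly
            break
        results.append(predict(event))  # in production: publish downstream

worker = threading.Thread(target=consumer)
worker.start()
for e in ({"id": 1, "amount": 120.0}, {"id": 2, "amount": 980.0}):
    events.put(e)
events.put(None)
worker.join()
print(results)
```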
The fourth pattern is edge (on-device) inference: run the model entirely on the user's device (phone, browser, embedded system). There is no network round-trip; predictions stay private; inference works offline. This requires aggressive model compression.
Used for: keyboard next-word prediction, on-device wake word detection (Alexa/Siri), autonomous vehicle perception, medical device diagnostics.
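One of the simplest compression techniques behind edge deployment is post-training quantization. Below is a minimal sketch of symmetric int8 quantization over a flat list of float weights; real toolchains such as TFLite quantize per-tensor or per-channel and handle activations too.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero weights
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.81, -0.33, 0.05, -1.27]      # illustrative float weights
q, scale = quantize_int8(weights)         # stored as int8 plus one float
restored = dequantize(q, scale)
# Round-trip error is bounded by scale / 2 per weight
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

Each weight shrinks from 4 bytes to 1, at the cost of a bounded rounding error; the model metadata keeps only the scale factor.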
Choosing where to run your model involves balancing cost, operational burden, latency, scalability, and team expertise. Options range from fully managed platforms to self-managed containers.
| Option | Description | Ops Burden | Cost | Best For |
|---|---|---|---|---|
| Docker + VM | Package model + runtime in a container, run on a single VM | High (you manage everything) | Low | Small teams, prototypes, self-hosted environments |
| Kubernetes (K8s) | Orchestrate model containers across a cluster with auto-scaling, rolling updates | High (cluster management) | Moderate | Large-scale production, multi-model serving, internal platforms |
| AWS SageMaker | Fully managed ML platform — train, tune, deploy, monitor in one service | Low | High | AWS shops wanting end-to-end managed ML lifecycle |
| Google Vertex AI | GCP's managed ML platform with AutoML, custom training, online/batch prediction | Low | High | GCP shops, BigQuery data, integration with GCP data stack |
| Azure ML | Microsoft's managed ML platform with designer, pipelines, and model registry | Low | High | Azure shops, enterprise environments with existing Microsoft contracts |
| Serverless Functions | AWS Lambda, Google Cloud Functions — spin up on demand, no always-on server | Very low | Pay-per-request | Low-frequency requests; lightweight models; cost-sensitive workloads |
| Managed Inference APIs | Replicate, Modal, Banana, Runpod — deploy a Docker image, get an endpoint | Minimal | Per-second GPU billing | GPU models without K8s overhead; fast prototyping |
Pin exact versions of all packages (Python, CUDA, framework) in your Dockerfile. Use multi-stage builds to separate the training environment from the lean serving image. Include a health-check endpoint. Set memory limits to prevent OOM crashes from taking down the host. Copy model weights into the image rather than fetching at runtime to make deployments deterministic.
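A hedged multi-stage Dockerfile sketch reflecting these practices; the image tags, file names, port, and serving module are illustrative, not prescriptive.

```dockerfile
# Stage 1: heavyweight build environment
# (requirements.txt pins exact package versions)
FROM python:3.11-slim AS build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: lean serving image
FROM python:3.11-slim
COPY --from=build /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
WORKDIR /app
# Bake the weights into the image so deployments are deterministic
COPY model.joblib serve.py ./
EXPOSE 8000
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["python", "serve.py"]
```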
Start simple. A FastAPI app in a single Docker container handles thousands of requests per second for small models. Don't add Kubernetes until you have real scaling needs that Docker Compose or a single VM cannot handle. Premature orchestration complexity kills productivity. Scale when the load demands it, not when the architecture anticipates it.
Deploying a model is not the end of the work — it's the beginning of a maintenance responsibility. Real-world data evolves, user behavior changes, and the distribution your model was trained on diverges from what it sees in production. Without monitoring, you won't know when your model has silently degraded.
Data drift. Definition: The statistical distribution of input features changes over time, even if the relationship between features and target remains the same.
Example: A demand forecasting model trained on pre-pandemic shopping patterns sees completely different seasonality patterns after the pandemic. The input features (day of week, promotions) are the same but their distributions shifted.
Detection: Track feature distribution statistics (mean, std, quantiles) over time. Use statistical tests like Kolmogorov-Smirnov or Population Stability Index (PSI) to flag significant shifts.
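PSI can be sketched in a few lines of pure Python. The 10-bin default and the small epsilon for empty bins are conventional choices, not part of any standard; monitoring tools such as Evidently ship hardened implementations.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample of one feature. Rule of thumb: < 0.1 stable, 0.1-0.2 moderate
    shift, > 0.2 major shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] += 1e-9  # make the top edge inclusive

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # epsilon keeps log() finite when a bin is empty
        return [c / len(sample) or 1e-6 for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]   # training-time distribution
live = [x + 0.5 for x in baseline]         # shifted production data
print(psi(baseline, baseline) < 0.1, psi(baseline, live) > 0.2)  # -> True True
```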
Concept drift. Definition: The relationship between input features and the target variable changes over time. The model's learned mapping is no longer correct even on correctly distributed data.
Example: A fraud detection model trained before a new fraud technique emerged performs well on old patterns but misses the new attack vector — the meaning of "fraudulent" transaction has changed.
Detection: Monitor prediction accuracy on labeled data as labels become available (delayed ground truth). Track prediction score distributions for unexpected shifts.
| Metric Category | What to Track | Tools | Alert When |
|---|---|---|---|
| System Metrics | Latency (p50/p95/p99), throughput (req/s), error rate, CPU/GPU/memory utilization | Prometheus, Grafana, Datadog, CloudWatch | p95 latency >2x baseline; error rate >1% |
| Prediction Metrics | Output score distribution, class balance of predictions, null/invalid output rate | Evidently AI, Arize, Fiddler, custom logging | Score distribution PSI > 0.2; null rate increases |
| Input Data Quality | Feature nulls, out-of-range values, schema violations, cardinality explosions | Great Expectations, Evidently, custom validators | Null rate increases by >5% on any feature |
| Business Metrics | Click-through rate, conversion rate, revenue impact, downstream KPIs | BI dashboards, A/B testing platform | KPI degrades significantly from baseline |
| Ground Truth Accuracy | Model accuracy as true labels become available (may be days/weeks delayed) | Custom pipelines, Arize, WhyLabs | Accuracy drops >5% from deployment baseline |
MLOps applies DevOps principles to the ML lifecycle — bringing automation, reproducibility, and reliable delivery to a domain that has historically been artisanal and notebook-driven. Mature MLOps practices dramatically reduce the time from model development to reliable production deployment.
Model versioning and registries. Every model artifact should be versioned with full provenance: which training data, which code commit, and which hyperparameters produced it. A model registry (MLflow, Weights & Biases, SageMaker Registry, Vertex Model Registry) stores these artifacts with metadata.
CI/CD for ML. Automated pipelines run on every commit to validate, evaluate, and (optionally) deploy model changes, catching regressions before they reach production.
Shadow and canary deployment. Before fully switching to a new model, validate it against the current production model on live traffic.
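A minimal sketch of canary routing with hypothetical one-argument models and a seeded RNG for reproducibility; a real system would split traffic at the load balancer and log the variant tag so metrics can be compared per variant.

```python
import random

def canary_predict(request, prod_model, candidate_model, fraction, rng):
    """Route a small fraction of live traffic to the candidate model and
    tag each response so downstream metrics can be split by variant."""
    if rng.random() < fraction:
        return "candidate", candidate_model(request)
    return "production", prod_model(request)

# Hypothetical stand-in models that score a request dict.
prod = lambda r: r["x"] * 1.0
cand = lambda r: r["x"] * 1.1

rng = random.Random(0)  # seeded for a reproducible traffic split
routes = [canary_predict({"x": 1.0}, prod, cand, 0.1, rng)[0]
          for _ in range(1000)]
share = routes.count("candidate") / len(routes)
print(f"candidate traffic share: {share:.3f}")  # ~0.10
```

In shadow mode the candidate would instead score every request, with its output logged but never returned to the client.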
Rollback. Production models will occasionally fail. Having a fast, tested rollback procedure is as important as a good deployment procedure.
The most sophisticated ML system is worthless if it can't be maintained in production. Prioritize: (1) correctness and safety of predictions, (2) observability so you know when something is wrong, (3) fast rollback capability, and (4) reproducible retraining pipelines. Fancy serving infrastructure is secondary to these fundamentals. Many successful production ML systems run as simple batch jobs or single-server REST APIs that are well-monitored and easily debugged.
Running large language models locally — on your own hardware — removes cloud costs, network latency, and data-privacy concerns. Three tools dominate the self-hosted LLM ecosystem, each targeting a different use case: Ollama optimizes for ease of use, llama.cpp for CPU and resource-constrained environments, and vLLM for maximum GPU throughput in production.
Ollama is the easiest way to run open-source LLMs: a single CLI downloads, manages, and serves quantized models with zero configuration. It exposes an OpenAI-compatible REST API on port 11434 and supports NVIDIA CUDA, Apple Metal, and AMD ROCm acceleration out of the box.
Best for: Developers, home labs, and anyone who wants a model running in under five minutes. Powers tools like Open WebUI, Dify, and LangChain local integrations.
- Install: `brew install ollama` on macOS (installers are also available for Linux and Windows)
- OpenAI compatibility: point an existing OpenAI client's `base_url` at `http://localhost:11434/v1` and existing code works unchanged

llama.cpp is a pure C/C++ inference engine that runs quantized LLMs with no Python dependency. Originally a LLaMA port, it now supports 50+ architectures. It defined the GGUF model format and supports partial GPU offloading: load as many layers as fit in VRAM, run the rest on CPU.
Best for: CPU-first deployments, embedded systems, servers without a discrete GPU, or scenarios where Python runtimes are unavailable.
- Models: quantized GGUF files from HuggingFace; the `bartowski` and `ggml-org` repos maintain current quantizations
- Serving: `llama-server` exposes an OpenAI-compatible API on any port; `llama-cpp-python` for in-process use

vLLM is a production inference engine from UC Berkeley built around PagedAttention, a KV-cache management technique that treats GPU memory like OS virtual memory, eliminating padding waste. Combined with continuous batching, vLLM achieves 10–24× higher throughput than naive HuggingFace serving.
Best for: Production deployments with sustained load, multi-user APIs, and workloads where GPU throughput matters. Requires NVIDIA GPU (≥16 GB VRAM recommended).
| Criterion | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Setup difficulty | ⭐ Easiest | Moderate (compile) | Moderate (GPU needed) |
| GPU required? | No (accelerates if present) | No (offloads if present) | Yes (NVIDIA recommended) |
| macOS support | ✅ Full (Metal) | ✅ Full (Metal) | ⚠️ CPU only |
| Throughput (GPU) | Good | Good | Excellent (PagedAttention) |
| CPU inference | Supported | Excellent (AVX2) | Limited |
| Model format | GGUF (auto-managed) | GGUF (manual) | HuggingFace (safetensors) |
| OpenAI API compat | ✅ Yes | ✅ Yes | ✅ Yes |
| Best for | Dev, home lab, desktop | CPU servers, edge, embedded | Production, high-load APIs |