Taking a trained model from notebook to production
Before a model can be deployed, it must be serialized — saved from its in-memory representation to a persistent format on disk or in object storage. The choice of serialization format affects portability, security, performance, and cross-language compatibility.
Python pickle files can execute arbitrary code when deserialized. Loading a pickle file from an untrusted source is equivalent to running a shell script from that source. Never unpickle model files downloaded from public repositories without verification. For sharing models publicly, use ONNX, SafeTensors, or PMML — formats that cannot execute arbitrary code during loading.
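The danger is easy to demonstrate with the standard library alone. The `Malicious` class below is a minimal illustration (the class name and flag are made up for this sketch): its `__reduce__` hook makes `pickle.loads` execute an arbitrary callable at load time. Here it only sets a flag; a real payload could run any shell command.

```python
import builtins
import pickle

class Malicious:
    """Any object can smuggle code into a pickle via __reduce__."""
    def __reduce__(self):
        # The returned callable runs during pickle.loads -- a real attack
        # would call os.system(...); this just sets a flag we can observe.
        return (exec, ("import builtins; builtins.PWNED = True",))

payload = pickle.dumps(Malicious())   # looks like an ordinary model blob
pickle.loads(payload)                 # silently executes the embedded code
print(getattr(builtins, "PWNED", False))  # -> True
```

This is exactly why formats like ONNX and SafeTensors, which store only tensors and graph metadata, are preferred for public distribution.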
| Scenario | Recommended Format | Key Reason |
|---|---|---|
| Internal scikit-learn model | Joblib | Efficient for numpy arrays; simple; Python ecosystem |
| Cross-language / production deployment | ONNX | Language-agnostic; hardware-optimized inference runtimes |
| TensorFlow production serving | SavedModel + TF Serving | Native format; supports batching, versioning, warm-up |
| PyTorch production (C++ backend) | TorchScript | Runs without Python interpreter |
| Sharing LLMs publicly | SafeTensors | Security; fast loading; HuggingFace ecosystem |
| Mobile / edge deployment | TFLite / Core ML / ONNX Mobile | Optimized for limited compute and memory |
How you serve predictions depends on latency requirements, throughput needs, compute budget, and where the request originates. The four primary patterns span a spectrum from real-time to deferred to embedded inference.
The first pattern is the REST API: expose the model as an HTTP endpoint. Clients send a JSON payload with input features; the server responds with predictions synchronously. This is the most common production pattern for interactive applications requiring low-latency responses.
Frameworks: FastAPI (Python, async, auto OpenAPI docs), Flask (simple, synchronous), TF Serving (TensorFlow-native, gRPC + HTTP), Triton Inference Server (NVIDIA, multi-framework, batching), Ray Serve (Python, scalable), BentoML (packaging + serving).
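The request/response contract can be sketched framework-free with only the standard library; the frameworks above wrap this same shape with routing, validation, and async I/O. `predict` here is a stand-in for a real model, and the route name is illustrative.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    # Stand-in for a real model's predict() on one feature vector.
    return {"score": sum(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        payload = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())
print(result)  # {'score': 6.0}
server.shutdown()
```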
The second pattern is batch prediction: process large volumes of data offline on a schedule (hourly, daily, weekly). Predictions are written to a database or file store where downstream systems consume them. There is no real-time latency requirement.
Ideal when predictions can be precomputed: e-commerce product scoring, content recommendation pre-ranking, overnight risk scoring, fraud review queues.
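A minimal sketch of the pattern, with a toy `predict` function and in-memory CSV buffers standing in for real storage; a production job would read from and write to object storage or a warehouse table on a scheduler like Airflow or cron.

```python
import csv
import io

def predict(row):
    # Stand-in for model.predict on one record.
    return round(float(row["amount"]) * 0.1, 2)

# Simulated nightly job: read yesterday's records, score them, and
# write predictions for downstream systems to consume.
source = io.StringIO("id,amount\n1,100\n2,250\n")
sink = io.StringIO()

reader = csv.DictReader(source)
writer = csv.writer(sink)
writer.writerow(["id", "score"])
for row in reader:
    writer.writerow([row["id"], predict(row)])

print(sink.getvalue())
```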
The third pattern is streaming inference: compute predictions as events flow through a data pipeline in near-real-time, integrated with message queues (Kafka, Kinesis, Pub/Sub). The model is applied to each event as it arrives, with results published downstream.
Used for: fraud detection on payment events, real-time content moderation, IoT anomaly detection, live ad bidding pipelines.
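The consumer loop can be sketched with a standard-library queue standing in for a Kafka topic; `predict` and the event fields are illustrative.

```python
import queue
import threading

events = queue.Queue()   # stands in for a Kafka/Kinesis topic
results = []             # stands in for the downstream results topic

def predict(event):
    # Stand-in for a fraud model scoring one payment event.
    return {"event_id": event["id"], "fraud_score": event["amount"] / 1000}

def consumer():
    """Apply the model to each event as it arrives."""
    while True:
        event = events.get()
        if event is None:        # sentinel: shut down cleanly
            break
        results.append(predict(event))  # in production: publish downstream

worker = threading.Thread(target=consumer)
worker.start()
for e in ({"id": 1, "amount": 120.0}, {"id": 2, "amount": 980.0}):
    events.put(e)
events.put(None)
worker.join()
print(results)
```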
The fourth pattern is edge (on-device) inference: run the model entirely on the user's device (phone, browser, embedded system). There is no network round-trip; predictions stay private; inference works offline. This requires aggressive model compression.
Used for: keyboard next-word prediction, on-device wake word detection (Alexa/Siri), autonomous vehicle perception, medical device diagnostics.
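One of the simplest compression techniques behind edge deployment is post-training quantization. Below is a minimal sketch of symmetric int8 quantization over a flat list of float weights; real toolchains such as TFLite quantize per-tensor or per-channel and handle activations too.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero weights
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.81, -0.33, 0.05, -1.27]      # illustrative float weights
q, scale = quantize_int8(weights)         # stored as int8 plus one float
restored = dequantize(q, scale)
# Round-trip error is bounded by scale / 2 per weight
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

Each weight shrinks from 4 bytes to 1, at the cost of a bounded rounding error; the model metadata keeps only the scale factor.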
Choosing where to run your model involves balancing cost, operational burden, latency, scalability, and team expertise. Options range from fully managed platforms to self-managed containers.
| Option | Description | Ops Burden | Cost | Best For |
|---|---|---|---|---|
| Docker + VM | Package model + runtime in a container, run on a single VM | High (you manage everything) | Low | Small teams, prototypes, self-hosted environments |
| Kubernetes (K8s) | Orchestrate model containers across a cluster with auto-scaling, rolling updates | High (cluster management) | Moderate | Large-scale production, multi-model serving, internal platforms |
| AWS SageMaker | Fully managed ML platform — train, tune, deploy, monitor in one service | Low | High | AWS shops wanting end-to-end managed ML lifecycle |
| Google Vertex AI | GCP's managed ML platform with AutoML, custom training, online/batch prediction | Low | High | GCP shops, BigQuery data, integration with GCP data stack |
| Azure ML | Microsoft's managed ML platform with designer, pipelines, and model registry | Low | High | Azure shops, enterprise environments with existing Microsoft contracts |
| Serverless Functions | AWS Lambda, Google Cloud Functions — spin up on demand, no always-on server | Very low | Pay-per-request | Low-frequency requests; lightweight models; cost-sensitive workloads |
| Managed Inference APIs | Replicate, Modal, Banana, Runpod — deploy a Docker image, get an endpoint | Minimal | Per-second GPU billing | GPU models without K8s overhead; fast prototyping |
Pin exact versions of all packages (Python, CUDA, framework) in your Dockerfile. Use multi-stage builds to separate the training environment from the lean serving image. Include a health-check endpoint. Set memory limits to prevent OOM crashes from taking down the host. Copy model weights into the image rather than fetching at runtime to make deployments deterministic.
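A hedged multi-stage Dockerfile sketch reflecting these practices; the image tags, file names, port, and serving module are illustrative, not prescriptive.

```dockerfile
# Stage 1: heavyweight build environment
# (requirements.txt pins exact package versions)
FROM python:3.11-slim AS build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: lean serving image
FROM python:3.11-slim
COPY --from=build /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
WORKDIR /app
# Bake the weights into the image so deployments are deterministic
COPY model.joblib serve.py ./
EXPOSE 8000
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["python", "serve.py"]
```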
Start simple. A FastAPI app in a single Docker container handles thousands of requests per second for small models. Don't add Kubernetes until you have real scaling needs that Docker Compose or a single VM cannot handle. Premature orchestration complexity kills productivity. Scale when the load demands it, not when the architecture anticipates it.
Deploying a model is not the end of the work — it's the beginning of a maintenance responsibility. Real-world data evolves, user behavior changes, and the distribution your model was trained on diverges from what it sees in production. Without monitoring, you won't know when your model has silently degraded.
Data drift. Definition: The statistical distribution of input features changes over time, even if the relationship between features and target remains the same.
Example: A demand forecasting model trained on pre-pandemic shopping patterns sees completely different seasonality patterns after the pandemic. The input features (day of week, promotions) are the same but their distributions shifted.
Detection: Track feature distribution statistics (mean, std, quantiles) over time. Use statistical tests like Kolmogorov-Smirnov or Population Stability Index (PSI) to flag significant shifts.
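PSI can be sketched in a few lines of pure Python. The 10-bin default and the small epsilon for empty bins are conventional choices, not part of any standard; monitoring tools such as Evidently ship hardened implementations.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample of one feature. Rule of thumb: < 0.1 stable, 0.1-0.2 moderate
    shift, > 0.2 major shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] += 1e-9  # make the top edge inclusive

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # epsilon keeps log() finite when a bin is empty
        return [c / len(sample) or 1e-6 for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]   # training-time distribution
live = [x + 0.5 for x in baseline]         # shifted production data
print(psi(baseline, baseline) < 0.1, psi(baseline, live) > 0.2)  # -> True True
```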
Concept drift. Definition: The relationship between input features and the target variable changes over time. The model's learned mapping is no longer correct even on correctly distributed data.
Example: A fraud detection model trained before a new fraud technique emerged performs well on old patterns but misses the new attack vector — the meaning of "fraudulent" transaction has changed.
Detection: Monitor prediction accuracy on labeled data as labels become available (delayed ground truth). Track prediction score distributions for unexpected shifts.
| Metric Category | What to Track | Tools | Alert When |
|---|---|---|---|
| System Metrics | Latency (p50/p95/p99), throughput (req/s), error rate, CPU/GPU/memory utilization | Prometheus, Grafana, Datadog, CloudWatch | p95 latency >2x baseline; error rate >1% |
| Prediction Metrics | Output score distribution, class balance of predictions, null/invalid output rate | Evidently AI, Arize, Fiddler, custom logging | Score distribution PSI > 0.2; null rate increases |
| Input Data Quality | Feature nulls, out-of-range values, schema violations, cardinality explosions | Great Expectations, Evidently, custom validators | Null rate increases by >5% on any feature |
| Business Metrics | Click-through rate, conversion rate, revenue impact, downstream KPIs | BI dashboards, A/B testing platform | KPI degrades significantly from baseline |
| Ground Truth Accuracy | Model accuracy as true labels become available (may be days/weeks delayed) | Custom pipelines, Arize, WhyLabs | Accuracy drops >5% from deployment baseline |
MLOps applies DevOps principles to the ML lifecycle — bringing automation, reproducibility, and reliable delivery to a domain that has historically been artisanal and notebook-driven. Mature MLOps practices dramatically reduce the time from model development to reliable production deployment.
Model versioning and registries. Every model artifact should be versioned with full provenance: which training data, which code commit, and which hyperparameters produced it. A model registry (MLflow, Weights & Biases, SageMaker Registry, Vertex Model Registry) stores these artifacts with metadata.
CI/CD for ML. Automated pipelines run on every commit to validate, evaluate, and (optionally) deploy model changes, catching regressions before they reach production.
Shadow and canary deployment. Before fully switching to a new model, validate it against the current production model on live traffic.
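A minimal sketch of canary routing with hypothetical one-argument models and a seeded RNG for reproducibility; a real system would split traffic at the load balancer and log the variant tag so metrics can be compared per variant.

```python
import random

def canary_predict(request, prod_model, candidate_model, fraction, rng):
    """Route a small fraction of live traffic to the candidate model and
    tag each response so downstream metrics can be split by variant."""
    if rng.random() < fraction:
        return "candidate", candidate_model(request)
    return "production", prod_model(request)

# Hypothetical stand-in models that score a request dict.
prod = lambda r: r["x"] * 1.0
cand = lambda r: r["x"] * 1.1

rng = random.Random(0)  # seeded for a reproducible traffic split
routes = [canary_predict({"x": 1.0}, prod, cand, 0.1, rng)[0]
          for _ in range(1000)]
share = routes.count("candidate") / len(routes)
print(f"candidate traffic share: {share:.3f}")  # ~0.10
```

In shadow mode the candidate would instead score every request, with its output logged but never returned to the client.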
Rollback. Production models will occasionally fail. Having a fast, tested rollback procedure is as important as a good deployment procedure.
The most sophisticated ML system is worthless if it can't be maintained in production. Prioritize: (1) correctness and safety of predictions, (2) observability so you know when something is wrong, (3) fast rollback capability, and (4) reproducible retraining pipelines. Fancy serving infrastructure is secondary to these fundamentals. Many successful production ML systems run as simple batch jobs or single-server REST APIs that are well-monitored and easily debugged.
Running large language models locally — on your own hardware — removes cloud costs, network latency, and data-privacy concerns. Three tools dominate the self-hosted LLM ecosystem, each targeting a different use case: Ollama optimizes for ease of use, llama.cpp for CPU and resource-constrained environments, and vLLM for maximum GPU throughput in production.
Ollama is the easiest way to run open-source LLMs: a single CLI downloads, manages, and serves quantized models with zero configuration. It exposes an OpenAI-compatible REST API on port 11434 and supports NVIDIA CUDA, Apple Metal, and AMD ROCm acceleration out of the box.
Best for: Developers, home labs, and anyone who wants a model running in under five minutes. Powers tools like Open WebUI, Dify, and LangChain local integrations.
- Install: `brew install ollama` on macOS (installers are also available for Linux and Windows)
- OpenAI compatibility: point an existing OpenAI client's `base_url` at `http://localhost:11434/v1` and existing code works unchanged

llama.cpp is a pure C/C++ inference engine that runs quantized LLMs with no Python dependency. Originally a LLaMA port, it now supports 50+ architectures. It defined the GGUF model format and supports partial GPU offloading: load as many layers as fit in VRAM, run the rest on CPU.
Best for: CPU-first deployments, embedded systems, servers without a discrete GPU, or scenarios where Python runtimes are unavailable.
- Models: quantized GGUF files from HuggingFace; the `bartowski` and `ggml-org` repos maintain current quantizations
- Serving: `llama-server` exposes an OpenAI-compatible API on any port; `llama-cpp-python` for in-process use

vLLM is a production inference engine from UC Berkeley built around PagedAttention, a KV-cache management technique that treats GPU memory like OS virtual memory, eliminating padding waste. Combined with continuous batching, vLLM achieves 10–24× higher throughput than naive HuggingFace serving.
Best for: Production deployments with sustained load, multi-user APIs, and workloads where GPU throughput matters. Requires NVIDIA GPU (≥16 GB VRAM recommended).
| Criterion | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Setup difficulty | ⭐ Easiest | Moderate (compile) | Moderate (GPU needed) |
| GPU required? | No (accelerates if present) | No (offloads if present) | Yes (NVIDIA recommended) |
| macOS support | ✅ Full (Metal) | ✅ Full (Metal) | ⚠️ CPU only |
| Throughput (GPU) | Good | Good | Excellent (PagedAttention) |
| CPU inference | Supported | Excellent (AVX2) | Limited |
| Model format | GGUF (auto-managed) | GGUF (manual) | HuggingFace (safetensors) |
| OpenAI API compat | ✅ Yes | ✅ Yes | ✅ Yes |
| Best for | Dev, home lab, desktop | CPU servers, edge, embedded | Production, high-load APIs |