Production-grade inference engine with PagedAttention, continuous batching, and OpenAI-compatible APIs
vLLM is an open-source LLM inference engine developed at UC Berkeley, optimized for maximum GPU throughput in production environments. It introduced PagedAttention — a technique that manages the KV cache like OS virtual memory, dramatically reducing GPU memory fragmentation and enabling much higher throughput than naive implementations. vLLM exposes an OpenAI-compatible REST API, making it a drop-in replacement for the OpenAI API in any existing application.
PagedAttention stores the KV cache in non-contiguous “pages” (similar to virtual memory paging), eliminating the memory waste caused by padding and eager reservation. Continuous batching dynamically adds new requests to in-flight batches as soon as sequence slots free up, maximizing GPU utilization instead of waiting for the entire batch to finish.
Together these techniques deliver 10–24× higher throughput compared to a naive HuggingFace Transformers serving setup, without any change to model weights or output quality.
vLLM exposes /v1/chat/completions, /v1/completions, /v1/models, and /v1/embeddings — endpoints identical to the OpenAI REST API. Any code using the openai Python SDK or curl against OpenAI can be redirected to a local vLLM server simply by changing base_url. Streaming (stream: true) is fully supported across all text generation endpoints.
vLLM ships with the features needed for real production deployments:
- Multi-GPU tensor parallelism via --tensor-parallel-size N
- A Prometheus-compatible /metrics endpoint that works with Grafana dashboards

Hardware: NVIDIA GPU with CUDA Compute Capability ≥ 7.0. Recommended: A10G (24 GB), RTX 3090/4090 (24 GB), L4 (24 GB), A100 (40/80 GB). Minimum: RTX 3080 10 GB for small models (≤7B with quantization). AMD ROCm support is experimental.
CUDA 12.1 or 12.4 (match your PyTorch build). Python 3.9–3.12. PyTorch ≥ 2.3. Both nvidia-smi and nvcc --version must work. The vllm pip package auto-installs a matching PyTorch wheel.
Full-precision (BF16): 7B needs ~14 GB VRAM, 13B needs ~26 GB, 70B needs ~140 GB (multi-GPU). With AWQ/GPTQ 4-bit quantization: 7B fits in ~6 GB, 13B in ~10 GB. System RAM should be ≥ model size for initial weight loading.
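The rule of thumb above (2 bytes per parameter at BF16, 0.5 bytes at 4-bit) can be sketched as a quick calculator. Note it estimates weights only; the KV cache and activation workspace add on top, which is why a 4-bit 7B model needs ~6 GB in practice rather than 3.5 GB:

```python
def weight_vram_gb(params_billion: float, bits_per_param: float = 16) -> float:
    """Estimate VRAM for model weights alone (excludes KV cache and activations)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(weight_vram_gb(7))                     # BF16 7B  -> 14.0 GB
print(weight_vram_gb(70))                    # BF16 70B -> 140.0 GB (multi-GPU)
print(weight_vram_gb(7, bits_per_param=4))   # 4-bit 7B -> 3.5 GB of weights
```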
nvidia-smi
nvcc --version
If nvcc is missing, install it: sudo apt install nvidia-cuda-toolkit
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
This pulls in torch, transformers, tokenizers, and all dependencies (~4–8 GB download). To pin a specific version: pip install vllm==0.6.6
Required for Llama, Gemma, and other gated models:
pip install huggingface_hub
huggingface-cli login   # Paste your token from huggingface.co/settings/tokens
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --host 0.0.0.0
Server is ready when you see: INFO: Uvicorn running on http://0.0.0.0:8000
curl http://localhost:8000/v1/models
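For deployment scripts, a readiness check can poll /v1/models until the model finishes loading. This helper is not part of vLLM; it is a hypothetical sketch using only the standard library, assuming the OpenAI-style response shape (a JSON object whose `data` array holds entries with an `id` field):

```python
import json
import time
import urllib.request
from urllib.error import URLError

def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 600, poll_s: float = 5) -> list[str]:
    """Poll /v1/models until the server answers with at least one model."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                ids = model_ids(json.load(resp))
            if ids:
                return ids
        except (URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    raise TimeoutError(f"vLLM at {base_url} not ready after {timeout_s}s")
```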
Install the NVIDIA Container Toolkit (needed for --gpus in Docker):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --max-model-len 8192
--ipc=host is required for PyTorch shared memory. HF_TOKEN should be set in your shell environment.
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.90
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  huggingface_cache:
Start: HF_TOKEN=hf_xxx docker compose up -d
pip install vllm
Expect slow model load times and very low tokens/sec on CPU.
VLLM_CPU_ONLY=1 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --device cpu \
  --port 8000
Use only small models (≤3B parameters) on CPU.
| Flag | Default | Description |
|---|---|---|
| --model | — | HuggingFace model ID or local path (required) |
| --port | 8000 | API server port |
| --host | 127.0.0.1 | Bind address (use 0.0.0.0 for LAN) |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism |
| --max-model-len | from config | Context length override (tokens) |
| --gpu-memory-utilization | 0.90 | Fraction of GPU VRAM vLLM may use (weights + KV cache) |
| --quantization | none | awq, gptq, fp8, bitsandbytes |
| --dtype | auto | bfloat16, float16, float32 |
| --served-model-name | model name | Alias returned by the /v1/models endpoint |
| --api-key | none | Bearer token for API auth |
| --max-num-seqs | 256 | Max concurrent sequences |
| --enable-prefix-caching | off | Reuse KV-cache blocks for shared prompt prefixes (faster prefill) |
| --enable-chunked-prefill | off | Interleave prompt and decode for better latency |
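When --api-key is set, clients authenticate with the standard OpenAI-style Authorization header. A sketch, assuming the server was launched with --api-key secret123 (a placeholder value):

```shell
# Requests must carry the matching Bearer token
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer secret123"
```

The openai Python SDK sends this header automatically when you pass the same value as api_key.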
Quantization reduces model weight precision to lower VRAM requirements at a small quality cost. vLLM supports loading pre-quantized models from HuggingFace — no quantization step is needed on your end. Simply pass the --quantization flag matching the model's format.
Activation-aware Weight Quantization. Among the strongest 4-bit methods for output quality. Models have an -AWQ suffix on HuggingFace (e.g. TheBloke/Mistral-7B-Instruct-v0.2-AWQ). Launch with --quantization awq. Requires AWQ-quantized model weights.
Post-training quantization based on approximate second-order information. Widely available (many TheBloke HuggingFace repos). Launch with --quantization gptq. Slightly lower quality than AWQ at the same bit-width. Available in 3-bit, 4-bit, and 8-bit variants.
Newer format supported on H100, H200, and some A100 GPUs. Higher quality than INT4 quantization, lower memory than FP16. Launch with --quantization fp8. Requires models saved in FP8 format or dynamic quantization (some accuracy loss).
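For example, serving a pre-quantized AWQ checkpoint might look like the following (the model ID is the community build mentioned above; substitute any AWQ repo):

```shell
# Serve a 4-bit AWQ model: ~6 GB VRAM instead of ~14 GB at BF16
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
```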
# Chat completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain PagedAttention in simple terms."}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Streaming response
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","messages":[{"role":"user","content":"Count to 10"}],"stream":true}'
# List available models
curl http://localhost:8000/v1/models
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="none", # or your --api-key value
)
# Streaming chat
stream = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "What is tensor parallelism?"},
],
stream=True,
temperature=0.7,
max_tokens=256,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
print()
Tensor parallelism splits each transformer layer across N GPUs. Requires N GPUs of the same type on one machine. Enable with --tensor-parallel-size N:
# 2-GPU setup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
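The column/row split behind tensor parallelism can be illustrated with plain NumPy. This is a toy sketch, not vLLM internals: each "GPU" holds half of a layer's weights, and the partial products are summed exactly as an all-reduce would combine them:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # batch of activations
W1 = rng.standard_normal((8, 16))   # first linear layer
W2 = rng.standard_normal((16, 8))   # second linear layer

# Column-parallel: each GPU holds half of W1's columns.
W1_a, W1_b = np.split(W1, 2, axis=1)
h_a, h_b = x @ W1_a, x @ W1_b       # computed independently per GPU

# Row-parallel: each GPU holds the matching half of W2's rows.
W2_a, W2_b = np.split(W2, 2, axis=0)
# Partial products are summed (an all-reduce in a real system).
y_parallel = h_a @ W2_a + h_b @ W2_b

y_full = (x @ W1) @ W2
assert np.allclose(y_parallel, y_full)
```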
Shared memory errors in Docker: make sure the container is started with --ipc=host.

CUDA out of memory
Model too large for available VRAM. Options:
- Use a quantized model (--quantization awq)
- Lower --gpu-memory-utilization to 0.80 to leave headroom
- Reduce --max-model-len to shrink the KV cache allocation
- Split the model across GPUs with --tensor-parallel-size 2

No response from /v1/chat/completions
The model is still loading (1–10 minutes for large models). Wait for the log line INFO: Application startup complete. Check /v1/models — if empty, the model has not finished loading.
gated repo / 403 error
Gated models (Llama, Gemma) require a HuggingFace account agreement. Visit the model's HuggingFace page, accept the license, then set HUGGING_FACE_HUB_TOKEN or run huggingface-cli login.
Model warmup (CUDA graph capture) runs on the first few requests. This is normal behavior — subsequent requests are fast. Use --enforce-eager to skip CUDA graph capture (lower throughput, no warmup delay).
ImportError: cannot import name 'FlashAttention'
Flash Attention 2 requires pip install flash-attn separately (builds from source, ~20 min). vLLM uses its own attention implementation by default; this error appears only when you explicitly request flash_attention_2 via transformers. Not needed for standard vLLM serving.
Verify nvidia-container-toolkit is installed and Docker was restarted after installation. Test with:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi