Production-grade inference engine with PagedAttention, continuous batching, and OpenAI-compatible APIs
vLLM is an open-source LLM inference engine developed at UC Berkeley, optimized for maximum GPU throughput in production environments. It introduced PagedAttention — a technique that manages the KV cache like OS virtual memory, dramatically reducing GPU memory fragmentation and enabling much higher throughput than naive implementations. vLLM exposes an OpenAI-compatible REST API, making it a drop-in replacement for the OpenAI API in any existing application.
PagedAttention stores the KV cache in non-contiguous “pages” (similar to virtual memory paging), eliminating the memory waste caused by padding and eager reservation. Continuous batching dynamically adds new requests to in-flight batches as soon as sequence slots free up, maximizing GPU utilization instead of waiting for the entire batch to finish.
Together these techniques deliver 10–24× higher throughput compared to a naive HuggingFace Transformers serving setup, without any change to model weights or output quality.
vLLM exposes /v1/chat/completions, /v1/completions, /v1/models, and /v1/embeddings — endpoints identical to the OpenAI REST API. Any code using the openai Python SDK or curl against OpenAI can be redirected to a local vLLM server simply by changing base_url. Streaming (stream: true) is fully supported across all text generation endpoints.
vLLM ships with the features needed for real production deployments:
- Multi-GPU tensor parallelism via --tensor-parallel-size N
- A Prometheus-compatible /metrics endpoint that works with Grafana dashboards

Hardware: NVIDIA GPU with CUDA Compute Capability ≥ 7.0. Recommended: A10G (24 GB), RTX 3090/4090 (24 GB), L4 (24 GB), A100 (40/80 GB). Minimum: RTX 3080 10 GB for small models (≤7B with quantization). AMD ROCm support is experimental.
CUDA 12.1 or 12.4 (match your PyTorch build). Python 3.9–3.12. PyTorch ≥ 2.3. Both nvidia-smi and nvcc --version must work. The vllm pip package auto-installs a matching PyTorch wheel.
Full-precision (BF16): 7B needs ~14 GB VRAM, 13B needs ~26 GB, 70B needs ~140 GB (multi-GPU). With AWQ/GPTQ 4-bit quantization: 7B fits in ~6 GB, 13B in ~10 GB. System RAM should be ≥ model size for initial weight loading.
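The rule of thumb above (2 bytes per parameter at BF16, 0.5 bytes at 4-bit) can be sketched as a quick calculator. Note it estimates weights only; the KV cache and activation workspace add on top, which is why a 4-bit 7B model needs ~6 GB in practice rather than 3.5 GB:

```python
def weight_vram_gb(params_billion: float, bits_per_param: float = 16) -> float:
    """Estimate VRAM for model weights alone (excludes KV cache and activations)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(weight_vram_gb(7))                     # BF16 7B  -> 14.0 GB
print(weight_vram_gb(70))                    # BF16 70B -> 140.0 GB (multi-GPU)
print(weight_vram_gb(7, bits_per_param=4))   # 4-bit 7B -> 3.5 GB of weights
```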
nvidia-smi
nvcc --version
If nvcc is missing, install it: sudo apt install nvidia-cuda-toolkit
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
This pulls in torch, transformers, tokenizers, and all dependencies (~4–8 GB download). To pin a specific version: pip install vllm==0.6.6
Required for Llama, Gemma, and other gated models:
pip install huggingface_hub
huggingface-cli login   # Paste your token from huggingface.co/settings/tokens
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --host 0.0.0.0
Server is ready when you see: INFO: Uvicorn running on http://0.0.0.0:8000
curl http://localhost:8000/v1/models
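For deployment scripts, a readiness check can poll /v1/models until the model finishes loading. This helper is not part of vLLM; it is a hypothetical sketch using only the standard library, assuming the OpenAI-style response shape (a JSON object whose `data` array holds entries with an `id` field):

```python
import json
import time
import urllib.request
from urllib.error import URLError

def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 600, poll_s: float = 5) -> list[str]:
    """Poll /v1/models until the server answers with at least one model."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                ids = model_ids(json.load(resp))
            if ids:
                return ids
        except (URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    raise TimeoutError(f"vLLM at {base_url} not ready after {timeout_s}s")
```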
Install the NVIDIA Container Toolkit (needed for --gpus in Docker):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --max-model-len 8192
--ipc=host is required for PyTorch shared memory. HF_TOKEN should be set in your shell environment.
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.90
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  huggingface_cache:
Start: HF_TOKEN=hf_xxx docker compose up -d
pip install vllm
Expect slow model load times and very low tokens/sec on CPU.
VLLM_CPU_ONLY=1 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --device cpu \
  --port 8000
Use only small models (≤3B parameters) on CPU.
| Flag | Default | Description |
|---|---|---|
| --model | — | HuggingFace model ID or local path (required) |
| --port | 8000 | API server port |
| --host | 127.0.0.1 | Bind address (use 0.0.0.0 for LAN) |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism |
| --max-model-len | from config | Context length override (tokens) |
| --gpu-memory-utilization | 0.90 | Fraction of GPU VRAM vLLM may use (weights + KV cache) |
| --quantization | none | awq, gptq, fp8, bitsandbytes |
| --dtype | auto | bfloat16, float16, float32 |
| --served-model-name | model name | Alias returned by the /v1/models endpoint |
| --api-key | none | Bearer token for API auth |
| --max-num-seqs | 256 | Max concurrent sequences |
| --enable-prefix-caching | off | Reuse KV-cache blocks for shared prompt prefixes (faster prefill) |
| --enable-chunked-prefill | off | Interleave prompt and decode for better latency |
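When --api-key is set, clients authenticate with the standard OpenAI-style Authorization header. A sketch, assuming the server was launched with --api-key secret123 (a placeholder value):

```shell
# Requests must carry the matching Bearer token
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer secret123"
```

The openai Python SDK sends this header automatically when you pass the same value as api_key.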
Quantization reduces model weight precision to lower VRAM requirements at a small quality cost. vLLM supports loading pre-quantized models from HuggingFace — no quantization step is needed on your end. Simply pass the --quantization flag matching the model's format.
Activation-aware Weight Quantization. Among the strongest 4-bit methods for output quality. Models have an -AWQ suffix on HuggingFace (e.g. TheBloke/Mistral-7B-Instruct-v0.2-AWQ). Launch with --quantization awq. Requires AWQ-quantized model weights.
Post-training quantization based on approximate second-order information. Widely available (many TheBloke HuggingFace repos). Launch with --quantization gptq. Slightly lower quality than AWQ at the same bit-width. Available in 3-bit, 4-bit, and 8-bit variants.
Newer format supported on H100, H200, and some A100 GPUs. Higher quality than INT4 quantization, lower memory than FP16. Launch with --quantization fp8. Requires models saved in FP8 format or dynamic quantization (some accuracy loss).
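For example, serving a pre-quantized AWQ checkpoint might look like the following (the model ID is the community build mentioned above; substitute any AWQ repo):

```shell
# Serve a 4-bit AWQ model: ~6 GB VRAM instead of ~14 GB at BF16
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
```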
# Chat completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain PagedAttention in simple terms."}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Streaming response
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","messages":[{"role":"user","content":"Count to 10"}],"stream":true}'
# List available models
curl http://localhost:8000/v1/models
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="none", # or your --api-key value
)
# Streaming chat
stream = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "What is tensor parallelism?"},
],
stream=True,
temperature=0.7,
max_tokens=256,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
print()
Tensor parallelism splits each transformer layer across N GPUs. Requires N GPUs of the same type on one machine. Enable with --tensor-parallel-size N:
# 2-GPU setup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
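The column/row split behind tensor parallelism can be illustrated with plain NumPy. This is a toy sketch, not vLLM internals: each "GPU" holds half of a layer's weights, and the partial products are summed exactly as an all-reduce would combine them:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # batch of activations
W1 = rng.standard_normal((8, 16))   # first linear layer
W2 = rng.standard_normal((16, 8))   # second linear layer

# Column-parallel: each GPU holds half of W1's columns.
W1_a, W1_b = np.split(W1, 2, axis=1)
h_a, h_b = x @ W1_a, x @ W1_b       # computed independently per GPU

# Row-parallel: each GPU holds the matching half of W2's rows.
W2_a, W2_b = np.split(W2, 2, axis=0)
# Partial products are summed (an all-reduce in a real system).
y_parallel = h_a @ W2_a + h_b @ W2_b

y_full = (x @ W1) @ W2
assert np.allclose(y_parallel, y_full)
```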
Shared memory errors in Docker: make sure the container is started with --ipc=host.

CUDA out of memory
Model too large for available VRAM. Options:
- Use a quantized model (--quantization awq)
- Lower --gpu-memory-utilization to 0.80 to leave headroom
- Reduce --max-model-len to shrink the KV cache allocation
- Split the model across GPUs with --tensor-parallel-size 2

No response from /v1/chat/completions
The model is still loading (1–10 minutes for large models). Wait for the log line INFO: Application startup complete. Check /v1/models — if empty, the model has not finished loading.
gated repo / 403 error
Gated models (Llama, Gemma) require a HuggingFace account agreement. Visit the model's HuggingFace page, accept the license, then set HUGGING_FACE_HUB_TOKEN or run huggingface-cli login.
Model warmup (CUDA graph capture) runs on the first few requests. This is normal behavior — subsequent requests are fast. Use --enforce-eager to skip CUDA graph capture (lower throughput, no warmup delay).
ImportError: cannot import name 'FlashAttention'
Flash Attention 2 requires pip install flash-attn separately (builds from source, ~20 min). vLLM uses its own attention implementation by default; this error appears only when you explicitly request flash_attention_2 via transformers. Not needed for standard vLLM serving.
Verify nvidia-container-toolkit is installed and Docker was restarted after installation. Test with:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi