Run quantized LLMs on CPU or GPU with a single self-contained binary — no Python runtime required
llama.cpp is a high-performance LLM inference engine written in pure C/C++. Originally created by Georgi Gerganov as a port of Meta's LLaMA inference code, it has grown into a full inference ecosystem supporting 50+ model architectures, multiple hardware backends, and the GGUF quantization format. Its defining advantage: a single compiled binary with zero runtime dependencies can load and run billion-parameter models on commodity hardware.
Supported architectures include Llama 3, Mistral, Gemma, Phi, Qwen, DeepSeek, and dozens more. The binary has zero Python dependencies and runs on any platform that can compile C++.
A single executable handles the entire inference pipeline: model loading, quantization, KV-cache management, context windowing, and sampling strategies (greedy, top-k, top-p, Mirostat). No virtual environment, no pip, no runtime to maintain.
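The sampling strategies named above compose in a simple way: top-k first trims the candidate set, then top-p (nucleus) keeps the smallest prefix of probability mass. A simplified illustrative sketch in Python — not llama.cpp's actual implementation:

```python
import math
import random

def sample(logits, top_k=0, top_p=1.0, temperature=1.0, rng=random):
    """Pick a token id from raw logits using top-k then top-p (nucleus) filtering.
    top_k=1 is greedy; top_k=0 disables the k filter; top_p=1.0 disables nucleus."""
    scaled = [x / temperature for x in logits]
    # Candidates ranked by logit, highest first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving candidates
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        mass += p
        if mass >= top_p:
            break
    # Draw from the renormalized survivors
    r = rng.random() * sum(p for _, p in kept)
    for idx, p in kept:
        r -= p
        if r <= 0:
            return idx
    return kept[-1][0]
```

With `top_k=1` the highest-logit token always wins, which is exactly greedy decoding; Mirostat is a feedback-controlled variant not shown here.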
llama.cpp defined the GGUF format (successor to GGML), now the de-facto standard for quantized LLM distribution. Models are single files containing weights, tokenizer vocabulary, and all necessary metadata — everything needed for inference is self-contained.
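The self-contained claim is easy to verify: the fixed GGUF header is small enough to parse by hand. A sketch that reads the magic bytes, format version, and tensor/metadata counts (all little-endian, per the GGUF spec):

```python
import struct

def read_gguf_header(path):
    """Parse the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, = struct.unpack("<Q", f.read(8))
        n_kv, = struct.unpack("<Q", f.read(8))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

The metadata key-value section that follows the header carries the tokenizer vocabulary and chat template, which is why no sidecar files are needed.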
Models are available in quantization levels from Q2_K (smallest, most quality-degraded) through Q8_0 (near-lossless, ~99% of FP16 quality). The most popular quantized GGUF builds are hosted on HuggingFace by the bartowski and ggml-org organizations, covering nearly every major open-weight model.
CPU inference uses AVX2/AVX-512 SIMD optimization for maximum throughput on x86 hardware. GPU backends include CUDA (NVIDIA), Metal (Apple Silicon), ROCm/HIPBlas (AMD), and Vulkan (cross-platform, any GPU).
A particularly powerful feature is partial GPU offloading: the -ngl flag controls how many transformer layers run on GPU versus CPU RAM. This lets you split a model that doesn't fully fit in VRAM — layers on GPU run at full GPU speed, overflow layers run on CPU. Practical for running 13B+ models on a consumer GPU.
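A back-of-envelope helper for picking a starting `-ngl` value. This sketch assumes layers are roughly equal in size and reserves a flat overhead for the KV cache and compute buffers — both rough simplifications, so treat the result as a first guess to tune from:

```python
import math

def gpu_layers_that_fit(model_file_gb, n_layers, vram_gb, overhead_gb=1.5):
    """Estimate how many transformer layers fit in VRAM for the -ngl flag.
    Assumes equal-sized layers and a fixed overhead (both rough guesses)."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = vram_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))
```

For example, a hypothetical 10 GB GGUF with 40 layers on an 8 GB card suggests `-ngl 26`; if the server still reports out-of-memory, step the value down.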
llama.cpp can be installed via package managers or compiled from source. Compiling from source is required for cutting-edge model support and GPU backends on Linux. On macOS, Homebrew provides a convenient pre-built option with Metal acceleration.
The recommended path for most macOS users. Metal GPU acceleration is enabled automatically on Apple Silicon.
brew install llama.cpp
This installs llama-server, llama-cli, and all tools to /opt/homebrew/bin/ (Apple Silicon) or /usr/local/bin/ (Intel). No PATH changes needed.
Needed for the very latest model support or experimental features not yet in the Homebrew formula.
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
```
Binaries will be in build/bin/. Metal (-DGGML_METAL=ON) enables GPU acceleration on Apple Silicon.
llama-server --version
You should see the version string and the list of enabled backends (Metal will appear for Apple Silicon Homebrew installs).
The Homebrew build is usually a few commits behind the main branch. Compile from source if you need the latest model architectures or experimental sampling features.
```bash
sudo apt update && sudo apt install -y \
  build-essential cmake git curl \
  libcurl4-openssl-dev
```
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
CPU-only build:
```bash
cmake -B build
cmake --build build --config Release -j$(nproc)
```
NVIDIA CUDA build (requires CUDA Toolkit 12.x):
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```
Verify CUDA: ./build/bin/llama-server --version should show CUDA in the backend list.
AMD ROCm build:
```bash
cmake -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS="gfx1100"
cmake --build build --config Release -j$(nproc)
```
Replace gfx1100 with your GPU's target: gfx906 for Vega, gfx1030 for RX 6000 series.
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
Without this, invoke binaries as ./build/bin/llama-server from inside the repo directory.
docker pull ghcr.io/ggerganov/llama.cpp:server
mkdir -p ~/llama-models
Download a .gguf file into ~/llama-models before running the container. See Section 3 below for download instructions.
```bash
docker run -p 8080:8080 \
  -v ~/llama-models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096 -np 4
```
GPU-accelerated (add --gpus all and use the CUDA image tag):

```bash
docker run --gpus all \
  -p 8080:8080 \
  -v ~/llama-models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096 -ngl 99 -np 4
```
The :server-cuda tag is pre-built with CUDA support. -ngl 99 offloads all layers to GPU.
GGUF models are hosted on HuggingFace. The bartowski user maintains high-quality quantized builds of most popular open-weight models. Install the HuggingFace CLI to download specific quantization files without pulling the entire repository:
```bash
# Install the HuggingFace CLI
pip install huggingface_hub

# Download a specific quantization (Q4_K_M recommended for balance)
huggingface-cli download \
  bartowski/Llama-3.2-3B-Instruct-GGUF \
  --include "Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/llama-models

# Download Mistral 7B Q5_K_M
huggingface-cli download \
  bartowski/Mistral-7B-Instruct-v0.3-GGUF \
  --include "Mistral-7B-Instruct-v0.3-Q5_K_M.gguf" \
  --local-dir ~/llama-models
```
Or download directly via wget using the URL from the HuggingFace file page:
```bash
# Download directly via wget (get the URL from the HuggingFace file page)
wget -P ~/llama-models \
  "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"
```
The suffix in a GGUF filename indicates quantization level. Lower quantization = smaller file + faster inference + lower quality. Q4_K_M is the recommended default for most use cases, offering a practical balance of size, speed, and output quality.
| Quantization | Bits/weight | 7B Model Size | Quality vs FP16 | Best For |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~2.8 GB | ~75% | Extreme RAM constraints |
| Q3_K_M | ~3.4 | ~3.3 GB | ~83% | Very constrained RAM |
| Q4_K_M | ~4.8 | ~4.4 GB | ~90% | Recommended default |
| Q5_K_M | ~5.7 | ~5.2 GB | ~94% | High quality, moderate RAM |
| Q6_K | ~6.6 | ~6.0 GB | ~97% | Near-lossless, more VRAM |
| Q8_0 | ~8.5 | ~7.7 GB | ~99% | Reference quality, benchmarking |
| F16 | 16 | ~14.5 GB | 100% | Training, maximum accuracy |
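The file sizes in the table follow almost directly from parameters × bits-per-weight. A quick estimator — the ~7.24B parameter count below is an assumed figure for a Mistral-7B-class model, not something read from a file:

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Approximate GGUF file size in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8

# Mistral-7B-class model (~7.24B parameters, assumed):
# Q4_K_M at ~4.8 bits/weight -> ~4.3 GB, close to the table's ~4.4 GB
# (the remainder is metadata and tensors kept at higher precision).
```

The same arithmetic tells you whether a given quantization of a larger model will fit in your RAM or VRAM before you download it.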
llama-server exposes an OpenAI-compatible HTTP API. It handles concurrent requests, streaming, and chat completions — compatible with any client that supports the OpenAI API format, including the openai Python SDK.
Basic launch:
```bash
# Basic server on port 8080
# -c: context size in tokens; -np: parallel request slots
llama-server \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096 \
  -np 4
```
With GPU offloading:
```bash
# NVIDIA: offload all layers to GPU
llama-server \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 8192 -np 4 \
  -ngl 99    # number of GPU layers (99 = all)

# Partial offload (model too big for full VRAM)
llama-server -m model.gguf -ngl 24 -c 4096
```
Test with curl:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-cpp",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
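The same request can be made from Python with nothing but the standard library. A sketch — the `model` field is sent for OpenAI compatibility, but llama-server serves whichever model it loaded:

```python
import json
import urllib.request

def build_payload(messages, model="llama-cpp", stream=False):
    """Assemble the OpenAI-style chat completion request body."""
    return {"model": model, "messages": messages, "stream": stream}

def chat(messages, base_url="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires a running llama-server:
# chat([{"role": "user", "content": "Hello!"}])
```

Because the wire format matches OpenAI's, the official openai SDK also works by pointing its base_url at the server.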
| Flag | Short | Default | Description |
|---|---|---|---|
| --model | -m | — | Path to .gguf file (required) |
| --host | — | 127.0.0.1 | Bind address (use 0.0.0.0 for LAN) |
| --port | — | 8080 | HTTP port |
| --ctx-size | -c | 512 | Context window in tokens |
| --n-gpu-layers | -ngl | 0 | Layers to offload to GPU (99 = all) |
| --parallel | -np | 1 | Simultaneous request slots |
| --threads | -t | auto | CPU threads for generation |
| --batch-size | -b | 512 | Prompt processing batch size |
| --flash-attn | -fa | off | Enable FlashAttention (faster, less VRAM) |
| --api-key | — | none | Bearer token for API authentication |
| --log-disable | — | off | Disable verbose startup logs |
llama-cpp-python wraps llama.cpp in Python, supporting the OpenAI SDK interface and direct integration with llama_index and LangChain. It compiles the C++ library at install time, so GPU flags must be passed via CMAKE_ARGS.
```bash
# Install (CPU)
pip install llama-cpp-python

# Install with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Install with Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```
```python
import os
from llama_cpp import Llama

llm = Llama(
    # expanduser: Llama() does not expand "~" itself
    model_path=os.path.expanduser("~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf"),
    n_gpu_layers=-1,   # -1 = all layers on GPU
    n_ctx=4096,        # context window
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain attention in transformers."},
    ]
)
print(response["choices"][0]["message"]["content"])
```
After building with -DGGML_CUDA=ON, run ./build/bin/llama-server --version and check for CUDA in the output. Ensure CUDA Toolkit 12.x is installed (nvcc --version) and that CUDA libraries are in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
GGML_ASSERT crash on load

Usually a corrupt or truncated download. Re-download the GGUF file and verify the size matches the HuggingFace listing. Check the checksum:
sha256sum model.gguf
Compare against the .sha256 value shown on the HuggingFace file page.
Reduce -ngl to offload fewer layers to GPU (e.g., -ngl 24 instead of 99). The remaining layers run on CPU RAM. Alternatively, switch to a more aggressively quantized model (Q4_K_M → Q3_K_M), or reduce -c context size, which also consumes VRAM for the KV cache.
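To see why reducing `-c` helps, note that the KV cache grows linearly with context length. A sketch of the usual formula — the layer/head geometry below is an assumed Llama-3.2-3B-class configuration, not values read from a model file:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size at full context: 2 tensors (K and V) per layer, each
    n_ctx x n_kv_heads x head_dim elements (FP16 = 2 bytes/element by default)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed geometry: 28 layers, 8 KV heads, head_dim 128, context 4096
# -> roughly 0.44 GiB of VRAM on top of the weights; doubling -c doubles it.
gib = kv_cache_bytes(28, 4096, 8, 128) / 1024**3
```

Models using grouped-query attention (fewer KV heads than attention heads) pay proportionally less per token of context.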
AVX2 is required for fast CPU inference. Check: grep avx2 /proc/cpuinfo. If you accidentally built with the CUDA flag but CUDA is misconfigured, the CPU fallback may use a slow non-AVX code path — rebuild without -DGGML_CUDA=ON. Also reduce -np parallel slots on CPU-only machines to avoid memory thrashing.
llama-server: command not found

When compiled from source, binaries live in build/bin/ inside the repo. Run as ./build/bin/llama-server, or add the directory to PATH:
export PATH="$HOME/llama.cpp/build/bin:$PATH"
After brew install llama.cpp, the command is available globally — no PATH changes needed.
Ensure you are using the correct prompt template for the model. Instruct-tuned models require chat format wrappers — <|im_start|> / <|im_end|> (ChatML), [INST] (Mistral), or <|start_header_id|> (Llama 3). The /v1/chat/completions endpoint applies the template automatically from model metadata; the raw /completion endpoint does not.
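When hitting the raw /completion endpoint, the template must be applied by hand. A minimal ChatML renderer as an illustration — real templates live in the model's GGUF metadata, so prefer /v1/chat/completions when possible:

```python
def to_chatml(messages):
    """Render a message list in ChatML (<|im_start|>/<|im_end|>), ending with
    an open assistant turn for the model to complete."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

For Mistral- or Llama-3-family models, substitute their own wrappers ([INST] or <|start_header_id|>); sending the wrong template is a common cause of degraded or rambling output.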