llama.cpp — Compile-Once CPU/GPU Inference

Run quantized LLMs on CPU or GPU with a single self-contained binary — no Python runtime required

Local LLM Inference Series
⏱ 14 min read 📊 Advanced 🗓 Updated Jan 2025

What is llama.cpp?

llama.cpp is a high-performance LLM inference engine written in pure C/C++. Originally created by Georgi Gerganov as a port of Meta's LLaMA inference code, it has grown into a full inference ecosystem supporting 50+ model architectures, multiple hardware backends, and the GGUF quantization format. Its defining advantage: a single compiled binary with zero runtime dependencies can load and run billion-parameter models on commodity hardware.

Pure C/C++ Implementation

Now a complete inference engine rather than a simple port, llama.cpp supports 50+ model architectures, including Llama 3, Mistral, Gemma, Phi, Qwen, and DeepSeek. The binary has zero Python dependencies and runs on any platform that can compile C++.

A single executable handles the entire inference pipeline: model loading, quantization, KV-cache management, context windowing, and sampling strategies (greedy, top-k, top-p, Mirostat). No virtual environment, no pip, no runtime to maintain.
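To make the sampling strategies concrete, here is a minimal Python sketch of combined top-k/top-p filtering over a toy logit dictionary. It illustrates the idea only, not llama.cpp's actual C++ implementation (greedy and Mirostat are omitted for brevity):

```python
import math
import random

def top_k_top_p_sample(logits, k=40, p=0.9, rng=None):
    """Sample one token from {token: logit} after top-k then top-p filtering."""
    rng = rng or random.Random(0)
    # Softmax over the raw logits.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    # Renormalize the survivors and draw one token.
    total = sum(pr for _, pr in kept)
    r, acc = rng.random() * total, 0.0
    for tok, pr in kept:
        acc += pr
        if acc >= r:
            return tok
    return kept[-1][0]
```

With a sharply peaked distribution, both filters collapse to the single most likely token, which is why low top-p values behave almost greedily.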

GGUF Model Format

llama.cpp defined the GGUF format (successor to GGML), now the de-facto standard for quantized LLM distribution. Models are single files containing weights, tokenizer vocabulary, and all necessary metadata — everything needed for inference is self-contained.

Models are available in quantization levels from Q2_K (smallest, most quality loss) through Q8_0 (near-lossless, ~99% of FP16 quality). The most popular quantized GGUF builds are hosted on HuggingFace by the bartowski and ggml-org accounts, covering nearly every major open-weight model.
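The self-contained layout begins with a small binary header: the 4-byte magic "GGUF" followed by a little-endian uint32 format version. A quick stdlib sketch to sanity-check a downloaded file:

```python
import struct

def read_gguf_version(path):
    """Return the GGUF format version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
    return version
```

Recent files are typically version 3; a wrong magic usually means a truncated download or an HTML error page saved under a .gguf name.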

Hardware Flexibility

CPU inference uses AVX2/AVX-512 SIMD optimization for maximum throughput on x86 hardware. GPU backends include CUDA (NVIDIA), Metal (Apple Silicon), ROCm/HIPBlas (AMD), and Vulkan (cross-platform, any GPU).

A particularly powerful feature is partial GPU offloading: the -ngl flag sets how many transformer layers run on the GPU, while the rest execute on the CPU from system RAM. This lets you split a model that doesn't fully fit in VRAM: offloaded layers run at full GPU speed, and only the overflow runs at CPU speed, which makes 13B+ models practical on a consumer GPU.
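Since transformer layers are roughly equal in size, a back-of-envelope -ngl value can be derived from the GGUF file size. A hedged sketch (plan_offload and its VRAM reserve are illustrative guesses, not an official llama.cpp heuristic):

```python
def plan_offload(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, reserving headroom for KV cache etc."""
    per_layer_gb = model_gb / n_layers      # layers are approximately equal in size
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))
```

For a 7.7 GB Q8_0 file with 32 layers on an 8 GB card this suggests -ngl 27; start there and nudge down if you still hit OOM.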

Installation

llama.cpp can be installed via package managers or compiled from source. Compiling from source is required for cutting-edge model support and GPU backends on Linux. On macOS, Homebrew provides a convenient pre-built option with Metal acceleration.

1

Install via Homebrew (easiest)

The recommended path for most macOS users. Metal GPU acceleration is enabled automatically on Apple Silicon.

brew install llama.cpp

This installs llama-server, llama-cli, and all tools to /opt/homebrew/bin/ (Apple Silicon) or /usr/local/bin/ (Intel). No PATH changes needed.

2

Alternative — compile from source (latest features)

Needed for the very latest model support or experimental features not yet in the Homebrew formula.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

Binaries will be in build/bin/. Metal (-DGGML_METAL=ON) enables GPU acceleration on Apple Silicon.

3

Verify the installation

llama-server --version

You should see the version string and the list of enabled backends (Metal will appear for Apple Silicon Homebrew installs).

Note on Homebrew vs. Source

The Homebrew build is usually a few commits behind the main branch. Compile from source if you need the latest model architectures or experimental sampling features.

1

Install build dependencies

sudo apt update && sudo apt install -y \
  build-essential cmake git curl \
  libcurl4-openssl-dev
2

Clone the repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
3

Build — choose your backend

CPU-only build:

cmake -B build
cmake --build build --config Release -j$(nproc)

NVIDIA CUDA build (requires CUDA Toolkit 12.x):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Verify CUDA: ./build/bin/llama-server --version should show CUDA in the backend list.

AMD ROCm build:

cmake -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS="gfx1100"
cmake --build build --config Release -j$(nproc)

Replace gfx1100 with your GPU's target: gfx906 for Vega, gfx1030 for RX 6000 series.

4

Add to PATH (optional)

echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

Without this, invoke binaries as ./build/bin/llama-server from inside the repo directory.

1

Pull the official server image

docker pull ghcr.io/ggerganov/llama.cpp:server
2

Create a models directory

mkdir -p ~/llama-models
3

Download a GGUF model

Download a .gguf file into ~/llama-models before running the container. See the "Downloading GGUF Models" section below for instructions.

4

Run the server (CPU)

docker run -p 8080:8080 \
  -v ~/llama-models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096 -np 4
5

NVIDIA GPU (add --gpus all and use the CUDA image tag)

docker run --gpus all \
  -p 8080:8080 \
  -v ~/llama-models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096 -ngl 99 -np 4

The :server-cuda tag is pre-built with CUDA support. -ngl 99 offloads all layers to GPU.

Downloading GGUF Models

GGUF models are hosted on HuggingFace. The bartowski user maintains high-quality quantized builds of most popular open-weight models. Install the HuggingFace CLI to download specific quantization files without pulling the entire repository:

# Install HuggingFace CLI
pip install huggingface_hub

# Download a specific quantization (Q4_K_M recommended for balance)
huggingface-cli download \
  bartowski/Llama-3.2-3B-Instruct-GGUF \
  --include "Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/llama-models

# Download Mistral 7B Q5_K_M
huggingface-cli download \
  bartowski/Mistral-7B-Instruct-v0.3-GGUF \
  --include "Mistral-7B-Instruct-v0.3-Q5_K_M.gguf" \
  --local-dir ~/llama-models

Or download directly via wget using the URL from the HuggingFace file page:

# Download directly via wget (get the URL from the HuggingFace file page)
wget -P ~/llama-models \
  "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"

Quantization Reference

The suffix in a GGUF filename indicates quantization level. Lower quantization = smaller file + faster inference + lower quality. Q4_K_M is the recommended default for most use cases, offering a practical balance of size, speed, and output quality.

Quantization   Bits/weight   7B Model Size   Quality vs FP16   Best For
Q2_K           ~2.6          ~2.8 GB         ~75%              Extreme RAM constraints
Q3_K_M         ~3.4          ~3.3 GB         ~83%              Very constrained RAM
Q4_K_M         ~4.8          ~4.4 GB         ~90%              Recommended default
Q5_K_M         ~5.7          ~5.2 GB         ~94%              High quality, moderate RAM
Q6_K           ~6.6          ~6.0 GB         ~97%              Near-lossless, more VRAM
Q8_0           ~8.5          ~7.7 GB         ~99%              Reference quality, benchmarking
F16            16            ~14.5 GB        100%              Training, maximum accuracy
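The size column follows directly from bits per weight: bytes ≈ parameters × bpw / 8, plus a small metadata overhead. Note that "7B" checkpoints actually contain roughly 7.2B parameters, which is why the table's figures run a little above the naive estimate. A tiny sketch:

```python
def gguf_size_gb(params_billions, bits_per_weight):
    """Approximate GGUF file size in GB, ignoring metadata overhead."""
    return params_billions * bits_per_weight / 8
```

With 7.24B parameters, Q4_K_M at ~4.8 bpw comes out near 4.3 GB and F16 near 14.5 GB, matching the table.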

Running the Server

llama-server exposes an OpenAI-compatible HTTP API. It handles concurrent requests, streaming, and chat completions — compatible with any client that supports the OpenAI API format, including the openai Python SDK.

Basic launch:

# Basic server on port 8080
llama-server \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096 \
  -np 4
# -c sets the context size in tokens; -np sets parallel request slots.
# (Comments can't precede a trailing backslash inside a continued command.)

With GPU offloading:

# NVIDIA: offload all layers to GPU
llama-server \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 8192 -np 4 \
  -ngl 99        # number of GPU layers (99 = all)

# Partial offload (model too big for full VRAM)
llama-server -m model.gguf -ngl 24 -c 4096

Test with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-cpp",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
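Because the server speaks the OpenAI wire format, the Python standard library alone is enough to call it. A minimal sketch (assumes a server already running on localhost:8080; build_chat_payload is a local helper, not part of llama.cpp):

```python
import json
import urllib.request

def build_chat_payload(prompt):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "llama-cpp",  # llama-server serves whatever GGUF it loaded
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    """POST the payload to a running llama-server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server:
# print(chat("Hello!"))
```

The same endpoint also works with the official openai SDK by pointing base_url at http://localhost:8080/v1.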

Key Server Flags

Flag             Short   Default     Description
--model          -m                  Path to .gguf file (required)
--host                   127.0.0.1   Bind address (use 0.0.0.0 for LAN)
--port                   8080        HTTP port
--ctx-size       -c      512         Context window in tokens
--n-gpu-layers   -ngl    0           Layers to offload to GPU (99 = all)
--parallel       -np     1           Simultaneous request slots
--threads        -t      auto        CPU threads for generation
--batch-size     -b      512         Prompt processing batch size
--flash-attn     -fa     off         Enable FlashAttention (faster, less VRAM)
--api-key                none        Bearer token for API authentication
--log-disable            off         Disable verbose startup logs

Python Bindings

llama-cpp-python wraps llama.cpp in Python, supporting the OpenAI SDK interface and direct integration with llama_index and LangChain. It compiles the C++ library at install time, so GPU flags must be passed via CMAKE_ARGS.

# Install (CPU)
pip install llama-cpp-python

# Install with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Install with Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

import os

from llama_cpp import Llama

llm = Llama(
    # Llama() does not expand "~" itself, so expand it explicitly
    model_path=os.path.expanduser("~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf"),
    n_gpu_layers=-1,   # -1 = all layers on GPU
    n_ctx=4096,        # context window
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain attention in transformers."},
    ]
)
print(response["choices"][0]["message"]["content"])

Troubleshooting

CUDA not detected

After building with -DGGML_CUDA=ON, run ./build/bin/llama-server --version and check for CUDA in the output. Ensure CUDA Toolkit 12.x is installed (nvcc --version) and that CUDA libraries are in LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

GGML_ASSERT crash on load

Usually a corrupt or truncated download. Re-download the GGUF file and verify the size matches the HuggingFace listing. Check the checksum:

sha256sum model.gguf

Compare against the .sha256 value shown on the HuggingFace file page.
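Hashing a multi-gigabyte file in chunks keeps memory use flat; a small stdlib equivalent of sha256sum:

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```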

Out of memory (OOM) with GPU

Reduce -ngl to offload fewer layers to GPU (e.g., -ngl 24 instead of 99). The remaining layers run on CPU RAM. Alternatively, switch to a more aggressively quantized model (Q4_K_M → Q3_K_M), or reduce -c context size, which also consumes VRAM for the KV cache.
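The KV-cache cost scales linearly with context: each layer stores a K and a V tensor of n_ctx × n_kv_heads × head_dim elements. A sketch with illustrative Llama-3-8B-style dimensions (assumed for the example, not read from a model):

```python
def kv_cache_gib(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """FP16 KV-cache size in GiB: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 2**30
```

At 32 layers, 8 KV heads, and head_dim 128, an 8192-token context costs about 1 GiB of VRAM, so halving -c frees roughly half a GiB.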

Very slow CPU inference

AVX2 is required for fast CPU inference. Check: grep avx2 /proc/cpuinfo. If you accidentally built with the CUDA flag but CUDA is misconfigured, the CPU fallback may use a slow non-AVX code path — rebuild without -DGGML_CUDA=ON. Also reduce -np parallel slots on CPU-only machines to avoid memory thrashing.

llama-server: command not found

When compiled from source, binaries live in build/bin/ inside the repo. Run as ./build/bin/llama-server, or add the directory to PATH:

export PATH="$HOME/llama.cpp/build/bin:$PATH"

After brew install llama.cpp, the command is available globally — no PATH changes needed.

Model produces garbage output

Ensure you are using the correct prompt template for the model. Instruct-tuned models require chat format wrappers — <|im_start|> / <|im_end|> (ChatML), [INST] (Mistral), or <|start_header_id|> (Llama 3). The /v1/chat/completions endpoint applies the template automatically from model metadata; the raw /completion endpoint does not.
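For the raw /completion endpoint, the template must be applied by hand. A sketch for the Llama 3 chat format (the special-token layout below is my rendering of the published template; verify it against your model's metadata before relying on it):

```python
def llama3_prompt(user, system="You are a helpful assistant."):
    """Wrap one system+user turn in Llama 3 header tokens, ending at the assistant turn."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```

Sending the result to /completion should then produce coherent output where the bare user text produced garbage.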