Ollama — Local LLM Serving

Install, run, and manage open-source language models on your own hardware

Local LLM Inference Series
⏱ 12 min read 📊 Intermediate 🗓 Updated Jan 2025

What is Ollama?

Ollama is an open-source tool that makes running large language models locally as simple as ollama run llama3.2. It abstracts away model downloading, quantization format handling, hardware acceleration configuration, and server management — wrapping everything into a single binary with a clean CLI and a REST API that mirrors OpenAI's interface.

Zero-Config Model Management

ollama pull downloads and manages GGUF-quantized models from the Ollama library. No manual GGUF hunting, no config files. Models are stored in ~/.ollama/models and indexed automatically. Supports llama3.2, mistral, gemma3, phi4, qwen2.5, deepseek-r1, and 100+ others — browse the full library at ollama.com/library.
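The store under ~/.ollama/models can be inspected directly. A minimal sketch, assuming the manifest layout <registry>/<namespace>/<model>/<tag> (an internal detail that may change between Ollama versions):

```python
from pathlib import Path

def list_local_models(manifests: Path = Path.home() / ".ollama/models/manifests") -> list[str]:
    """Walk the manifest tree and return model:tag names.

    Assumes the <registry>/<namespace>/<model>/<tag> layout; returns []
    if nothing has been pulled yet (or the directory does not exist).
    """
    if not manifests.is_dir():
        return []
    names = []
    for tag_file in manifests.glob("*/*/*/*"):  # registry/namespace/model/tag
        if tag_file.is_file():
            model, tag = tag_file.parent.name, tag_file.name
            names.append(f"{model}:{tag}")
    return sorted(names)

print(list_local_models())
```

In everyday use, ollama list gives the same information; this is only useful for scripting against the store directly.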

OpenAI-Compatible REST API

Ollama runs a local HTTP server on port 11434. It exposes /api/generate and /api/chat (Ollama-native) as well as the OpenAI-compatible /v1/chat/completions and /v1/models endpoints. This means you can use Ollama as a drop-in replacement for the OpenAI SDK by setting base_url="http://localhost:11434/v1" and supplying any placeholder API key; no other code changes are needed.

Hardware Acceleration

Ollama automatically detects and uses available GPU hardware: Apple Metal on Apple Silicon Macs (Intel Macs fall back to CPU), NVIDIA CUDA on Linux, and AMD ROCm on Linux. If no compatible GPU is found it falls back gracefully to CPU inference. GPU layer detection is fully automatic: no environment variables or driver configuration are required for standard setups.

Installation

macOS (Homebrew)

1. Install via Homebrew

Run the following command to install the Ollama CLI. Note: this installs the CLI only — no background service is started yet.

brew install ollama

2. Start the Ollama server

Choose how you want to run the server. Option A runs it in the foreground (useful for debugging); Option B registers it as a launchd background service that starts automatically at login.

# Option A — run in the foreground (Ctrl-C to stop)
ollama serve

# Option B — run as a background launchd service
brew services start ollama

3. Pull your first model

This downloads llama3.2:3b by default (~2 GB, Q4_K_M quantization). Use llama3.2:1b if you have less than 4 GB of RAM.

ollama pull llama3.2

4. Chat interactively

Opens an interactive REPL. Type /bye to exit, or press Ctrl-D.

ollama run llama3.2

5. Verify the API is responding

The /api/tags endpoint returns a list of downloaded models in JSON format.

curl http://localhost:11434/api/tags
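The /api/tags response can also be post-processed in code. A sketch that pulls out just the model names, assuming the documented response shape {"models": [{"name": ...}, ...]} (the illustrative sample below is abbreviated; real responses carry extra fields such as size and digest):

```python
import json

def model_names(tags_json: str) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Illustrative, abbreviated response body
sample = '{"models": [{"name": "llama3.2:latest"}]}'
print(model_names(sample))  # ['llama3.2:latest']
```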

Tip: Ollama Desktop App

Download Ollama.app from ollama.com for a menubar icon, automatic startup at login, and a GUI for managing models. It bundles the same CLI as the Homebrew install and is more convenient for everyday desktop use.

Linux

1. Run the official install script

The script installs the ollama binary to /usr/local/bin/ollama, creates a dedicated ollama system user, and registers a systemd service unit automatically.

curl -fsSL https://ollama.com/install.sh | sh

2. NVIDIA GPU setup (optional)

Ensure your NVIDIA driver is installed and working — run nvidia-smi to verify. Ollama auto-detects CUDA; no manual configuration is needed once drivers are present. Install drivers on Ubuntu if missing:

sudo apt install nvidia-driver-535

3. AMD GPU / ROCm setup (optional)

Set the GFX version override for your GPU generation (adjust 11.0.0 to match your GPU), then install ROCm from AMD's official repository. Ollama auto-detects ROCm if it is present on the system.

export HSA_OVERRIDE_GFX_VERSION=11.0.0

4. Start and enable the systemd service

Start Ollama immediately and configure it to start automatically on every boot.

sudo systemctl start ollama && sudo systemctl enable ollama

5. Check service status

sudo systemctl status ollama

6. Pull and run a model

ollama pull llama3.2 && ollama run llama3.2

Tip: Expose Ollama to Your LAN

By default Ollama only listens on 127.0.0.1. To expose it to other devices on your network, create a systemd override:

sudo systemctl edit ollama

Add the following in the editor that opens, then save and restart:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then apply the change:

sudo systemctl restart ollama
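From another machine you can verify reachability with curl http://<server-ip>:11434/api/tags, or with a small TCP probe. A sketch (the 192.168.1.50 address is a placeholder for your server's LAN IP):

```python
import socket

def ollama_reachable(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(ollama_reachable("192.168.1.50"))  # placeholder address; substitute your server's
```

A False result usually means the override was not applied (check systemctl show ollama for the environment) or a firewall is blocking port 11434.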

Docker

1. CPU-only container

The named volume ollama persists downloaded models between container restarts.

docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

2. NVIDIA GPU container

Requires nvidia-container-toolkit installed on the host. Passes all GPUs into the container.

docker run -d \
  --name ollama \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

3. Pull a model into the container

docker exec -it ollama ollama pull llama3.2

4. Run interactively inside the container

docker exec -it ollama ollama run llama3.2

5. Docker Compose with Open WebUI

The following Compose file runs Ollama alongside Open WebUI, a popular browser-based GUI for Ollama. Access the UI at http://localhost:3000 after starting.

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    # Uncomment for NVIDIA GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:

Essential Commands

# List downloaded models
ollama list

# Pull a model (tag optional, defaults to latest)
ollama pull llama3.2
ollama pull llama3.2:1b
ollama pull mistral:7b-instruct
ollama pull gemma3:12b

# Run interactively
ollama run llama3.2

# Run a single prompt non-interactively
ollama run llama3.2 "Explain TCP/IP in one paragraph"

# Show model info and parameters
ollama show llama3.2

# Check running models (and VRAM usage)
ollama ps

# Remove a model
ollama rm llama3.2

# Copy / create a variant
ollama cp llama3.2 my-custom-llama
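Beyond ollama cp, you can bake a system prompt and sampling parameters into a named variant with a Modelfile and ollama create. A minimal example (the base model and parameter values here are illustrative, not recommendations):

```
# Modelfile
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM "You are a terse assistant that answers in one sentence."
```

Build and run it with ollama create terse-llama -f Modelfile followed by ollama run terse-llama.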

REST API

Ollama's REST API is available at http://localhost:11434. The native endpoints use Ollama's own JSON schema, while the /v1/* endpoints are fully OpenAI-compatible — making it easy to swap Ollama into any existing OpenAI-based project.

# Simple generation
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is a transformer model?",
    "stream": false
  }'

# Chat completions (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantization briefly."}
    ]
  }'
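Both curl examples above disable streaming. With "stream": true (the default for /api/generate), the server instead emits newline-delimited JSON, one object per token chunk, each carrying a response fragment and a done flag. A sketch of client-side accumulation (the sample chunks are illustrative, in the shape the API streams them):

```python
import json

def accumulate_stream(lines):
    """Concatenate 'response' fragments from an NDJSON /api/generate stream."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk also carries timing stats
            break
    return "".join(text)

# Illustrative chunks; a real client would iterate over the HTTP response lines
stream = [
    '{"model":"llama3.2","response":"Hello","done":false}',
    '{"model":"llama3.2","response":" world","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(accumulate_stream(stream))  # Hello world
```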

Python SDK Example

Point the OpenAI Python SDK at your local Ollama instance by changing base_url. The api_key value is required by the SDK but ignored by Ollama.

from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK, value ignored
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
print(response.choices[0].message.content)
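For multi-turn chat you maintain the messages list yourself: append each assistant reply to it and resend the whole history on the next call. Local models have limited context windows, so long sessions need pruning; here is a sketch of a hypothetical helper (not part of any SDK) that drops the oldest non-system turns to fit a rough character budget:

```python
def trim_history(messages, max_chars=8000):
    """Drop oldest non-system turns until total content fits max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in system + rest) > max_chars:
        rest.pop(0)  # discard the oldest turn first
    return system + rest

history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 9000},   # an oversized old turn
    {"role": "user", "content": "Latest question"},
]
print([m["content"][:20] for m in trim_history(history)])
```

A character budget is a crude proxy for tokens; a real client would count tokens, but the shape of the loop is the same.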

Popular Models

Model | Tag | Size | Best For
Llama 3.2 3B | llama3.2:3b | ~2 GB | Fast chat, low-RAM systems
Llama 3.2 1B | llama3.2:1b | ~0.9 GB | Edge devices, very fast
Llama 3.1 8B | llama3.1:8b | ~4.7 GB | General purpose, balanced
Llama 3.1 70B | llama3.1:70b | ~40 GB | High capability, needs big GPU
Mistral 7B Instruct | mistral:7b-instruct | ~4.1 GB | Instruction following
Gemma 3 12B | gemma3:12b | ~8 GB | Google's capable mid-size model
Phi-4 | phi4:14b | ~9 GB | Microsoft, strong reasoning
Qwen 2.5 Coder | qwen2.5-coder:7b | ~4.7 GB | Code generation/completion
DeepSeek-R1 | deepseek-r1:7b | ~5 GB | Chain-of-thought reasoning
Nomic Embed | nomic-embed-text | ~274 MB | Text embeddings for RAG

Troubleshooting

"connection refused" on port 11434

The Ollama server is not running. Start it with ollama serve (foreground) or brew services start ollama / systemctl start ollama (background). Also check the OLLAMA_HOST environment variable if you have customised the bind address — the port in your request must match.

Model runs on CPU instead of GPU

GPU was not detected. On Linux, verify nvidia-smi works and CUDA libraries are installed. On macOS, Apple Silicon uses Metal automatically; Intel Macs fall back to CPU. Run ollama ps while the model is loaded to see how many layers are offloaded to the GPU.

Out of memory / model doesn't load

The model is too large for available VRAM. Try a smaller quantization: ollama pull llama3.2:3b instead of 8B, or look for a :q4 tag. Check free VRAM with nvidia-smi before pulling. Models will partially offload to RAM but performance degrades significantly.
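As a rough rule of thumb (an assumption for back-of-envelope sizing, not an Ollama formula): a quantized model needs about params × bits/8 bytes for weights, plus roughly 10–20% overhead for the KV cache and runtime buffers at modest context lengths:

```python
def approx_model_gb(params_billion: float, bits: int = 4, overhead: float = 1.15) -> float:
    """Rough VRAM estimate: quantized weights plus ~15% runtime overhead."""
    weight_gb = params_billion * bits / 8  # e.g. 8B at 4-bit -> ~4 GB of weights
    return round(weight_gb * overhead, 1)

print(approx_model_gb(8))   # roughly matches the ~4.7 GB llama3.1:8b download
print(approx_model_gb(70))  # ~40 GB, in line with the llama3.1:70b row above
```

Long context windows inflate the KV cache well beyond this estimate, so leave headroom.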

Slow generation speed

On CPU, expect 2–10 tokens/sec; a GPU typically gives 30–100+. Ensure no other GPU-heavy processes are competing for VRAM, and use ollama ps to verify the GPU is being utilised. On macOS, closing browser tabs with WebGL content can free GPU memory.

ollama: command not found after install script

Restart your shell or open a new terminal session. The installer places the binary in /usr/local/bin, which is normally already on PATH; if your distribution omits it, add it in your shell profile (~/.bashrc or ~/.zshrc).

Docker container exits immediately

Check the container logs with docker logs ollama. If you passed --gpus=all but nvidia-container-toolkit is not installed on the host, Docker cannot fulfil the GPU request and the container will exit. Remove the flag to run CPU-only, or install the toolkit first.