Install, run, and manage open-source language models on your own hardware
Ollama is an open-source tool that makes running large language models locally as simple as ollama run llama3.2. It abstracts away model downloading, quantization format handling, hardware acceleration configuration, and server management — wrapping everything into a single binary with a clean CLI and a REST API that mirrors OpenAI's interface.
ollama pull downloads and manages GGUF-quantized models from the Ollama library. No manual GGUF hunting, no config files. Models are stored in ~/.ollama/models and indexed automatically. Supports llama3.2, mistral, gemma3, phi4, qwen2.5, deepseek-r1, and 100+ others — browse the full library at ollama.com/library.
Ollama runs a local HTTP server on port 11434. It exposes /api/generate and /api/chat (Ollama-native) as well as the OpenAI-compatible /v1/chat/completions and /v1/models endpoints. This means you can use Ollama as a drop-in replacement for the OpenAI SDK by simply setting base_url="http://localhost:11434/v1" — no other code changes needed.
Ollama automatically detects and uses available GPU hardware: Apple Metal on Apple Silicon Macs, NVIDIA CUDA on Linux, and AMD ROCm on Linux. Intel-based Macs, and any system without a compatible GPU, fall back gracefully to CPU inference. GPU layer detection is fully automatic: no environment variables or driver configuration are required for standard setups.
Run the following command to install the Ollama CLI. Note: this installs the CLI only — no background service is started yet.
brew install ollama
Choose how you want to run the server. Option A runs it in the foreground (useful for debugging); Option B registers it as a background service that survives reboots.
# Option A — run in the foreground (Ctrl-C to stop)
ollama serve

# Option B — run as a background launchd service
brew services start ollama
This downloads llama3.2:3b by default (~2 GB, Q4_K_M quantization). Use llama3.2:1b if you have less than 4 GB of RAM.
ollama pull llama3.2
Opens an interactive REPL. Type /bye to exit, or press Ctrl-D.
ollama run llama3.2
The /api/tags endpoint returns a list of downloaded models in JSON format.
curl http://localhost:11434/api/tags
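The response can also be consumed programmatically. A minimal sketch in Python, using a hypothetical sample payload shaped like a real /api/tags response body:

```python
import json

def model_names(tags_body: str) -> list[str]:
    """Extract model names from an /api/tags JSON body."""
    return [m["name"] for m in json.loads(tags_body)["models"]]

# Hypothetical sample shaped like a real /api/tags response
sample = '{"models": [{"name": "llama3.2:latest"}, {"name": "mistral:7b-instruct"}]}'
print(model_names(sample))  # → ['llama3.2:latest', 'mistral:7b-instruct']
```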
Download Ollama.app from ollama.com for a menubar icon, automatic startup on login, and a GUI for managing models. It provides identical CLI functionality to the Homebrew install but is easier for everyday desktop use.
The script installs the ollama binary to /usr/local/bin/ollama, creates a dedicated ollama system user, and registers a systemd service unit automatically.
curl -fsSL https://ollama.com/install.sh | sh
Ensure your NVIDIA driver is installed and working — run nvidia-smi to verify. Ollama auto-detects CUDA; no manual configuration is needed once drivers are present. Install drivers on Ubuntu if missing:
sudo apt install nvidia-driver-535
Set the GFX version override for your GPU generation (adjust 11.0.0 to match your GPU), then install ROCm from AMD's official repository. Ollama auto-detects ROCm if it is present on the system.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
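The export above only lasts for the current shell. When Ollama runs as a systemd service, the variable must be set on the unit instead; a sketch, assuming the standard ollama service name and the override mechanism shown later in this guide:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (created via `sudo systemctl edit ollama`, then `sudo systemctl restart ollama`)
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
```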
Start Ollama immediately and configure it to start automatically on every boot.
sudo systemctl start ollama && sudo systemctl enable ollama
sudo systemctl status ollama
ollama pull llama3.2 && ollama run llama3.2
By default Ollama only listens on 127.0.0.1. To expose it to other devices on your network, create a systemd override:
sudo systemctl edit ollama
Add the following in the editor that opens, then save and restart:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
The named volume ollama persists downloaded models between container restarts.
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
Requires nvidia-container-toolkit installed on the host. Passes all GPUs into the container.
docker run -d \
  --name ollama \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2
The following Compose file runs Ollama alongside Open WebUI — the most popular browser-based Ollama GUI. Access the UI at http://localhost:3000 after starting.
services:
ollama:
image: ollama/ollama
container_name: ollama
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
ports:
- "11434:11434"
# Uncomment for NVIDIA GPU:
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open_webui_data:/app/backend/data
depends_on:
- ollama
volumes:
ollama_data:
open_webui_data:
# List downloaded models
ollama list

# Pull a model (tag optional, defaults to latest)
ollama pull llama3.2
ollama pull llama3.2:1b
ollama pull mistral:7b-instruct
ollama pull gemma3:12b

# Run interactively
ollama run llama3.2

# Run a one-shot prompt (non-interactive)
ollama run llama3.2 "Explain TCP/IP in one paragraph"

# Show model info and parameters
ollama show llama3.2

# Check running models (and VRAM usage)
ollama ps

# Remove a model
ollama rm llama3.2

# Copy / create a variant
ollama cp llama3.2 my-custom-llama
Ollama's REST API is available at http://localhost:11434. The native endpoints use Ollama's own JSON schema, while the /v1/* endpoints are fully OpenAI-compatible — making it easy to swap Ollama into any existing OpenAI-based project.
# Simple generation
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is a transformer model?",
    "stream": false
  }'

# Chat completions (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantization briefly."}
    ]
  }'
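When "stream" is left at its default of true, /api/generate returns newline-delimited JSON chunks, each carrying a fragment of the reply in its response field. A minimal sketch of reassembling the fragments, using hypothetical chunks shaped like a real streaming response:

```python
import json

def join_stream(ndjson_lines) -> str:
    """Concatenate the 'response' fragments from a streamed /api/generate reply."""
    return "".join(json.loads(line)["response"] for line in ndjson_lines if line.strip())

# Hypothetical chunks shaped like a real streaming response
chunks = [
    '{"model":"llama3.2","response":"Hel","done":false}',
    '{"model":"llama3.2","response":"lo!","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(join_stream(chunks))  # → Hello!
```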
Point the OpenAI Python SDK at your local Ollama instance by changing base_url. The api_key value is required by the SDK but ignored by Ollama.
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK, value ignored
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

print(response.choices[0].message.content)
| Model | Tag | Size | Best For |
|---|---|---|---|
| Llama 3.2 3B | llama3.2:3b | ~2 GB | Fast chat, low-RAM systems |
| Llama 3.2 1B | llama3.2:1b | ~0.9 GB | Edge devices, very fast |
| Llama 3.1 8B | llama3.1:8b | ~4.7 GB | General purpose, balanced |
| Llama 3.1 70B | llama3.1:70b | ~40 GB | High capability, needs big GPU |
| Mistral 7B Instruct | mistral:7b-instruct | ~4.1 GB | Instruction following |
| Gemma 3 12B | gemma3:12b | ~8 GB | Google's capable mid-size model |
| Phi-4 | phi4:14b | ~9 GB | Microsoft, strong reasoning |
| Qwen 2.5 Coder | qwen2.5-coder:7b | ~4.7 GB | Code generation/completion |
| DeepSeek-R1 | deepseek-r1:7b | ~5 GB | Chain-of-thought reasoning |
| Nomic Embed | nomic-embed-text | ~274 MB | Text embeddings for RAG |
The Ollama server is not running. Start it with ollama serve (foreground) or brew services start ollama / systemctl start ollama (background). Also check the OLLAMA_HOST environment variable if you have customised the bind address — the port in your request must match.
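A quick programmatic health check can distinguish "server down" from other failures. A sketch using only the Python standard library, assuming the default bind address:

```python
import urllib.request
import urllib.error

def ollama_reachable(base: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers /api/tags at the given address."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_reachable())
```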
GPU was not detected. On Linux, verify nvidia-smi works and CUDA libraries are installed. On macOS, Apple Silicon uses Metal automatically; Intel Macs fall back to CPU. Run ollama ps while the model is loaded to see how many layers are offloaded to the GPU.
The model is too large for available VRAM. Try a smaller model or a lower-bit quantization: ollama pull llama3.2:3b instead of an 8B model, or look for a :q4 tag. Check free VRAM with nvidia-smi before pulling. Models will partially offload to system RAM, but performance degrades significantly.
On CPU, expect 2–10 tokens/sec; a GPU gives 30–100+ tokens/sec. Ensure no other GPU-heavy processes are competing for VRAM, and run ollama ps to verify the GPU is being utilised. On macOS, closing browser tabs with WebGL content can free GPU memory.
If you see ollama: command not found after running the install script, restart your shell or run source ~/.bashrc (or ~/.zshrc). The installer places the binary in /usr/local/bin, and PATH changes only take effect in new shell sessions.
Check the container logs with docker logs ollama. If you passed --gpus=all but nvidia-container-toolkit is not installed on the host, Docker cannot fulfil the GPU request and the container will exit. Remove the flag to run CPU-only, or install the toolkit first.