A local model runner that bet on packaging
Ollama is a Go binary that downloads, runs, and serves open large language models on your own machine behind a single command. Its wager, from the first commit, was that the model was never the hard part. The friction was everything around it: Python environments, weight conversion, quantization choices, GPU glue. Ollama swallows that and hands you a prompt.
Why it caught on
Running models locally keeps your prompts and data on your machine, costs nothing per token, and works offline. Plenty of projects make that claim. Ollama’s edge is the on-ramp: it pulls quantized weights by name, picks defaults for your hardware, and exposes both a CLI and a local REST API on localhost:11434, so the same model backs your terminal and your apps. Since mid-2023 it has gone from a single star to 173,589 (as of 2026-06), and it is still shipping, with v0.30.7 tagged in June 2026. The curve above is sampled, so read it for shape rather than exact dates: a fast early climb that has not flattened.
What it does today
- One command, ollama, that starts you running a model or wires Ollama into a tool you already use.
- A model library you pull by name, with quantization handled for you.
- A local REST API, plus official ollama-python and ollama-js clients.
- Inference rides on the llama.cpp backend, so model coverage tracks what llama.cpp supports.
- MIT licensed, with a busy tracker (3,366 open issues as of 2026-06).
Install
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows (PowerShell):
irm https://ollama.com/install.ps1 | iex
The official ollama/ollama image is on Docker Hub for headless and server use, and each platform has a manual download linked from ollama.com/download.
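Once the install finishes, one way to confirm a server is actually listening is to ask it for its version. The sketch below uses only the Python standard library. It assumes the default address localhost:11434 and an /api/version endpoint that returns JSON with a version key; the helper names are illustrative, not part of any Ollama client.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

DEFAULT_BASE_URL = "http://localhost:11434"  # assumed default address


def version_url(base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the version endpoint URL, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/api/version"


def parse_version(body: bytes) -> Optional[str]:
    """Pull the version string out of a JSON body, or None if it is missing."""
    try:
        data = json.loads(body)
    except ValueError:  # covers malformed JSON and bad encodings
        return None
    if not isinstance(data, dict):
        return None
    version = data.get("version")
    return version if isinstance(version, str) else None


def server_version(base_url: str = DEFAULT_BASE_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the running server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(version_url(base_url), timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    found = server_version()
    print(f"Ollama {found} is running" if found else "No Ollama server reachable")
```

A None result usually means the service has not started yet or is bound to a different address.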
First run
ollama run gemma4
That pulls the model on first use and drops you into a chat. To drive it from code, call the REST API or install a client:
pip install ollama
A minimal Python call once the client is installed:
from ollama import chat
reply = chat(model="gemma4", messages=[{"role": "user", "content": "Why is the sky blue?"}])
print(reply.message.content)
The same request over the REST API, with no client at all:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'
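The request above turns streaming off. With streaming on, the reply arrives in pieces, and a client has to stitch them together. The sketch below assumes the wire format is newline-delimited JSON, where each chunk carries a fragment in message.content and the last one sets done to true. The function name and the sample chunks are illustrative, and the samples stand in for a live server so the code runs offline.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the message.content fragments of a streamed chat reply."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)


# Hypothetical chunks in the assumed format.
sample = [
    b'{"message": {"role": "assistant", "content": "Rayleigh "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "scattering."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(join_stream(sample))  # prints: Rayleigh scattering.
```

Against a real server, the response object returned by urllib.request.urlopen can be passed in directly, since it iterates line by line.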
See ollama.com/library for the models you can pull by name, and the quickstart guide for the rest.
The quiet pivot to launching other tools
A recent change worth noting: typing ollama no longer just lists models, it offers to connect Ollama to tools you already run. ollama launch claude points Claude Code at your local endpoint, and the same pattern covers Codex, Copilot CLI, Droid, and OpenCode. OpenClaw goes further, turning a local model into an assistant reachable from WhatsApp, Telegram, Slack, and Discord. The product is drifting from “a way to run a model” toward “the local backend other agents plug into,” which is a more defensible place to sit as hosted APIs commoditize.
Where it fits, and where it doesn’t
Reach for Ollama when you want models on a laptop or a single box: prototyping, private data, or wiring a local backend into an editor or agent. It is not built for high-throughput, multi-GPU serving, where vLLM gets far more tokens per second out of the same hardware. It also defaults to quantized weights, so evaluation work that needs full precision is something you configure yourself.
Ollama versus other local runtimes
| | Ollama | llama.cpp | vLLM | SGLang | LM Studio |
|---|---|---|---|---|---|
| Stars | 173,589 | 115,604 | 82,233 | 28,904 | closed source |
| Forks | 16,512 | 19,348 | 17,806 | 6,424 | n/a |
| Open issues | 3,366 | 1,763 | 5,269 | 3,811 | n/a |
| Language | Go | C++ | Python | Python | n/a |
| License | MIT | MIT | Apache-2.0 | Apache-2.0 | proprietary |
| Best at | local dev and app backends | embedding in your own binary | high-throughput GPU serving | high-throughput GPU serving | point-and-click chat |
Counts are from GitHub as of June 2026. llama.cpp is the engine Ollama builds on, so the comparison is about how you drive it, not raw speed. For maximum tokens per second on multi-GPU hosts, vLLM and SGLang are the heavier-duty options, both built for serving rather than single-box convenience. LM Studio is a closed-source desktop app with no public repository to measure.
What the issue tracker tells you before you commit
The README is a download page. The open issues are where the trade-offs live, and the most-upvoted ones cluster around a handful of honest limits:
- Hardware acceleration is narrower than “local” implies. Because inference runs through llama.cpp, mainstream NVIDIA and recent AMD and Apple GPUs are well covered, but the loudest open requests are for AMD and Intel NPUs, Intel Arc GPUs, and older AMD cards such as the gfx803 generation. On that hardware you can quietly fall back to CPU, which changes the experience completely. Check your GPU against the list of supported hardware before assuming acceleration.
- The default context window is modest, and it truncates silently. Managing context length per model is a recurring request on the tracker. If you paste a long prompt and the model seems to forget the start, this is usually why. You raise it per model with a num_ctx parameter in a Modelfile, or per request through the API’s options field; recent releases also read a server-wide default from the OLLAMA_CONTEXT_LENGTH environment variable.
- It makes itself at home. One of the most-reacted issues asks it to stop placing models and config under your home directory, a real problem on machines with small system drives. Another asks for a way to disable the background service it sets up to start at login. Both are configurable, neither is the default.
- Some capabilities are requests, not features. Model Context Protocol support, reranking models, image generation, and a Prometheus /metrics endpoint are all near the top of the open issues rather than shipped.
- Pulling models is all-or-nothing on bandwidth. There is no built-in throttle for the multi-gigabyte downloads, which bites on metered or shared connections.
None of these are disqualifying. They are the gap between the marketing surface and the first week of actually living with it.
Related
If you train or fine-tune in a framework like TensorFlow, Ollama is one way to serve the open models you land on. It also pairs with AI-native editors that point their assistant at a local Ollama endpoint instead of a hosted API, an approach the now-archived Void helped popularize. For the wider picture, see what else is moving in LLM tooling.
FAQ
Is Ollama free? Yes. It is MIT-licensed and runs locally with no per-token cost.
Does Ollama use my GPU? On mainstream NVIDIA and recent AMD and Apple GPUs, yes. On NPUs, Intel Arc, and some older cards it may fall back to CPU, and those are among the most requested open issues.
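A rough way to tell whether a loaded model landed on the GPU is to compare how many of its bytes sit in video memory. The sketch below assumes the server's /api/ps endpoint lists loaded models with size and size_vram byte counts. The sample payload is made up for illustration, and the function names are hypothetical.

```python
def gpu_fraction(model: dict) -> float:
    """Share of a loaded model's bytes resident in GPU memory, from 0.0 to 1.0."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model.get("size_vram", 0) / size, 1.0)


def describe(model: dict) -> str:
    """Summarize where a loaded model is running."""
    frac = gpu_fraction(model)
    if frac == 0.0:
        where = "CPU only"
    elif frac == 1.0:
        where = "fully on GPU"
    else:
        where = f"{frac:.0%} on GPU, rest on CPU"
    return f"{model.get('name', '?')}: {where}"


# Hypothetical payload in the assumed /api/ps shape.
sample = {"models": [
    {"name": "gemma4", "size": 8_000_000_000, "size_vram": 8_000_000_000},
    {"name": "big-model", "size": 40_000_000_000, "size_vram": 10_000_000_000},
    {"name": "cpu-model", "size": 4_000_000_000, "size_vram": 0},
]}
for entry in sample["models"]:
    print(describe(entry))
```

A partial split is worth noticing as much as a full CPU fallback, since layers left on the CPU tend to dominate latency. The ollama ps command reports a similar processor split from the terminal.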
Why does the model seem to forget long prompts? The default context window is limited and truncates silently. Raise it for a given model through its Modelfile parameters rather than expecting a long prompt to fit by default.
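As a sketch of what raising the limit can look like, the helpers below render a two-line Modelfile and, as an alternative, build a request body that asks for a larger window on a single call. Both assume the setting is named num_ctx and that the chat endpoint accepts it under an options field. The function names and the 16384 value are illustrative.

```python
import json


def modelfile_with_context(base_model: str, num_ctx: int) -> str:
    """Render a Modelfile that derives from base_model with a larger context."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def chat_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build an /api/chat body that requests num_ctx for this call only."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # assumed per-request override
    }


print(modelfile_with_context("gemma4", 16384), end="")
print(json.dumps(chat_payload("gemma4", "Summarize this report.", 16384), indent=2))
```

The Modelfile text can be saved to disk and registered under a new name with ollama create, for example ollama create gemma4-16k -f Modelfile, where gemma4-16k is a made-up name. A larger window costs more memory, so it is worth raising in steps rather than jumping to the maximum.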
Where does Ollama store models? Under your home directory by default, which is a common complaint on machines with small system drives. The location can be reconfigured.
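To see where models will land before a multi-gigabyte pull, a script can mirror the lookup. The sketch below assumes the override is the OLLAMA_MODELS environment variable and that the fallback is .ollama/models under the home directory. The function name is hypothetical, and Linux installs that run the server as a dedicated service user may resolve a different home.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def models_dir(env: Optional[Mapping[str, str]] = None,
               home: Optional[Path] = None) -> Path:
    """Resolve the directory models are stored in, under the stated assumptions."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    override = env.get("OLLAMA_MODELS", "").strip()
    if override:
        return Path(override).expanduser()  # explicit override wins
    return home / ".ollama" / "models"  # assumed default location


print(models_dir())
```

The override only takes effect if it is set in the environment of the server process, not just in the shell that runs the client.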