ollama/ollama: run open LLMs locally, and the trade-offs the README skips

A local model runner that bet on packaging

Ollama is a Go binary that downloads, runs, and serves open large language models on your own machine behind a single command. Its wager, from the first commit, was that the model was never the hard part. The friction was everything around it: Python environments, weight conversion, quantization choices, GPU glue. Ollama swallows that and hands you a prompt.

Why it caught on

Running models locally keeps your prompts and data on your machine, costs nothing per token, and works offline. Plenty of projects make that claim. Ollama’s edge is the on-ramp: it pulls quantized weights by name, picks defaults for your hardware, and exposes both a CLI and a local REST API on localhost:11434, so the same model backs your terminal and your apps. Since mid-2023 it has gone from a single star to 173,589 (as of 2026-06), and it is still shipping, with v0.30.7 tagged in June 2026. The curve above is sampled, so read it for shape rather than exact dates: a fast early climb that has not flattened.

What it does today

One command, ollama, that either starts a model or wires Ollama into a tool you already use.
A model library you pull by name, with quantization handled for you.
A local REST API, plus official ollama-python and ollama-js clients.
Inference rides on the llama.cpp backend, so model coverage tracks what llama.cpp supports.
MIT licensed, with a busy tracker (3,366 open issues as of 2026-06).

Install

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows (PowerShell):

irm https://ollama.com/install.ps1 | iex

The official ollama/ollama image is on Docker Hub for headless and server use, and each platform has a manual download linked from ollama.com/download.

First run

ollama run gemma4

That pulls the model on first use and drops you into a chat. To drive it from code, call the REST API or install a client:

pip install ollama

A minimal Python call once the client is installed:

from ollama import chat

reply = chat(model="gemma4", messages=[{"role": "user", "content": "Why is the sky blue?"}])
print(reply.message.content)

The same request over the REST API, with no client at all:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

See ollama.com/library for the models you can pull by name, and the quickstart guide for the rest.
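The curl request above sets "stream": false. Leave that field out and the API streams its answer as newline-delimited JSON, one object per chunk. Here is a minimal sketch of both modes using only Python's standard library. The helper names (build_chat_request, collect_stream, ask) are made up for illustration, and the chunk shape (a message.content string plus a done flag) is assumed from the API's streaming format.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint from the text


def build_chat_request(model, prompt, stream=True):
    """Build (but do not send) a POST request for /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_stream(lines):
    """Join the message.content pieces from an iterable of NDJSON lines."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore empty lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # assumed: the final object carries done=true
    return "".join(parts)


def ask(model, prompt):
    """Send a streaming request; needs a running server and a pulled model."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return collect_stream(resp)  # the response object iterates line by line
```

Calling ask("gemma4", "Why is the sky blue?") against a running server should return the whole reply as one string. In an interactive tool you would print each piece as it arrives instead of joining them at the end.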

The quiet pivot to launching other tools

A recent change worth noting: typing ollama no longer just lists models, it offers to connect Ollama to tools you already run. ollama launch claude points Claude Code at your local endpoint, and the same pattern covers Codex, Copilot CLI, Droid, and OpenCode. OpenClaw goes further, turning a local model into an assistant reachable from WhatsApp, Telegram, Slack, and Discord. The product is drifting from “a way to run a model” toward “the local backend other agents plug into,” which is a more defensible place to sit as hosted APIs commoditize.

Where it fits, and where it doesn’t

Reach for Ollama when you want models on a laptop or a single box: prototyping, private data, or wiring a local backend into an editor or agent. It is not built for high-throughput, multi-GPU serving, where vLLM gets far more tokens per second out of the same hardware. It also defaults to quantized weights, so evaluation work that needs full precision is something you configure yourself.

Ollama versus other local runtimes

	Ollama	llama.cpp	vLLM	SGLang	LM Studio
Stars	173,589	115,604	82,233	28,904	closed source
Forks	16,512	19,348	17,806	6,424	n/a
Open issues	3,366	1,763	5,269	3,811	n/a
Language	Go	C++	Python	Python	n/a
License	MIT	MIT	Apache-2.0	Apache-2.0	proprietary
Best at	local dev and app backends	embedding in your own binary	high-throughput GPU serving	high-throughput GPU serving	point-and-click chat

Counts are from GitHub as of June 2026. llama.cpp is the engine Ollama builds on, so the comparison is about how you drive it, not raw speed. For maximum tokens per second on multi-GPU hosts, vLLM and SGLang are the heavier-duty options, both built for serving rather than single-box convenience. LM Studio is a closed-source desktop app with no public repository to measure.

What the issue tracker tells you before you commit

The README is a download page. The open issues are where the trade-offs live, and the most-upvoted ones cluster around a handful of honest limits:

Hardware acceleration is narrower than “local” implies. Because inference runs through llama.cpp, mainstream NVIDIA and recent AMD and Apple GPUs are well covered, but the loudest open requests are for AMD and Intel NPUs, Intel Arc GPUs, and older AMD cards such as the gfx803 generation. On that hardware you can quietly fall back to CPU, which changes the experience completely. Check that your GPU is supported before assuming acceleration.
The default context window is modest, and it truncates silently. Managing context length per model is a recurring request on the tracker. If you paste a long prompt and the model seems to forget the start, this is usually why. You raise it with the num_ctx parameter, set in the model’s Modelfile or passed per request in the API’s options, rather than expecting long prompts to fit by default.
It makes itself at home. One of the most-reacted issues asks it to stop placing models and config under your home directory, a real problem on machines with small system drives. Another asks for a way to disable the background service it sets up to start at login. Both are configurable, but neither is the default.
Some capabilities are requests, not features. Model Context Protocol support, reranking models, image generation, and a Prometheus /metrics endpoint are all near the top of the open issues rather than shipped.
Pulling models is all-or-nothing on bandwidth. There is no built-in throttle for the multi-gigabyte downloads, which bites on metered or shared connections.
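For the context-window item above, one way to avoid silent truncation is to size the window from the prompt before sending it. The sketch below is a rough guard, not a tokenizer. The four-characters-per-token ratio, the reply budget, and the helper names are all assumptions chosen for illustration. The only Ollama-specific piece is the num_ctx option in the request body.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return math.ceil(len(text) / chars_per_token)


def pick_num_ctx(prompt, reply_budget=1024, step=2048):
    """Smallest multiple of `step` that holds the prompt plus room for a reply."""
    needed = estimate_tokens(prompt) + reply_budget
    return max(step, math.ceil(needed / step) * step)


def chat_payload(model, prompt):
    """Request body for /api/chat with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": pick_num_ctx(prompt)},  # per-request override
        "stream": False,
    }
```

A 40,000-character prompt comes out at about 10,000 estimated tokens and gets a 12,288-token window. A larger window costs more memory, so rounding up in steps is gentler on a laptop than always asking for the maximum. To make the setting permanent for one model, put a PARAMETER num_ctx line in its Modelfile.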

None of these are disqualifying. They are the gap between the marketing surface and the first week of actually living with it.

Related

If you train or fine-tune in a framework like TensorFlow, Ollama is one way to serve the open models you land on. It also pairs with AI-native editors that point their assistant at a local Ollama endpoint instead of a hosted API, an approach the now-archived Void helped popularize. For the wider picture, see what else is moving in LLM tooling.

FAQ

Is Ollama free? Yes. It is MIT-licensed and runs locally with no per-token cost.

Does Ollama use my GPU? On mainstream NVIDIA and recent AMD and Apple GPUs, yes. On NPUs, Intel Arc, and some older cards it may fall back to CPU, and those are among the most requested open issues.
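To check this on your own machine rather than guess, you can ask the server which models are loaded and how much of each sits in GPU memory. This is a minimal sketch: it assumes the /api/ps endpoint returns a models list whose entries carry size and size_vram fields, and the helper names are hypothetical.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # lists currently loaded models


def gpu_share(ps_body):
    """Map model name -> fraction of its loaded size resident in GPU memory."""
    shares = {}
    for entry in json.loads(ps_body).get("models", []):
        size = entry.get("size", 0)
        vram = entry.get("size_vram", 0)  # assumed field name
        shares[entry.get("name", "?")] = vram / size if size else 0.0
    return shares


def loaded_gpu_share():
    """Fetch /api/ps from a running server and compute the shares."""
    with urllib.request.urlopen(PS_URL) as resp:
        return gpu_share(resp.read())
```

A share of 1.0 means the model is fully on the GPU, 0.0 means it is running on CPU, and anything in between means the layers are split. Load a model first (for example with ollama run), since only loaded models appear in the list.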

Why does the model seem to forget long prompts? The default context window is limited and truncates silently. Raise it for a given model with the num_ctx parameter in its Modelfile, or per request through the API’s options, rather than expecting a long prompt to fit by default.

Where does Ollama store models? Under your home directory by default, which is a common complaint on machines with small system drives. The location can be reconfigured.
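Before moving anything, it helps to see where the models live and how much space they take. The sketch below assumes the OLLAMA_MODELS environment variable is what relocates the store, and that a per-user install otherwise uses .ollama/models under your home directory. Installs that run as a system service may keep models under the service account's home instead. The helper names are hypothetical.

```python
import os
from pathlib import Path


def models_dir(env=os.environ):
    """Resolve the model store: OLLAMA_MODELS if set, else the assumed default."""
    override = env.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    return Path.home() / ".ollama" / "models"  # assumed per-user default


def dir_size_bytes(path):
    """Total size of regular files under path; 0 if the directory is missing."""
    path = Path(path)
    if not path.is_dir():
        return 0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def report():
    """Print the resolved location and its size in GiB."""
    location = models_dir()
    print(f"{location}: {dir_size_bytes(location) / 2**30:.1f} GiB")
```

The variable has to be visible to the server process, not just to the shell where you type ollama run, so set it wherever the background service gets its environment.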
