Ollama Setup Guide 2026: Install, GPU, Models, API

If you have been trying to run AI models on your own machine, you have probably come across Ollama. It is basically the easiest way to get a large language model running locally — no API key, no cloud bill, no data leaving your computer. One command installs it, another command pulls a model, and within a few minutes you are chatting with Llama 4 or Gemma 4 right from your terminal.

I spent a few weeks going through the whole setup — GPU issues, Docker networking problems, the Open WebUI trick that nobody mentions upfront, all of it. This guide covers everything from the first install to using Ollama from Python, Java, and calling it via REST API. So whether you just heard about Ollama or you got stuck somewhere in the middle, this should help.

One thing before we start: as of late June 2026, the current stable version is Ollama v0.22.1 (released May 3, 2026). There is also a 0.30 RC line being actively developed, but it has some open issues with specific models, so most people should stay on 0.22.x for now.

Installing Ollama

The install itself is pretty fast. On macOS and Linux, one command handles everything:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, you download the setup file from ollama.com and run it like any other installer. There is also a winget option if you prefer that. After install, Ollama runs as a background service. You can check it is working by running:

ollama --version

If ollama is not recognized after installation, a new terminal session may be needed for PATH changes to take effect. That caught me the first time — I kept wondering why the command was not found.

Once it is running, pull your first model:

ollama pull llama3.2:3b
ollama run llama3.2:3b

The 3B model is about 2 GB, downloads reasonably fast, and runs on almost any hardware. Good starting point before you try anything bigger.

How Much Hardware Do You Actually Need

Ollama’s minimum requirement is 8 GB of RAM, 10 GB of free disk, and a 64-bit CPU with AVX2 — no GPU required. For comfortable daily use, target 16 GB RAM plus an 8–12 GB GPU (NVIDIA RTX 3060/4060 or a 16 GB Apple Silicon Mac), which runs 7B–14B models at 30–60 tokens per second. At Q4 quantization: 8 GB runs 7B models, 16 GB runs 13B–14B, and 24 GB or more runs 32B.

The tricky part is that running on CPU-only is technically possible, but it is slow. Like, annoyingly slow. On an old laptop with no GPU, Llama 3 8B runs at about five tokens per second — about 47 seconds for a response. Same model on a desktop with CUDA drivers: roughly three seconds. If you are serious about using this regularly, getting at least a mid-range GPU makes a huge difference.

The rule of thumb for models: a 7B parameter model at the default Q4_K_M quantization needs about 5–6 GB of VRAM. A 14B needs 10–12 GB. Anything 32B and above needs 24 GB or more, which usually means an RTX 4090 or a dual-GPU setup. Most people running 7B and 14B models on an RTX 3060 12GB or RTX 4060 Ti 16GB are having a fine experience.
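The rule of thumb above can be turned into a quick back-of-the-envelope estimate. The sketch below is not how Ollama computes anything. It assumes roughly 4.85 bits per weight for Q4_K_M, a 20% allowance for KV cache and runtime buffers, and a 0.5 GB fixed overhead. All three constants are illustrative assumptions chosen to land near the figures quoted in this guide.

```python
# Rough VRAM estimate for a quantized model. Every constant here is an
# assumption picked to match this guide's rule of thumb, not a value from Ollama.
BITS_PER_WEIGHT_Q4_K_M = 4.85  # assumed average bits per weight for Q4_K_M
RUNTIME_FACTOR = 1.2           # assumed allowance for KV cache and buffers
FIXED_OVERHEAD_GB = 0.5        # assumed fixed cost


def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = BITS_PER_WEIGHT_Q4_K_M) -> float:
    """Return an approximate VRAM requirement in GB."""
    # One billion parameters at 8 bits per weight is about 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * RUNTIME_FACTOR + FIXED_OVERHEAD_GB, 1)


if __name__ == "__main__":
    for size in (7, 14, 32):
        print(f"{size}B -> ~{estimate_vram_gb(size)} GB")
```

With these assumed constants the estimates come out near 5.6 GB for 7B, 10.7 GB for 14B, and 23.8 GB for 32B. Treat them as a sanity check before pulling a model, not as a guarantee that it fits.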

GPU Setup: NVIDIA, AMD, and Apple Silicon

This is where most people run into trouble, so let me break it down.

NVIDIA is the easiest. Ollama supports NVIDIA GPUs with compute capability 5.0 and above, with driver version 550 or newer. For older GPUs with compute capability 5.0 through 6.2, driver version 570 is needed. That covers GTX 900 series and anything newer. Install the drivers, install Ollama, and GPU acceleration just works automatically. You do not configure anything extra.

If Ollama is not using your NVIDIA GPU after install, the most common reason is a driver version mismatch. Run nvidia-smi first — if that itself does not work, fix your drivers before debugging anything else. After install, run ollama ps — if it shows 100% CPU, Ollama is not detecting your GPU. Also, one specific issue to watch: on Linux, after a suspend/resume cycle, Ollama sometimes fails to discover your NVIDIA GPU and falls back to CPU. You can work around this by reloading the UVM driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm.
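The same check can be scripted against the local server. This is a minimal sketch that assumes the /api/ps endpoint returns a "models" list whose entries carry "size" and "size_vram" byte counts, as described in the Ollama API documentation. The helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model that sits in VRAM (0.0 means CPU only)."""
    size = model_entry.get("size", 0)
    size_vram = model_entry.get("size_vram", 0)
    return size_vram / size if size else 0.0


def running_models(base_url: str = OLLAMA_URL) -> list:
    """Fetch the currently loaded models from /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


if __name__ == "__main__":
    for model in running_models():
        print(f"{model['name']}: {gpu_fraction(model):.0%} on GPU")
```

A model has to be loaded for it to appear, so send it a prompt first and then run the script.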

AMD is more involved. Ollama requires AMD ROCm v7 driver on Linux. RX 6000 and RX 7000 series are the safest bet. Not all AMD GPUs are officially supported, and some older cards need an extra environment variable to force compatibility. For example, if you have an RX 5400, you can try HSA_OVERRIDE_GFX_VERSION=10.3.0 to make it run as if it were a supported target — but it does not always work.

Apple Silicon is actually the nicest experience. Zero configuration needed. Install Ollama, run a model, Metal acceleration kicks in automatically. Starting with Ollama 0.19 (shipped as a preview in March 2026), inference on Apple Silicon moved from llama.cpp to Apple’s MLX framework, pushing decode speeds up roughly 1.6–2x. You can enable it explicitly with OLLAMA_BACKEND=mlx ollama serve if needed, but for most M-series Macs it now activates on its own.

Which Models Should You Use

Ollama’s model library at ollama.com has a big list. But honestly, most people need one of four or five models, and the choice depends almost entirely on how much VRAM you have.

For small hardware (6–8 GB VRAM), Gemma 4 from Google and Phi-4-mini from Microsoft are the ones worth trying. Both are 4B-range models that start fast and work well for coding help and chat. Gemma 4 in particular was built for low-resource setups and the default Ollama pull keeps it compact.

For mid-range hardware (12–16 GB VRAM), the sweet spot is Llama 3.1 at 8B or Qwen2.5 at 7B (the Llama 3.2 text models only come in 1B and 3B sizes). If you do a lot of coding specifically, qwen2.5-coder:7b is worth pulling, since it is better at code than the general Llama models at the same size.

In 2026, OpenAI also released gpt-oss 20B natively in MXFP4, which lands around 16 GB of memory — a strong option for people with a 16 GB GPU who want something closer to frontier-model quality. That was a bit of a surprise when it dropped; not many people expected OpenAI to ship weights directly to Ollama.

One thing the guides do not mention clearly: watch out for the :cloud suffix. Ollama now offers cloud-hosted variants of some large models. Pulling one finishes almost instantly because no weights are downloaded, and, as of May 2026, the model then runs on Ollama's servers, not your GPU. If your GPU is not engaging after pulling a model, check whether the model tag ends with :cloud. Ollama Cloud reliability was also rough through March and April 2026, with GitHub issues documenting sustained timeouts and empty responses on Cloud Pro plans, so running local weights is the safer option right now.

The REST API

Ollama is not just a CLI tool. It runs a local server on port 11434 and exposes a REST API that any application can call. This is how you connect your own code to local models.

There are two independent API surfaces: the native /api/* endpoints for full control, and /v1/* as a drop-in replacement for OpenAI code. The /v1/ path is the one you want if you are migrating existing OpenAI code, because you can literally just change the base URL and it works with the same client libraries.

The main endpoints are:

  • /api/generate — single-turn, takes a prompt string
  • /api/chat — multi-turn, takes an array of messages with roles
  • /api/embed — generates embeddings for RAG pipelines
  • /api/tags — lists all models you have pulled

A basic curl call looks like this:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain RAG in two sentences.",
  "stream": false
}'
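The same call can be made from Python with nothing but the standard library. This is a minimal sketch, assuming the server is on the default port and that the non-streaming reply carries the text in a "response" field. The function names are just for this example.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3.2",
                           base_url: str = "http://localhost:11434"):
    """Build (but do not send) a POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    print(generate("Explain RAG in two sentences."))
```

Splitting request building from sending keeps the payload easy to inspect when a call fails.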

By default Ollama only accepts connections from localhost. To allow connections from other machines, set OLLAMA_HOST=0.0.0.0 before starting Ollama — and if you do this, add firewall rules, because by default Ollama has no authentication and anyone on the network can send requests.

Using Ollama from Python

There are two ways to talk to Ollama from Python. One is just using the requests library to call the REST API directly. The other is the official ollama Python package, which is cleaner.

Install it:

pip install ollama

Basic usage:

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Write a Python function to count word frequency."},
    ],
)
print(response['message']['content'])

The Python library also supports an AsyncClient for async code and a stream=True parameter for streaming responses chunk by chunk. Streaming is useful for chat applications where you want output to appear progressively rather than waiting for the full response.
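Here is a minimal streaming sketch using the stream=True parameter described above. It assumes each streamed chunk exposes its text at chunk["message"]["content"], the same shape as the non-streaming response. The helper names are hypothetical.

```python
def collect_stream(chunks) -> str:
    """Print each chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)  # show output progressively
        parts.append(piece)
    print()
    return "".join(parts)


def stream_chat(prompt: str, model: str = "llama3.2") -> str:
    """Ask the model a question and stream the answer to the terminal."""
    import ollama  # imported here so collect_stream works without the package

    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)


if __name__ == "__main__":
    stream_chat("Explain streaming in one sentence.")
```

Returning the joined text as well as printing it means the same helper can feed a chat history or a log.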

One thing that tripped me up: when migrating from OpenAI to Ollama, the model name must match exactly what ollama list shows. Passing "llama3" instead of "llama3.2:3b" gives a 404 — the model name is local, not globally resolved like OpenAI's API.
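A small pre-flight check avoids that 404. The sketch below assumes /api/tags returns a "models" list with a "name" field per entry, and that a name without a tag is treated as :latest. It is a simplification and ignores names that include a registry host with a port.

```python
import json
import urllib.request


def normalize(name: str) -> str:
    """Treat a name without a tag as ':latest'."""
    return name if ":" in name else f"{name}:latest"


def is_available(name: str, local_names: list) -> bool:
    """True if the requested model matches one that has been pulled."""
    return normalize(name) in {normalize(n) for n in local_names}


def local_model_names(base_url: str = "http://localhost:11434") -> list:
    """Names of pulled models, as reported by /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return [m["name"] for m in json.load(resp).get("models", [])]


if __name__ == "__main__":
    wanted = "llama3.2:3b"
    print(wanted, "available:", is_available(wanted, local_model_names()))
```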

If you want to use the OpenAI Python SDK with Ollama instead (useful if your codebase already uses it), just point the base URL at Ollama:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but ignored by Ollama
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

That is it — two lines changed, rest of your code stays the same.

Using Ollama from Java

For Java developers, Ollama works through HTTP, so any HTTP client works. Spring Boot projects typically use WebClient for non-blocking requests. The Ollama Spring AI integration handles the REST calls automatically if you add the dependency, but you can also call the API manually.

With Spring Boot’s WebClient:

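import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

// Client pointed at the local Ollama server
WebClient client = WebClient.create("http://localhost:11434");

// Request body for /api/generate (non-streaming)
String body = """
        {
          "model": "llama3.2",
          "prompt": "Summarize what a REST API is.",
          "stream": false
        }
        """;

// POST the JSON and block until the full response arrives
String response = client.post()
        .uri("/api/generate")
        .contentType(MediaType.APPLICATION_JSON)
        .bodyValue(body)
        .retrieve()
        .bodyToMono(String.class)
        .block();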

For production Java apps connecting to a remote Ollama instance, set the base URL in application.properties, for example as a custom property ollama.base-url=http://YOUR_SERVER_IP:11434 that you read into your WebClient (the Spring AI starter uses spring.ai.ollama.base-url instead). Remember that if Ollama is on a different machine, you need OLLAMA_HOST=0.0.0.0 on that machine for it to accept external connections.

Running Ollama with Docker

Docker is a good option if you want a clean, reproducible setup or if you want to run Ollama on a server.

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

For NVIDIA GPU support inside Docker, you need the nvidia-container-toolkit installed, then:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Verify GPU access from inside the container by running docker exec -it ollama nvidia-smi. If it fails, update your NVIDIA drivers: Ollama v0.6 and above requires driver version 535 or newer inside Docker.

After the container is running, pull models by running commands inside it:

docker exec -it ollama ollama pull llama3.2:3b

AMD GPU configuration in Docker is more complex; running Ollama directly on the host is the easier path for AMD cards.

Open WebUI: The ChatGPT Interface for Ollama

The command line is fine for testing, but most people want a proper chat interface with history, model switching, and file uploads. That is where Open WebUI comes in.

Open WebUI in 2026 supports over nine vector databases for RAG, fifteen or more web search providers, voice input and output, MCP server integration, multi-user RBAC, LDAP/SSO, and native Python function tools — all running locally. It crossed 90,000 GitHub stars in 2025.

The fastest setup is the bundled Docker image that includes both Ollama and Open WebUI in one container:

# With GPU
docker run -d -p 3000:8080 --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:ollama

# CPU only
docker run -d -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:ollama

Then open http://localhost:3000 in your browser. The first account you create becomes the admin.

If you already have Ollama running separately on the host, run Open WebUI as its own container and connect it to Ollama:

docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main

The most common issue is Open WebUI showing Ollama as disconnected. Run curl http://localhost:11434/api/tags from your terminal — if that returns a list of models, Ollama is running and the issue is Docker networking. Try setting OLLAMA_HOST=0.0.0.0 as an environment variable before starting Ollama, which makes it listen on all interfaces rather than just localhost.
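The curl check above can be wrapped in a small script and pointed at whichever address you are debugging. This is a sketch with a hypothetical helper name. It only tells you whether /api/tags answers at a given base URL, nothing more.

```python
import urllib.request


def ollama_reachable(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


if __name__ == "__main__":
    for url in ("http://localhost:11434", "http://host.docker.internal:11434"):
        print(url, "->", "reachable" if ollama_reachable(url) else "not reachable")
```

Run it on the host to confirm Ollama is up. If you can run it from inside the Open WebUI container and only the host.docker.internal address fails, the problem is the network path between container and host and not Ollama itself.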

After connecting, you can pull new models directly from the Open WebUI settings panel without touching the command line. Go to Settings → Models, type the model name in the pull field — for example qwen2.5-coder:7b — and click download. Progress appears in the UI, same as running ollama pull but without switching to the terminal.

A Quick Note on the Modelfile

Ollama supports custom model configurations through a Modelfile — basically a text file where you set a system prompt, temperature, context size, and other parameters. If you always want a model to behave a certain way, this saves you from passing the same system prompt every single time.

A simple example:

FROM llama3.2
PARAMETER temperature 0.7
SYSTEM "You are a senior Python developer. Reply concisely and with working code only."

Save that as Modelfile, then build it:

ollama create my-python-dev -f ./Modelfile
ollama run my-python-dev

Now my-python-dev shows up in your model list, in Open WebUI too.
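If you maintain several of these, generating the Modelfile text from a script keeps them consistent. This is a small sketch with a hypothetical render_modelfile helper. It only uses the FROM, PARAMETER, and SYSTEM instructions shown above and wraps the system prompt in triple quotes so it can span multiple lines.

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.2",
        "You are a senior Python developer. Reply concisely and with working code only.",
        temperature=0.7,
    )
    with open("Modelfile", "w", encoding="utf-8") as fh:
        fh.write(text)
    print(text)
```

After writing the file, the ollama create command works exactly as before.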

Things That Will Catch You Off Guard

A few gotchas based on experience that most guides skip.

Model size vs available VRAM. Ollama will try to load a model even if it does not fully fit in VRAM, automatically offloading some layers to CPU. This lets you run larger models than your GPU technically fits, but the speed penalty for offloaded layers is bad. While the model is loaded, run ollama ps in another terminal: the PROCESSOR column shows the CPU/GPU split, and the server logs report how many layers were offloaded to the GPU. If less than half of the model is on the GPU, it is too big and will run slowly.

Context window size. Ollama uses a default context window of 4096 tokens. For longer conversations or documents, you might need to set num_ctx higher — but higher context costs more memory, so watch your VRAM.
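One way to raise the context window per request is the options field of the API. The sketch below builds a /api/chat body with num_ctx set. The value 8192 is an arbitrary example, and chat_payload is a name made up here.

```python
import json


def chat_payload(prompt: str, model: str = "llama3.2", num_ctx: int = 8192) -> dict:
    """Body for POST /api/chat that requests a larger context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # larger values need more memory
    }


if __name__ == "__main__":
    print(json.dumps(chat_payload("Summarize this long document."), indent=2))
```

With the ollama Python package, the equivalent is passing options={"num_ctx": 8192} to ollama.chat.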

Model names are case-sensitive and must match exactly. This one trips up Python developers coming from the OpenAI SDK where "gpt-4" just works. With Ollama, "Llama3" is not the same as "llama3.2:3b". Always run ollama list to see the exact name.

And the 0.30 RC line — as of late May 2026, the 0.30.0 release is in RC21 and has open issues with specific models including laguna-xs.2 and llama3.2-vision. If you are hitting GPU detection oddities, falling back to 0.22.x is the safer move while the rewrite stabilizes.

Putting It Together

The typical setup most developers land on is: Ollama on the host for inference, Open WebUI running in Docker for the chat interface, and Python or REST API calls for any automation or app development. That combination covers almost every use case.

If you want to go further, RAG pipelines are the natural next step — Ollama’s /api/embed endpoint with nomic-embed-text as the embedding model, paired with any vector database like ChromaDB. Open WebUI actually has this built in, so you can upload a PDF and chat with it without writing any code.
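The retrieval step of such a pipeline can be sketched without a vector database. This assumes /api/embed accepts an "input" list and returns an "embeddings" list of vectors, and that nomic-embed-text has been pulled. The in-memory search below stands in for ChromaDB purely for illustration.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """Return one embedding vector per input text via /api/embed."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vec, doc_vecs) -> int:
    """Index of the document vector closest to the query."""
    return max(range(len(doc_vecs)),
               key=lambda i: cosine_similarity(query_vec, doc_vecs[i]))


if __name__ == "__main__":
    docs = ["Ollama runs models locally.", "Bananas are yellow."]
    doc_vecs = embed(docs)
    query_vec = embed(["How do I run an LLM on my laptop?"])[0]
    print("Best match:", docs[most_similar(query_vec, doc_vecs)])
```

A real pipeline would store the document vectors once and pass the best-matching chunks to /api/chat as context.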

Start with the smallest model that does what you need. Pull gemma4:4b or llama3.2:3b, get your workflow running end to end, and then decide if you actually need something bigger. Nine times out of ten, the small model handles it fine — and upgrading later is just one ollama pull away.
