
How To Set Up Ollama


Install Ollama, run LLMs locally, compare model performance, and integrate local models into your apps using Warp.

Running AI models locally just got easier — and faster — with Ollama.

In this guide, we’ll walk through how to use Warp to install, profile, and integrate Ollama into your local setup.


Before running large language models (LLMs) locally, confirm your hardware can handle them.

Example setups:

  • Mac: 64GB unified memory — good for larger models but with lower throughput.
  • Windows (NVIDIA RTX 5090): 32GB VRAM — excellent performance, but limited by VRAM capacity.

🧠 Rule of thumb: You’ll need roughly 1GB of VRAM per billion parameters.
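That rule of thumb is easy to turn into a quick sizing check. Here's a minimal sketch (the per-billion factor is an approximation; quantized models often need less):

```python
def approx_vram_gb(billions_of_params: float, gb_per_b: float = 1.0) -> float:
    """Estimate VRAM needed, using the ~1 GB per billion parameters rule."""
    return billions_of_params * gb_per_b

# Ballpark the two models mentioned in this guide:
for name, size_b in [("gpt-oss (20B)", 20), ("mistral (8B)", 8)]:
    print(f"{name}: ~{approx_vram_gb(size_b):.0f} GB VRAM")
```

Compare the estimate against your GPU's VRAM (or unified memory on a Mac) before pulling a model.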


Run a model locally:

ollama run gpt-oss

For example:

  • Try GPT-OSS 20B (requires ≥16GB VRAM, supports tool calling).
  • Then try Mistral 8B for a faster, smaller alternative.

Compare their performance and quality side-by-side.
Use Warp to easily monitor GPU usage and model response time.
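For the response-time comparison, Ollama's native API reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) with each response; a small helper turns those into a throughput number you can compare across models:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (nanoseconds) to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. a response reporting 120 tokens generated over 2.4 seconds:
print(tokens_per_second(120, 2_400_000_000))  # → 50.0
```

Run the same prompt against both models and compare the resulting tokens/sec alongside output quality.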


Here’s a quick glossary for choosing the right local model:

  • Thinking: the model “thinks” before answering; better for complex reasoning.
  • Tools: the model can use external utilities (e.g., web search).
  • Vision: the model can process and respond to images.
  • Embedding: converts text to numeric form for search or RAG pipelines.
  • Quantization: reduces memory use by lowering precision (e.g., 4-bit).
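As a sketch of the embedding case, here's the request body you'd POST to Ollama's `/api/embed` endpoint (the model name `nomic-embed-text` is just one common choice; use any embedding-capable model you've pulled):

```python
import json

# Build an embedding request for Ollama's /api/embed endpoint.
payload = {
    "model": "nomic-embed-text",   # assumed example; any embedding model works
    "input": "Warp is a modern terminal.",
}
body = json.dumps(payload)
# With the server running, POST `body` to http://localhost:11434/api/embed;
# the response includes an "embeddings" list of float vectors.
```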

Most apps use OpenAI-compatible APIs, so integration is simple.

  1. Open your app’s code in Warp.
  2. Locate the OpenAI client initialization.
  3. Replace the base URL with Ollama’s local endpoint (http://localhost:11434/v1).
  4. Update your API key and model name.

Warp helps you quickly locate, edit, and test the integration directly from the terminal.
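The steps above can be sketched with nothing but the standard library. Ollama's OpenAI-compatible endpoint defaults to http://localhost:11434/v1 and accepts any placeholder API key (the bearer token below is ignored by the server but required by the wire format); the model name is whatever you've pulled:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

payload = {
    "model": "gpt-oss",  # any model pulled via `ollama run` / `ollama pull`
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ollama",  # placeholder key; Ollama ignores it
    },
)
# With the server running (`ollama serve`), send the request:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

If your app uses the official OpenAI client instead, the same two changes apply: pass `base_url="http://localhost:11434/v1"` and any non-empty API key when constructing the client.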


Pull a model and modify it.

Then save it as a custom model with new settings, such as a temperature or a system prompt.

Use Warp to generate a model file automatically.

This adds a structured system prompt for that task — ready to use instantly.
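A minimal Modelfile sketch of such a custom model (the base model, temperature, and system prompt here are illustrative):

```
# Modelfile -- build with: ollama create code-reviewer -f Modelfile
FROM gpt-oss
PARAMETER temperature 0.3
SYSTEM """You are a concise code reviewer. Point out bugs first."""
```

After `ollama create code-reviewer -f Modelfile`, run it like any other model with `ollama run code-reviewer`.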