
Ollama (Local)

Run open-source LLMs locally via Ollama. No API key required. Full privacy -- nothing leaves your machine.

Install Ollama

macOS

brew install ollama
ollama serve

Linux

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Windows

Download the installer from ollama.com/download and run it.

Then install the SynapseKit package:

pip install synapsekit[ollama]

Pull a model

ollama pull llama3.2
ollama pull mistral
ollama pull gemma2
ollama pull phi3
ollama pull codellama
ollama pull deepseek-r1
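
Once ollama serve is running, you can confirm the server is reachable and see which models have been pulled by querying Ollama's built-in HTTP API (it listens on port 11434 by default):

# Lists the models available locally on this machine
curl http://localhost:11434/api/tags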

Via the RAG facade

from synapsekit import RAG

rag = RAG(model="llama3.2", api_key="", provider="ollama")
rag.add("Your document text here")

answer = rag.ask_sync("Summarize the document.")
print(answer)

Direct usage

from synapsekit.llm.ollama import OllamaLLM
from synapsekit.llm.base import LLMConfig

llm = OllamaLLM(LLMConfig(
    model="llama3.2",
    api_key="",
    provider="ollama",
    temperature=0.7,
    max_tokens=512,
))

async for token in llm.stream("Explain async Python in one paragraph."):
    print(token, end="", flush=True)
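
The async for loop above must run inside a coroutine. A minimal sketch of running it as a standalone script with asyncio.run, reusing the llm object from the previous block:

import asyncio

async def main():
    # Stream tokens from the locally running Ollama model
    async for token in llm.stream("Explain async Python in one paragraph."):
        print(token, end="", flush=True)

asyncio.run(main())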

Custom base URL

If Ollama is running on a different host (e.g. a GPU server on your LAN):

llm = OllamaLLM(
    LLMConfig(model="llama3.2", api_key="", provider="ollama"),
    base_url="http://192.168.1.50:11434",
)
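
Note that by default the Ollama server only listens on localhost. On the remote machine, set the standard Ollama environment variable OLLAMA_HOST so the server binds to all interfaces before starting it:

# On the GPU server: accept connections from other machines on the LAN
OLLAMA_HOST=0.0.0.0:11434 ollama serve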

Supported models

Any model available from ollama pull:

Model | Size | RAM Required | Notes
llama3.2 | 3B | ~4 GB | Fast, great for most tasks
llama3.1 | 8B | ~8 GB | Good quality
llama3.1:70b | 70B (Q4) | ~40 GB | High quality, needs GPU
mistral | 7B | ~8 GB | Strong reasoning
gemma2 | 9B | ~10 GB | Google's open model
phi3 | 3.8B | ~4 GB | Microsoft, fast + efficient
codellama | 7B | ~8 GB | Code generation
deepseek-r1 | 7B | ~8 GB | Reasoning with chain of thought
nomic-embed-text | – | ~1 GB | Embeddings only
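
Switching models only requires changing the model name, provided the model has been pulled locally. A short sketch using the RAG facade shown earlier, here with mistral:

from synapsekit import RAG

# Any model from the table works once it has been pulled with `ollama pull`
rag = RAG(model="mistral", api_key="", provider="ollama")
rag.add("Your document text here")
print(rag.ask_sync("Summarize the document."))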

GPU memory guide

Model size | Minimum VRAM | Recommended
1-3B | 4 GB | GTX 1650, M1
7-8B | 8 GB | RTX 3070, M2
13B | 12 GB | RTX 3080, M2 Pro
70B (Q4) | 40 GB | A100, M2 Ultra

Models that don't fit in VRAM run on CPU -- much slower.
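
To see whether a loaded model is running fully on the GPU or has spilled over to the CPU, recent Ollama versions provide ollama ps:

# Shows loaded models, their memory footprint, and the GPU/CPU split
ollama ps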

Ollama-specific options

llm = OllamaLLM(
    LLMConfig(model="llama3.2", api_key="", provider="ollama"),
    keep_alive="10m",  # keep the model loaded in VRAM after the request
    num_ctx=8192,      # context window override (default: model default)
)

Option | Description
keep_alive | Time to keep the model in memory. "0" unloads immediately, "-1" keeps it loaded forever
num_ctx | Override the context window size
num_gpu | Number of GPU layers to offload
num_thread | Number of CPU threads to use
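
A short sketch combining the options above; passing num_gpu the same way as keep_alive and num_ctx is an assumption based on the example, so check your installed version:

llm = OllamaLLM(
    LLMConfig(model="llama3.2", api_key="", provider="ollama"),
    keep_alive="-1",  # never unload the model between requests
    num_gpu=99,       # offload as many layers as possible to the GPU
)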

Function calling

Some Ollama models support function calling (e.g. llama3.1, mistral-nemo):

from synapsekit import FunctionCallingAgent, tool
from synapsekit.llm.ollama import OllamaLLM
from synapsekit.llm.base import LLMConfig

@tool
def get_weather(city: str) -> str:
    """Get the weather for a city."""
    return f"It's sunny in {city}, 24 degrees C"

llm = OllamaLLM(LLMConfig(
    model="llama3.1",
    api_key="",
    provider="ollama",
))

agent = FunctionCallingAgent(llm=llm, tools=[get_weather])
answer = await agent.run("What's the weather in Tokyo?")
caution

Not all Ollama models support function calling. Use llama3.1 or later for reliable results. For other models, use ReActAgent instead.
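
A minimal sketch of the ReActAgent fallback; the top-level import path mirrors FunctionCallingAgent and is an assumption, so check your installed version:

from synapsekit import ReActAgent  # assumed import path

# ReAct-style prompting drives tool use through plain text reasoning,
# so it works with models that lack native function calling.
agent = ReActAgent(llm=llm, tools=[get_weather])
answer = await agent.run("What's the weather in Tokyo?")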

Use in GitHub Actions (CI)

Run tests with a local Ollama model in CI:

# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start Ollama
        run: |
          curl -fsSL https://ollama.com/install.sh | sh
          ollama serve &
          sleep 5
          ollama pull phi3
      - name: Run tests
        run: |
          pip install synapsekit[ollama]
          pytest tests/
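
The fixed sleep 5 can be flaky on slow runners. One option is to poll the Ollama HTTP endpoint until it responds before pulling the model (a sketch; adapt to your workflow):

- name: Start Ollama
  run: |
    curl -fsSL https://ollama.com/install.sh | sh
    ollama serve &
    # Wait until the API answers instead of sleeping a fixed time
    until curl -sf http://localhost:11434/ > /dev/null; do sleep 1; done
    ollama pull phi3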

Error handling

from synapsekit.exceptions import LLMError

try:
    response = await llm.generate("Hello")
except LLMError as e:
    if "connection refused" in str(e).lower():
        print("Ollama is not running. Start it with: ollama serve")
    elif "model not found" in str(e).lower():
        print("Pull the model first: ollama pull llama3.2")
    else:
        raise
tip

To list all locally available models: ollama list