# GPT4All

Run GGUF models entirely on-device using the GPT4All Python bindings. No API key, no internet connection required after the model download.
## Install

```bash
pip install "synapsekit[gpt4all]"
```

(The quotes keep shells like zsh from treating the extras bracket as a glob.)
## Download a model

Download any GPT4All-compatible GGUF model from GPT4All's model explorer or Hugging Face:

```python
from gpt4all import GPT4All

# Using the GPT4All Python API: downloads the file automatically
# on first use, then loads it.
model = GPT4All("Phi-3-mini-4k-instruct.Q4_0.gguf")
```
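If you already have the model file on disk, you can point the bindings at it directly and skip the download; `model_path` and `allow_download` are parameters of the `GPT4All` constructor (the directory below is illustrative):

```python
from gpt4all import GPT4All

# Load a model that is already on disk and never touch the network.
model = GPT4All(
    "Phi-3-mini-4k-instruct.Q4_0.gguf",
    model_path="/path/to/models",  # directory containing the .gguf file
    allow_download=False,          # fail instead of downloading
)
```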
## Usage

```python
import asyncio

from synapsekit import LLMConfig
from synapsekit.llm.gpt4all import GPT4AllLLM

config = LLMConfig(
    model="Phi-3-mini-4k-instruct.Q4_0.gguf",
    api_key="",  # no key needed for local inference
)
llm = GPT4AllLLM(config)

async def main() -> None:
    # Streaming: tokens arrive as the model produces them
    async for token in llm.stream("Explain neural networks briefly"):
        print(token, end="", flush=True)

    # Generate: await the complete response in one call
    response = await llm.generate("What is RAG?")
    print(response)

asyncio.run(main())
```
## With RAG

```python
from synapsekit import RAG

rag = RAG(
    model="Phi-3-mini-4k-instruct.Q4_0.gguf",
    api_key="",
    provider="gpt4all",
)
rag.add("SynapseKit is an async-native Python framework for LLM applications.")
answer = rag.ask_sync("What is SynapseKit?")
print(answer)
```
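Since SynapseKit is async-native, `ask_sync` is presumably a blocking convenience wrapper. A sketch of the async form, assuming a hypothetical awaitable `rag.ask()` counterpart exists (check the RAG reference for the actual method name):

```python
import asyncio

async def main() -> None:
    # Hypothetical awaitable counterpart to ask_sync()
    answer = await rag.ask("What is SynapseKit?")
    print(answer)

asyncio.run(main())
```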
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `model` | required | GGUF model filename or path |
| `n_threads` | auto | Number of CPU threads |
| `device` | `"cpu"` | `"cpu"` or `"gpu"` |
| `n_ctx` | 2048 | Context window length in tokens |
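A sketch of setting these knobs, assuming `LLMConfig` accepts them as keyword arguments and forwards them to the GPT4All backend (the exact plumbing may differ):

```python
from synapsekit import LLMConfig

# Assumption: provider-specific options are plain keyword arguments on LLMConfig.
config = LLMConfig(
    model="Phi-3-mini-4k-instruct.Q4_0.gguf",
    api_key="",
    n_threads=8,   # pin to 8 CPU threads instead of auto-detection
    device="gpu",  # requires a compatible GPU build; see Notes
    n_ctx=4096,    # double the default context window
)
```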
## Notes

- Streaming is implemented via a callback shim: GPT4All's blocking `generate()` is called in `run_in_executor` so the event loop is never blocked (a sketch of the pattern appears below).
- Model files are cached in `~/.cache/gpt4all/` by default.
- GPU acceleration requires a compatible GPU and the CUDA build of GPT4All.
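For reference, a minimal sketch of the executor pattern behind that first note. It bridges GPT4All's blocking token iterator onto the event loop with a queue; synapsekit's real shim uses the callback API and will differ in detail:

```python
import asyncio
from gpt4all import GPT4All

async def stream_tokens(model: GPT4All, prompt: str):
    """Yield tokens from blocking GPT4All generation without stalling the loop."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def produce() -> None:
        # Runs in a worker thread; the event loop stays responsive.
        try:
            for token in model.generate(prompt, streaming=True):
                loop.call_soon_threadsafe(queue.put_nowait, token)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, None)  # done sentinel

    producer = loop.run_in_executor(None, produce)
    while (token := await queue.get()) is not None:
        yield token
    await producer  # re-raise any exception from the worker thread

async def main() -> None:
    model = GPT4All("Phi-3-mini-4k-instruct.Q4_0.gguf")
    async for token in stream_tokens(model, "Explain neural networks briefly"):
        print(token, end="", flush=True)

asyncio.run(main())
```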