llama.cpp

Run GGUF models entirely on-device with llama-cpp-python. No API key required. Works on CPU or GPU.

Install

pip install synapsekit[llamacpp]

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

For Apple Silicon (Metal):

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

Download a model

Download a GGUF file from Hugging Face:

# Example: Llama 3.1 8B quantized
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
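
If you prefer to stay in Python, the same file can be fetched with huggingface_hub (a standard Hugging Face API, independent of synapsekit):

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)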

Usage

from synapsekit.llm.llamacpp import LlamaCppLLM
from synapsekit import LLMConfig

config = LLMConfig(
    model="llama-3.1-8b",
    api_key="",  # no key needed
    provider="llamacpp",
)

llm = LlamaCppLLM(
    config,
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)

# Streaming
async for token in llm.stream("Explain gradient descent"):
    print(token, end="")

# Generate
response = await llm.generate("What is the capital of France?")
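
Both calls are asynchronous, so a standalone script needs an event loop. A minimal sketch, assuming the stream and generate methods behave as shown above:

import asyncio

from synapsekit import LLMConfig
from synapsekit.llm.llamacpp import LlamaCppLLM


async def main() -> None:
    config = LLMConfig(model="llama-3.1-8b", api_key="", provider="llamacpp")
    llm = LlamaCppLLM(
        config,
        model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    )

    # Stream tokens as they are produced.
    async for token in llm.stream("Explain gradient descent"):
        print(token, end="", flush=True)
    print()

    # Or request the full completion in one call.
    print(await llm.generate("What is the capital of France?"))


asyncio.run(main())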

GPU offloading

Use n_gpu_layers to offload layers to the GPU. Set to -1 to offload all layers:

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_gpu_layers=35,  # offload 35 layers to GPU
    n_ctx=4096,       # context window size
)
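
If the same script has to run on both GPU and CPU-only machines, you can pick n_gpu_layers at runtime. The heuristic below is an illustration, not part of synapsekit: it offloads everything on Apple Silicon (Metal) and stays on the CPU elsewhere.

import platform

# Assumption for illustration: Metal on Apple Silicon can usually hold all layers.
on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_gpu_layers=-1 if on_apple_silicon else 0,
    n_ctx=4096,
)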

Constructor parameters

| Parameter | Default | Description |
| --- | --- | --- |
| model_path | (required) | Path to the GGUF model file |
| n_ctx | 2048 | Context window size |
| n_gpu_layers | 0 | Number of layers to offload to GPU (0 = CPU only, -1 = all) |
| top_p | 0.95 | Top-p sampling parameter |

Any extra kwargs are forwarded directly to llama_cpp.Llama().
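
For example, llama_cpp.Llama accepts options such as n_threads, n_batch, and verbose, so they can be passed straight through (the values here are illustrative):

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_threads=8,     # forwarded to llama_cpp.Llama: CPU threads used for generation
    n_batch=512,     # forwarded to llama_cpp.Llama: prompt-processing batch size
    verbose=False,   # forwarded to llama_cpp.Llama: silence llama.cpp logging
)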

RAG with local models

from synapsekit import RAG

rag = RAG(
    model="./models/llama-3.1-8b.gguf",
    api_key="",
    provider="llamacpp",
)
rag.add("Your document text here")
answer = rag.ask_sync("Summarize the document.")
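
The same pattern extends to several documents; this sketch reuses only the add and ask_sync calls shown above (the document strings are placeholders):

documents = [
    "SynapseKit can run GGUF models locally through llama.cpp.",
    "GPU offloading is controlled with the n_gpu_layers parameter.",
]
for doc in documents:
    rag.add(doc)

print(rag.ask_sync("How do I run models on the GPU?"))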

Recommended models

| Model | Size | Q4_K_M size | Notes |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | 8B | ~4.7 GB | Best balance |
| Llama 3.2 3B Instruct | 3B | ~2.0 GB | Fastest on CPU |
| Mistral 7B Instruct v0.3 | 7B | ~4.1 GB | Good instruction following |
| Phi-3.5 Mini Instruct | 3.8B | ~2.2 GB | Strong for small size |
| Gemma 2 9B Instruct | 9B | ~5.4 GB | Google's open model |
tip

Start with a Q4_K_M quantization — it strikes the best balance between quality and file size. Use Q8_0 for maximum quality if you have the VRAM.

note

LlamaCppLLM does not support native function calling. Use ReActAgent (which works via prompting) instead of FunctionCallingAgent.