llama.cpp

Run GGUF models entirely on-device with llama-cpp-python. No API key required. Works on CPU or GPU.

Install

pip install synapsekit[llamacpp]

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

For Apple Silicon (Metal):

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

Download a model

Download a GGUF file from Hugging Face:

# Example: Llama 3.1 8B quantized
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
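
If you prefer to stay in Python, the same file can be fetched with huggingface_hub (a standard Hugging Face API, independent of synapsekit):

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)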

Usage

from synapsekit.llm.llamacpp import LlamaCppLLM
from synapsekit import LLMConfig

config = LLMConfig(
    model="llama-3.1-8b",
    api_key="",  # no key needed
    provider="llamacpp",
)

llm = LlamaCppLLM(
    config,
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)

# Streaming
async for token in llm.stream("Explain gradient descent"):
    print(token, end="")

# Generate
response = await llm.generate("What is the capital of France?")
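
Both calls are asynchronous, so a standalone script needs an event loop. A minimal sketch, assuming the stream and generate methods behave as shown above:

import asyncio

from synapsekit import LLMConfig
from synapsekit.llm.llamacpp import LlamaCppLLM


async def main() -> None:
    config = LLMConfig(model="llama-3.1-8b", api_key="", provider="llamacpp")
    llm = LlamaCppLLM(
        config,
        model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    )

    # Stream tokens as they are produced.
    async for token in llm.stream("Explain gradient descent"):
        print(token, end="", flush=True)
    print()

    # Or request the full completion in one call.
    print(await llm.generate("What is the capital of France?"))


asyncio.run(main())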

GPU offloading

Use n_gpu_layers to offload layers to the GPU. Set to -1 to offload all layers:

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_gpu_layers=35,  # offload 35 layers to GPU
    n_ctx=4096,       # context window size
)
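
If the same script has to run on both GPU and CPU-only machines, you can pick n_gpu_layers at runtime. The heuristic below is an illustration, not part of synapsekit: it offloads everything on Apple Silicon (Metal) and stays on the CPU elsewhere.

import platform

# Assumption for illustration: Metal on Apple Silicon can usually hold all layers.
on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_gpu_layers=-1 if on_apple_silicon else 0,
    n_ctx=4096,
)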

Constructor parameters

| Parameter | Default | Description |
| --- | --- | --- |
| model_path | (required) | Path to the GGUF model file |
| n_ctx | 2048 | Context window size |
| n_gpu_layers | 0 | Number of layers to offload to GPU (0 = CPU only, -1 = all) |
| top_p | 0.95 | Top-p sampling parameter |

Any extra kwargs are forwarded directly to llama_cpp.Llama().
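
For example, llama_cpp.Llama accepts options such as n_threads, n_batch, and verbose, so they can be passed straight through (the values here are illustrative):

llm = LlamaCppLLM(
    config,
    model_path="./models/llama-3.1-8b.gguf",
    n_threads=8,     # forwarded to llama_cpp.Llama: CPU threads used for generation
    n_batch=512,     # forwarded to llama_cpp.Llama: prompt-processing batch size
    verbose=False,   # forwarded to llama_cpp.Llama: silence llama.cpp logging
)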

RAG with local models

from synapsekit import RAG

rag = RAG(
    model="./models/llama-3.1-8b.gguf",
    api_key="",
    provider="llamacpp",
)
rag.add("Your document text here")
answer = rag.ask_sync("Summarize the document.")
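
The same pattern extends to several documents; this sketch reuses only the add and ask_sync calls shown above (the document strings are placeholders):

documents = [
    "SynapseKit can run GGUF models locally through llama.cpp.",
    "GPU offloading is controlled with the n_gpu_layers parameter.",
]
for doc in documents:
    rag.add(doc)

print(rag.ask_sync("How do I run models on the GPU?"))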

Recommended models

| Model | Size | Q4_K_M size | Notes |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | 8B | ~4.7 GB | Best balance |
| Llama 3.2 3B Instruct | 3B | ~2.0 GB | Fastest on CPU |
| Mistral 7B Instruct v0.3 | 7B | ~4.1 GB | Good instruction following |
| Phi-3.5 Mini Instruct | 3.8B | ~2.2 GB | Strong for small size |
| Gemma 2 9B Instruct | 9B | ~5.4 GB | Google's open model |
tip

Start with a Q4_K_M quantization — it strikes the best balance between quality and file size. Use Q8_0 for maximum quality if you have the VRAM.

note

LlamaCppLLM does not support native function calling. Use ReActAgent (which works via prompting) instead of FunctionCallingAgent.