Cerebras
Cerebras provides ultra-fast inference on its custom Wafer-Scale Engine (WSE) hardware. With speeds exceeding 2,100 tokens/second, Cerebras is the fastest cloud inference option available for supported models.
Install
pip install "synapsekit[openai]"
Cerebras exposes an OpenAI-compatible API, so the integration requires the openai package (the quotes around the extra prevent shell glob expansion, e.g. in zsh).
Basic usage
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
llm = CerebrasLLM(LLMConfig(
model="llama3.1-70b",
api_key="csk-...",
))
response = await llm.generate("Explain large language models in three sentences.")
print(response)
# Large language models are trained on vast text datasets...
Streaming
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
llm = CerebrasLLM(LLMConfig(
model="llama3.1-8b",
api_key="csk-...",
))
async for token in llm.stream("Write a quicksort implementation in Python."):
print(token, end="", flush=True)
# def quicksort(arr):
# if len(arr) <= 1:
# return arr
# ...
Available models
| Model | Context | Speed (tok/s) | Best for |
|---|---|---|---|
| llama3.1-8b | 128K | ~2,100 | Ultra-fast, interactive tasks |
| llama3.1-70b | 128K | ~450 | High quality, still very fast |
| llama-3.3-70b | 128K | ~450 | Latest Llama 3.3 weights |
llama3.1-8b on Cerebras is typically 10x faster than the same model on GPU-based providers, making it ideal for chatbots and real-time applications.
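If your application mixes latency-sensitive and quality-sensitive paths, a small helper can pick between the two tiers. A minimal sketch (make_llm is a hypothetical name, not part of synapsekit):
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
def make_llm(interactive: bool) -> CerebrasLLM:
    # llama3.1-8b for latency-sensitive paths; llama3.1-70b where quality matters more
    model = "llama3.1-8b" if interactive else "llama3.1-70b"
    return CerebrasLLM(LLMConfig(model=model, api_key="csk-..."))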
Speed comparison
| Provider | Model | Median speed (tok/s) |
|---|---|---|
| Cerebras | Llama 3.1 8B | ~2,100 |
| Cerebras | Llama 3.1 70B | ~450 |
| Groq | Llama 3.1 8B | ~800 |
| Together AI | Llama 3.1 8B | ~200 |
| OpenAI | gpt-4o-mini | ~120 |
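Throughput numbers like these vary with prompt length, output length, and load. A rough way to measure them yourself is to time a streamed response. A sketch, assuming each streamed chunk corresponds to roughly one token:
import time
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
llm = CerebrasLLM(LLMConfig(model="llama3.1-8b", api_key="csk-..."))
async def measure_throughput(prompt: str) -> float:
    # Time a full streamed response; treats each streamed chunk as one token
    start = time.perf_counter()
    count = 0
    async for _ in llm.stream(prompt):
        count += 1
    return count / (time.perf_counter() - start)
tps = await measure_throughput("Write 500 words about the ocean.")
print(f"~{tps:.0f} tok/s")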
Function calling
Cerebras supports OpenAI-compatible function calling on Llama models:
from synapsekit import FunctionCallingAgent, tool
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
@tool
def get_stock_price(ticker: str) -> dict:
"""Get the current stock price for a ticker symbol."""
# In practice, call a real market data API
prices = {"AAPL": 189.30, "GOOG": 175.20, "MSFT": 415.50}
return {"ticker": ticker, "price": prices.get(ticker, 0.0), "currency": "USD"}
@tool
def calculate_portfolio_value(holdings: dict) -> float:
"""Calculate total portfolio value given a dict of {ticker: shares}."""
# Simplified calculation
prices = {"AAPL": 189.30, "GOOG": 175.20, "MSFT": 415.50}
total = sum(shares * prices.get(ticker, 0) for ticker, shares in holdings.items())
return round(total, 2)
llm = CerebrasLLM(LLMConfig(model="llama3.1-70b", api_key="csk-..."))
agent = FunctionCallingAgent(llm=llm, tools=[get_stock_price, calculate_portfolio_value])
answer = await agent.run("What is AAPL's price and what is my portfolio worth if I have 10 AAPL and 5 MSFT?")
print(answer)
# AAPL is trading at $189.30. Your portfolio (10 AAPL + 5 MSFT) is worth $3,970.50.
Raw call_with_tools
tools = [
{
"type": "function",
"function": {
"name": "run_sql_query",
"description": "Execute a SQL query and return results",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "SQL SELECT query"},
"limit": {"type": "integer", "default": 10},
},
"required": ["query"],
},
},
}
]
result = await llm.call_with_tools(
messages=[{"role": "user", "content": "Show me the top 5 users by order count"}],
tools=tools,
)
# result["tool_calls"] → [{"name": "run_sql_query", "arguments": {"query": "SELECT user_id, COUNT(*) FROM orders GROUP BY user_id ORDER BY COUNT(*) DESC LIMIT 5"}}]
Batch processing
For high-throughput workloads, run multiple concurrent requests:
import asyncio
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
llm = CerebrasLLM(LLMConfig(model="llama3.1-8b", api_key="csk-..."))
prompts = [
"Translate to Spanish: Hello world",
"Translate to French: Hello world",
"Translate to German: Hello world",
"Translate to Japanese: Hello world",
"Translate to Arabic: Hello world",
]
# Fire all requests concurrently
results = await asyncio.gather(*[llm.generate(p) for p in prompts])
for prompt, result in zip(prompts, results):
print(f"{prompt[:30]} → {result}")
# Translate to Spanish: Hello wo → Hola mundo
# Translate to French: Hello wor → Bonjour le monde
# ...
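asyncio.gather fires every request at once, which can trigger rate limits on large batches (see RateLimitError under Error handling below). A semaphore caps the number of in-flight requests. A minimal sketch; the limit of 8 is arbitrary:
import asyncio
semaphore = asyncio.Semaphore(8)  # arbitrary cap on concurrent in-flight requests
async def generate_bounded(prompt: str) -> str:
    async with semaphore:
        return await llm.generate(prompt)
results = await asyncio.gather(*[generate_bounded(p) for p in prompts])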
Cost tracking
from synapsekit import LLMConfig
from synapsekit.llm.cerebras import CerebrasLLM
from synapsekit.observability import CostTracker
tracker = CostTracker()
llm = CerebrasLLM(LLMConfig(model="llama3.1-70b", api_key="csk-..."))
llm.attach_tracker(tracker)
await llm.generate("Summarize the history of computing in 200 words.")
print(f"Cost: ${tracker.total_cost_usd:.6f}")
Custom base URL
llm = CerebrasLLM(
LLMConfig(model="llama3.1-70b", api_key="csk-..."),
base_url="http://localhost:8000/v1",
)
Parameters reference
| Parameter | Description |
|---|---|
| model | Cerebras model ID (e.g. llama3.1-70b) |
| api_key | Your Cerebras API key (starts with csk-) |
| temperature | Sampling temperature (0.0–1.0) |
| max_tokens | Maximum output tokens |
| base_url | Custom API base URL (default: https://api.cerebras.ai/v1) |
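Putting the table together, a fully specified config might look like this (assuming temperature and max_tokens are accepted as LLMConfig fields, as the table suggests):
llm = CerebrasLLM(LLMConfig(
    model="llama3.1-70b",
    api_key="csk-...",
    temperature=0.2,   # lower temperature for more deterministic output
    max_tokens=1024,   # cap the response length
))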
Error handling
from synapsekit.exceptions import LLMError, RateLimitError, AuthenticationError
try:
response = await llm.generate("Hello")
except AuthenticationError:
print("Invalid API key — get one at cloud.cerebras.ai")
except RateLimitError as e:
print(f"Rate limited. Retry after {e.retry_after}s")
except LLMError as e:
print(f"Cerebras error: {e}")
Cerebras is ideal for latency-sensitive use cases like streaming chatbots, real-time code completion, and interactive agents. The extreme token throughput means users see meaningful output almost instantly.