LLM Providers
SynapseKit supports 34 LLM providers. All extend BaseLLM and share the same async interface.
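Because every provider sits behind the same interface, swapping vendors is a one-line change. A minimal sketch, assuming illustrative class names and an `acomplete` method (the real API may differ):

```python
import asyncio

# Hypothetical imports -- module path, class names, and method names are
# assumptions based on the shared-BaseLLM design described above.
from synapsekit.llms import OpenAILLM, AnthropicLLM

async def main() -> None:
    # Any provider can stand in for any other because all extend BaseLLM.
    llm = OpenAILLM(model="gpt-4o-mini")  # or AnthropicLLM(model=...)
    reply = await llm.acomplete("Say hello in one word.")
    print(reply.text)

asyncio.run(main())
```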
📄️ LLM Overview
An overview of SynapseKit's LLM abstraction: every provider extends BaseLLM and exposes the same async interface.
📄️ OpenAI
Use OpenAI's GPT models with streaming, function calling, vision, and structured output.
📄️ Anthropic
Use Anthropic's Claude models with streaming, tool use, vision, and large context windows.
📄️ Ollama (Local)
Run open-source LLMs locally via Ollama. No API key required. Full privacy -- nothing leaves your machine.
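A sketch of local usage, assuming an `OllamaLLM` class that mirrors the shared interface (class and argument names are illustrative):

```python
import asyncio

from synapsekit.llms import OllamaLLM  # assumed module path and class name

async def main() -> None:
    # No API key: talks to a local Ollama server (default http://localhost:11434).
    llm = OllamaLLM(model="llama3.1", base_url="http://localhost:11434")
    reply = await llm.acomplete("Summarize what Ollama does in one sentence.")
    print(reply.text)

asyncio.run(main())
```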
📄️ Cohere
Cohere provides Command R and Command R+ models optimized for retrieval-augmented generation, along with best-in-class embedding and reranking models.
📄️ Mistral AI
Mistral AI provides high-quality European AI models via an OpenAI-compatible API. Mistral models are known for their strong reasoning, code generation, and function calling capabilities at competitive pricing.
📄️ Google Gemini
Use Google's Gemini models with up to 1M token context, multimodal inputs, and native function calling.
📄️ AWS Bedrock
Run Claude, Titan, Llama, Mistral, and other models via AWS Bedrock. Uses your AWS credentials -- no separate AI vendor account needed.
📄️ Azure OpenAI
Use OpenAI models (GPT-4o, GPT-4o-mini, o1, etc.) hosted on your own Azure resource. Azure OpenAI provides enterprise compliance features: data residency, private networking, Azure AD authentication, and SLA guarantees.
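Azure configuration typically takes your resource endpoint and deployment name rather than a bare model id; a sketch assuming an `AzureOpenAILLM` class (constructor arguments are illustrative):

```python
import asyncio
import os

from synapsekit.llms import AzureOpenAILLM  # assumed class name

async def main() -> None:
    llm = AzureOpenAILLM(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://my-resource.openai.azure.com
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        deployment="my-gpt-4o-deployment",  # your Azure deployment name, not the model id
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```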
📄️ Groq
Ultra-fast inference with Groq's LPU (Language Processing Unit) hardware. Supports Llama, Mixtral, Gemma, and other open models.
📄️ DeepSeek
DeepSeek models via their OpenAI-compatible API. Excellent cost-to-performance ratio with strong reasoning capabilities.
📄️ OpenRouter
OpenRouter is a unified API that provides access to 200+ models from OpenAI, Anthropic, Meta, Mistral, Google, and more -- with automatic fallback and load balancing.
📄️ Together AI
Together AI provides fast, scalable inference for open-source models including Llama, Mistral, Qwen, and more -- with competitive pricing.
📄️ Fireworks AI
Fireworks AI provides optimized inference for open-source models with an OpenAI-compatible API. It offers some of the lowest latency for popular models like Llama and Mixtral, with their FireFunction models purpose-built for reliable tool use.
📄️ Perplexity AI
Perplexity AI provides search-augmented LLMs with real-time web access. Unlike standard LLMs, Perplexity's Sonar models automatically search the web and include citations in their responses — making them ideal for research, news monitoring, and fact-checking tasks.
📄️ Cerebras
Cerebras provides ultra-fast inference on their custom Wafer-Scale Engine (WSE) hardware. With speeds exceeding 2,100 tokens/second, Cerebras is the fastest cloud inference option available for supported models.
📄️ Google Vertex AI
Use Gemini and other Google-hosted models through Google Cloud's Vertex AI platform, authenticating with your Google Cloud credentials instead of a separate API key.
📄️ Moonshot AI
Moonshot AI's Kimi models — long-context Chinese-English bilingual LLMs with up to 128K context.
📄️ Zhipu AI
Zhipu AI's GLM (General Language Model) series — powerful Chinese-English bilingual models with function calling support.
📄️ Cloudflare AI
Cloudflare Workers AI — run inference on Cloudflare's global GPU network. Supports models via @cf/ and @hf/ model identifiers. No SDK required — uses Cloudflare's native REST API.
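A sketch assuming a `CloudflareLLM` class; the `@cf/` identifier format is Cloudflare's, while the class and parameter names here are assumptions:

```python
import asyncio
import os

from synapsekit.llms import CloudflareLLM  # assumed class name

async def main() -> None:
    llm = CloudflareLLM(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        api_token=os.environ["CLOUDFLARE_API_TOKEN"],
        model="@cf/meta/llama-3.1-8b-instruct",  # @cf/ or @hf/ identifier
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```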
📄️ AI21 Labs
AI21 Labs' Jamba models — a hybrid SSM-Transformer architecture offering long context windows and low inference cost.
📄️ Databricks
Databricks Foundation Model APIs — access models like DBRX, Llama, Mixtral, and others hosted on your Databricks workspace via an OpenAI-compatible endpoint.
📄️ Baidu ERNIE
Baidu's ERNIE Bot (文心一言) — a family of Chinese-English bilingual LLMs with strong performance on Chinese language tasks.
📄️ llama.cpp
Run GGUF models entirely on-device with llama-cpp-python. No API key required. Works on CPU or GPU.
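A sketch assuming a `LlamaCppLLM` wrapper; the class name and `model_path` argument are illustrative, and `n_gpu_layers` is assumed to pass through to llama-cpp-python:

```python
import asyncio

from synapsekit.llms import LlamaCppLLM  # assumed class name

async def main() -> None:
    # Loads a local GGUF file; no API key and no network access needed.
    llm = LlamaCppLLM(
        model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # illustrative path
        n_gpu_layers=-1,  # offload all layers to GPU if available; 0 for CPU-only
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```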
📄️ Minimax AI
Minimax's language models with SSE streaming support. Requires a group_id in addition to an API key.
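Because Minimax authenticates with both values, construction looks roughly like this (class, parameter, and method names are assumptions):

```python
import asyncio
import os

from synapsekit.llms import MinimaxLLM  # assumed class name

async def main() -> None:
    llm = MinimaxLLM(
        api_key=os.environ["MINIMAX_API_KEY"],
        group_id=os.environ["MINIMAX_GROUP_ID"],  # required alongside the API key
        model="abab6.5s-chat",
    )
    # SSE streaming: iterate chunks as they arrive (astream is an assumed method name).
    async for chunk in llm.astream("Ping"):
        print(chunk.text, end="", flush=True)

asyncio.run(main())
```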
📄️ Aleph Alpha
Aleph Alpha's Luminous and Pharia language models — European-built LLMs with strong German and multilingual capabilities.
📄️ Hugging Face
Access thousands of open-source models via the Hugging Face Inference API. Supports both the free Serverless API and Dedicated Inference Endpoints.
📄️ SambaNova
SambaNova Cloud provides fast inference on open-source models including Meta Llama, Qwen, and others, using the OpenAI-compatible API.
📄️ xAI (Grok)
xAI's Grok models via the OpenAI-compatible API.
📄️ NovitaAI
NovitaAI hosts popular open models (Llama, Mistral, Qwen, etc.) via an OpenAI-compatible API.
📄️ Writer (Palmyra)
Writer's Palmyra models via the OpenAI-compatible API. Includes domain-specific models for medicine and finance.
📄️ LM Studio (Local)
Run local LLMs via LM Studio's OpenAI-compatible server. No API key required. Everything runs on your machine.
📄️ GPT4All
Run GGUF models entirely on-device using GPT4All Python bindings. No API key, no internet connection required after model download.
📄️ vLLM
High-throughput LLM inference via vLLM's OpenAI-compatible API. Run self-hosted models with PagedAttention for maximum GPU utilization.
📄️ Replicate
Run thousands of open-source models via Replicate's cloud hosting platform — Llama, Mistral, SDXL, Whisper, and more — with a single API key and no GPU management.
📄️ Caching & Retries
SynapseKit provides opt-in response caching and exponential backoff retries for all LLM providers. Both are configured through LLMConfig and are disabled by default — zero behavior change for existing code.
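A sketch of enabling both through LLMConfig; the exact field names below are illustrative assumptions, only the opt-in behavior is documented:

```python
from synapsekit.llms import OpenAILLM
from synapsekit.config import LLMConfig  # assumed module path

# Both features are opt-in; omitting them keeps the default (disabled) behavior.
config = LLMConfig(
    cache_enabled=True,       # assumed field: cache responses for identical prompts
    cache_ttl_seconds=3600,   # assumed field: expire cached entries after an hour
    max_retries=3,            # assumed field: retry transient failures
    retry_backoff_base=2.0,   # assumed field: exponential backoff multiplier
)

llm = OpenAILLM(model="gpt-4o-mini", config=config)
```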
📄️ CostRouter & FallbackChain
SynapseKit provides two drop-in BaseLLM subclasses for intelligent model routing: CostRouter (cheapest model meeting quality constraints) and FallbackChain (ordered priority with cascading fallback).
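Because both are BaseLLM subclasses, they slot in anywhere a single provider would. A sketch with assumed constructor signatures (the `min_quality` keyword in particular is an illustrative guess):

```python
from synapsekit.llms import AnthropicLLM, GroqLLM, OpenAILLM
from synapsekit.routing import CostRouter, FallbackChain  # assumed module path

# FallbackChain: try providers in priority order, cascading on failure.
llm = FallbackChain([
    OpenAILLM(model="gpt-4o-mini"),                  # primary
    AnthropicLLM(model="claude-3-5-haiku-latest"),   # fallback
    GroqLLM(model="llama-3.1-8b-instant"),           # last resort
])

# CostRouter: pick the cheapest candidate meeting a quality constraint.
router = CostRouter(
    candidates=[
        OpenAILLM(model="gpt-4o-mini"),
        AnthropicLLM(model="claude-3-5-haiku-latest"),
    ],
    min_quality=0.8,  # assumed parameter for the quality constraint
)
```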