LLM Providers
SynapseKit supports 34 LLM providers. All extend BaseLLM and share the same async interface.
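Because every provider sits behind the same interface, swapping vendors is a one-line change. A minimal sketch, assuming illustrative class names and an `acomplete` method (the real API may differ):

```python
import asyncio

# Hypothetical imports -- module path, class names, and method names are
# assumptions based on the shared-BaseLLM design described above.
from synapsekit.llms import OpenAILLM, AnthropicLLM

async def main() -> None:
    # Any provider can stand in for any other because all extend BaseLLM.
    llm = OpenAILLM(model="gpt-4o-mini")  # or AnthropicLLM(model=...)
    reply = await llm.acomplete("Say hello in one word.")
    print(reply.text)

asyncio.run(main())
```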
📄️ LLM Overview
An overview of SynapseKit's LLM abstraction: every provider extends BaseLLM and exposes the same async interface.
📄️ OpenAI
Use OpenAI's GPT models with streaming, function calling, vision, and structured output.
📄️ Anthropic
Use Anthropic's Claude models with streaming, tool use, vision, and large context windows.
📄️ Ollama (Local)
Run open-source LLMs locally via Ollama. No API key required. Full privacy -- nothing leaves your machine.
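A sketch of local usage, assuming an `OllamaLLM` class that mirrors the shared interface (class and argument names are illustrative):

```python
import asyncio

from synapsekit.llms import OllamaLLM  # assumed module path and class name

async def main() -> None:
    # No API key: talks to a local Ollama server (default http://localhost:11434).
    llm = OllamaLLM(model="llama3.1", base_url="http://localhost:11434")
    reply = await llm.acomplete("Summarize what Ollama does in one sentence.")
    print(reply.text)

asyncio.run(main())
```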
📄️ Cohere
Cohere provides Command R and Command R+ models optimized for retrieval-augmented generation, along with best-in-class embedding and reranking models.
📄️ Mistral AI
Mistral AI provides high-quality European AI models via an OpenAI-compatible API. Mistral models are known for their strong reasoning, code generation, and function calling capabilities at competitive pricing.
📄️ Google Gemini
Use Google's Gemini models with up to 1M token context, multimodal inputs, and native function calling.
📄️ AWS Bedrock
Run Claude, Titan, Llama, Mistral, and other models via AWS Bedrock. Uses your AWS credentials -- no separate AI vendor account needed.
📄️ Azure OpenAI
Use OpenAI models (GPT-4o, GPT-4o-mini, o1, etc.) hosted on your own Azure resource. Azure OpenAI provides enterprise compliance features: data residency, private networking, Azure AD authentication, and SLA guarantees.
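Azure configuration typically takes your resource endpoint and deployment name rather than a bare model id; a sketch assuming an `AzureOpenAILLM` class (constructor arguments are illustrative):

```python
import asyncio
import os

from synapsekit.llms import AzureOpenAILLM  # assumed class name

async def main() -> None:
    llm = AzureOpenAILLM(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://my-resource.openai.azure.com
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        deployment="my-gpt-4o-deployment",  # your Azure deployment name, not the model id
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```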
📄️ Groq
Ultra-fast inference with Groq's LPU (Language Processing Unit) hardware. Supports Llama, Mixtral, Gemma, and other open models.
📄️ DeepSeek
DeepSeek models via their OpenAI-compatible API. Excellent cost-to-performance ratio with strong reasoning capabilities.
📄️ OpenRouter
OpenRouter is a unified API that provides access to 200+ models from OpenAI, Anthropic, Meta, Mistral, Google, and more -- with automatic fallback and load balancing.
📄️ Together AI
Together AI provides fast, scalable inference for open-source models including Llama, Mistral, Qwen, and more -- with competitive pricing.
📄️ Fireworks AI
Fireworks AI provides optimized inference for open-source models with an OpenAI-compatible API. It offers some of the lowest latency for popular models like Llama and Mixtral, with their FireFunction models purpose-built for reliable tool use.
📄️ Perplexity AI
Perplexity AI provides search-augmented LLMs with real-time web access. Unlike standard LLMs, Perplexity's Sonar models automatically search the web and include citations in their responses — making them ideal for research, news monitoring, and fact-checking tasks.
📄️ Cerebras
Cerebras provides ultra-fast inference on their custom Wafer-Scale Engine (WSE) hardware. With speeds exceeding 2,100 tokens/second, Cerebras is the fastest cloud inference option available for supported models.
📄️ Google Vertex AI
Use Gemini and other Google-hosted models through Google Cloud's Vertex AI platform, authenticating with your Google Cloud credentials instead of a separate API key.
📄️ Moonshot AI
Moonshot AI's Kimi models — long-context Chinese-English bilingual LLMs with up to 128K context.
📄️ Zhipu AI
Zhipu AI's GLM (General Language Model) series — powerful Chinese-English bilingual models with function calling support.
📄️ Cloudflare AI
Cloudflare Workers AI — run inference on Cloudflare's global GPU network. Supports models via @cf/ and @hf/ model identifiers. No SDK required — uses Cloudflare's native REST API.
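A sketch assuming a `CloudflareLLM` class; the `@cf/` identifier format is Cloudflare's, while the class and parameter names here are assumptions:

```python
import asyncio
import os

from synapsekit.llms import CloudflareLLM  # assumed class name

async def main() -> None:
    llm = CloudflareLLM(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        api_token=os.environ["CLOUDFLARE_API_TOKEN"],
        model="@cf/meta/llama-3.1-8b-instruct",  # @cf/ or @hf/ identifier
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```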
📄️ AI21 Labs
AI21 Labs' Jamba models — a hybrid SSM-Transformer architecture offering long context windows and low inference cost.
📄️ Databricks
Databricks Foundation Model APIs — access models like DBRX, Llama, Mixtral, and others hosted on your Databricks workspace via an OpenAI-compatible endpoint.
📄️ Baidu ERNIE
Baidu's ERNIE Bot (文心一言) — a family of Chinese-English bilingual LLMs with strong performance on Chinese language tasks.
📄️ llama.cpp
Run GGUF models entirely on-device with llama-cpp-python. No API key required. Works on CPU or GPU.
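A sketch assuming a `LlamaCppLLM` wrapper; the class name and `model_path` argument are illustrative, and `n_gpu_layers` is assumed to pass through to llama-cpp-python:

```python
import asyncio

from synapsekit.llms import LlamaCppLLM  # assumed class name

async def main() -> None:
    # Loads a local GGUF file; no API key and no network access needed.
    llm = LlamaCppLLM(
        model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # illustrative path
        n_gpu_layers=-1,  # offload all layers to GPU if available; 0 for CPU-only
    )
    reply = await llm.acomplete("Ping")
    print(reply.text)

asyncio.run(main())
```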
📄️ Minimax AI
Minimax's language models with SSE streaming support. Requires a group_id in addition to an API key.
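Because Minimax authenticates with both values, construction looks roughly like this (class, parameter, and method names are assumptions):

```python
import asyncio
import os

from synapsekit.llms import MinimaxLLM  # assumed class name

async def main() -> None:
    llm = MinimaxLLM(
        api_key=os.environ["MINIMAX_API_KEY"],
        group_id=os.environ["MINIMAX_GROUP_ID"],  # required alongside the API key
        model="abab6.5s-chat",
    )
    # SSE streaming: iterate chunks as they arrive (astream is an assumed method name).
    async for chunk in llm.astream("Ping"):
        print(chunk.text, end="", flush=True)

asyncio.run(main())
```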
📄️ Aleph Alpha
Aleph Alpha's Luminous and Pharia language models — European-built LLMs with strong German and multilingual capabilities.
📄️ Hugging Face
Access thousands of open-source models via the Hugging Face Inference API. Supports both the free Serverless API and Dedicated Inference Endpoints.
📄️ SambaNova
SambaNova Cloud provides fast inference on open-source models including Meta Llama, Qwen, and others, using the OpenAI-compatible API.
📄️ xAI (Grok)
xAI's Grok models via the OpenAI-compatible API.
📄️ NovitaAI
NovitaAI hosts popular open models (Llama, Mistral, Qwen, etc.) via an OpenAI-compatible API.
📄️ Writer (Palmyra)
Writer's Palmyra models via the OpenAI-compatible API. Includes domain-specific models for medicine and finance.
📄️ LM Studio (Local)
Run local LLMs via LM Studio's OpenAI-compatible server. No API key required. Everything runs on your machine.
📄️ GPT4All
Run GGUF models entirely on-device using GPT4All Python bindings. No API key, no internet connection required after model download.
📄️ vLLM
High-throughput LLM inference via vLLM's OpenAI-compatible API. Run self-hosted models with PagedAttention for maximum GPU utilization.
📄️ Replicate
Run thousands of open-source models via Replicate's cloud hosting platform — Llama, Mistral, SDXL, Whisper, and more — with a single API key and no GPU management.
📄️ Caching & Retries
SynapseKit provides opt-in response caching and exponential backoff retries for all LLM providers. Both are configured through LLMConfig and are disabled by default — zero behavior change for existing code.
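A sketch of enabling both through LLMConfig; the exact field names below are illustrative assumptions, only the opt-in behavior is documented:

```python
from synapsekit.llms import OpenAILLM
from synapsekit.config import LLMConfig  # assumed module path

# Both features are opt-in; omitting them keeps the default (disabled) behavior.
config = LLMConfig(
    cache_enabled=True,       # assumed field: cache responses for identical prompts
    cache_ttl_seconds=3600,   # assumed field: expire cached entries after an hour
    max_retries=3,            # assumed field: retry transient failures
    retry_backoff_base=2.0,   # assumed field: exponential backoff multiplier
)

llm = OpenAILLM(model="gpt-4o-mini", config=config)
```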
📄️ CostRouter & FallbackChain
SynapseKit provides two drop-in BaseLLM subclasses for intelligent model routing: CostRouter (cheapest model meeting quality constraints) and FallbackChain (ordered priority with cascading fallback).
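Because both are BaseLLM subclasses, they slot in anywhere a single provider would. A sketch with assumed constructor signatures (the `min_quality` keyword in particular is an illustrative guess):

```python
from synapsekit.llms import AnthropicLLM, GroqLLM, OpenAILLM
from synapsekit.routing import CostRouter, FallbackChain  # assumed module path

# FallbackChain: try providers in priority order, cascading on failure.
llm = FallbackChain([
    OpenAILLM(model="gpt-4o-mini"),                  # primary
    AnthropicLLM(model="claude-3-5-haiku-latest"),   # fallback
    GroqLLM(model="llama-3.1-8b-instant"),           # last resort
])

# CostRouter: pick the cheapest candidate meeting a quality constraint.
router = CostRouter(
    candidates=[
        OpenAILLM(model="gpt-4o-mini"),
        AnthropicLLM(model="claude-3-5-haiku-latest"),
    ],
    min_quality=0.8,  # assumed parameter for the quality constraint
)
```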