# Retriever

The `Retriever` finds the most relevant chunks for a query using vector similarity and optional BM25 reranking.
## Basic usage

```python
from synapsekit.retriever import Retriever
from synapsekit.vectorstore import InMemoryVectorStore
from synapsekit.embeddings import SynapsekitEmbeddings

embeddings = SynapsekitEmbeddings()
store = InMemoryVectorStore(embeddings)
store.add(["Chunk one...", "Chunk two...", "Chunk three..."])

retriever = Retriever(store)
results = await retriever.retrieve("Your query here", top_k=3)
for doc in results:
    print(doc.text, doc.score)
```
## BM25 reranking

Enable hybrid retrieval (vector + BM25) for better precision:

```python
retriever = Retriever(store, use_bm25=True, bm25_weight=0.3)
results = await retriever.retrieve("Your query", top_k=5)
```

Requires `rank-bm25` (included as a hard dependency).
## Metadata filtering

Filter results by metadata before ranking:

```python
results = await retriever.retrieve(
    "Your query",
    top_k=5,
    metadata_filter={"source": "report.pdf"},
)
```

Only documents whose metadata contains all of the specified key-value pairs are considered.
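The filter semantics can be sketched as a simple predicate (a standalone illustration, not the library's internal code):

```python
def matches(metadata: dict, metadata_filter: dict) -> bool:
    """A document passes only if every filter key is present with an equal value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())

# Extra metadata keys are allowed; missing or mismatched keys fail the filter.
assert matches({"source": "report.pdf", "page": 3}, {"source": "report.pdf"})
assert not matches({"source": "paper.pdf"}, {"source": "report.pdf"})
```

Note that an empty filter matches every document, which is why `metadata_filter=None` and `metadata_filter={}` behave the same way under this predicate.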
## MMR retrieval (diversity)

Maximal Marginal Relevance (MMR) balances relevance with diversity to reduce redundant results:

```python
results = await retriever.retrieve_mmr(
    "Your query",
    top_k=5,
    lambda_mult=0.5,  # 0 = max diversity, 1 = max relevance
    fetch_k=20,       # Initial candidate pool size
)
```

MMR greedily selects documents that maximize:

```
lambda * relevance(query, doc) - (1 - lambda) * max_similarity(doc, selected_docs)
```
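The greedy loop above can be sketched over precomputed similarity scores (a minimal standalone illustration, not the library's implementation):

```python
def mmr_select(query_sim, doc_sims, top_k, lambda_mult):
    """Greedy MMR over precomputed similarities.

    query_sim[i]   -- similarity of doc i to the query
    doc_sims[i][j] -- similarity between docs i and j
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < top_k:
        def mmr_score(i):
            # Penalize similarity to anything already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; with lambda_mult=0.5, MMR picks the most
# relevant doc 0 first, then skips its duplicate in favour of diverse doc 2.
query_sim = [0.9, 0.85, 0.4]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sims, top_k=2, lambda_mult=0.5))  # [0, 2]
```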
## RAG Fusion

Generate multiple query variations with an LLM and fuse the results using Reciprocal Rank Fusion (RRF) for better recall:

```python
from synapsekit import RAGFusionRetriever

fusion = RAGFusionRetriever(
    retriever=retriever,
    llm=llm,
    num_queries=3,  # Number of query variations to generate
    rrf_k=60,       # RRF constant (higher = less aggressive reranking)
)
results = await fusion.retrieve("What is quantum computing?", top_k=5)
```

The process:

- The LLM generates `num_queries` variations of your query
- Each variation (plus the original) is used to retrieve results
- Results are fused using Reciprocal Rank Fusion scoring
- Documents appearing in multiple result sets rank higher
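Reciprocal Rank Fusion itself is only a few lines. This standalone sketch scores each document as `1 / (rrf_k + rank + 1)` per ranking (the unweighted form of the RRF formula used elsewhere in this library) and sums across rankings:

```python
def rrf_fuse(rankings, rrf_k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (rrf_k + rank + 1) per document; documents
    appearing in multiple lists accumulate score and rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears in both result sets, so it outranks either list's top hit.
fused = rrf_fuse([["a", "b", "c"], ["b", "d"]])
print(fused[0])  # "b"
```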
## Contextual Retrieval

Inspired by Anthropic's Contextual Retrieval approach. Before embedding, each chunk is enriched with a short LLM-generated context sentence, improving accuracy for ambiguous chunks:

```python
from synapsekit import ContextualRetriever

cr = ContextualRetriever(
    retriever=retriever,
    llm=llm,
)

# Add chunks — each gets a context sentence prepended before embedding
await cr.add_with_context(["chunk one...", "chunk two..."])

# Retrieve as normal
results = await cr.retrieve("What is quantum computing?", top_k=5)
```

The process:

- For each chunk, the LLM generates a 1-2 sentence context
- The context is prepended to the chunk before embedding
- At retrieval time, the enriched embeddings improve search accuracy

You can customize the context generation prompt:

```python
cr = ContextualRetriever(
    retriever=retriever,
    llm=llm,
    context_prompt="Summarize this chunk in one sentence:\n{chunk}",
)
```
## Sentence Window Retrieval

Embeds individual sentences for fine-grained search, but returns a window of surrounding sentences for richer context:

```python
from synapsekit import SentenceWindowRetriever

swr = SentenceWindowRetriever(
    retriever=retriever,
    window_size=2,  # Include 2 sentences before and after the match
)

# Add full documents — they're split into sentences automatically
await swr.add_documents(["Full document text here. With multiple sentences. And more."])

# Retrieve — matched sentences are expanded with surrounding context
results = await swr.retrieve("query", top_k=3)
```

The process:

- Documents are split into individual sentences
- Each sentence is embedded independently for fine-grained matching
- At retrieval time, matched sentences are expanded with `window_size` surrounding sentences on each side
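The expansion step can be sketched as simple list slicing, clamped at document boundaries (illustrative only; `expand_window` is not a library function):

```python
def expand_window(sentences, match_index, window_size):
    """Return the matched sentence plus window_size neighbours on each side."""
    start = max(0, match_index - window_size)
    end = min(len(sentences), match_index + window_size + 1)
    return " ".join(sentences[start:end])

sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]
print(expand_window(sentences, match_index=2, window_size=1))  # "S1. S2. S3."
print(expand_window(sentences, match_index=0, window_size=2))  # "S0. S1. S2."
```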
## Self-Query Retrieval

The `SelfQueryRetriever` uses an LLM to decompose a natural-language question into a semantic search query and structured metadata filters. This automates the process of extracting filters from user questions.

```python
from synapsekit import SelfQueryRetriever

sqr = SelfQueryRetriever(
    retriever=retriever,
    llm=llm,
    metadata_fields=["source", "author", "year", "category"],
)

# The LLM extracts filters automatically
results = await sqr.retrieve("Papers by John about ML from 2024", top_k=5)
```

The process:

- The LLM analyzes the question and extracts a semantic query (`"ML papers"`) and metadata filters (`{"author": "John", "year": "2024"}`)
- The semantic query is used for vector search
- The metadata filters are applied to narrow results
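The decomposition step can be sketched as parsing a structured LLM response. The JSON shape below (`{"query": ..., "filters": ...}`) is an assumption for illustration; the actual format `SelfQueryRetriever` uses internally is not specified here:

```python
import json

def parse_decomposition(llm_output: str, allowed_fields: list):
    """Parse a hypothetical LLM decomposition into (semantic_query, filters).

    Assumes the LLM was prompted to return JSON like
    {"query": "...", "filters": {...}} -- treat this as illustrative only.
    """
    data = json.loads(llm_output)
    # Keep only filters on fields the caller declared in metadata_fields.
    filters = {k: v for k, v in data.get("filters", {}).items() if k in allowed_fields}
    return data.get("query", ""), filters

query, filters = parse_decomposition(
    '{"query": "ML papers", "filters": {"author": "John", "year": "2024", "mood": "x"}}',
    allowed_fields=["source", "author", "year", "category"],
)
print(query, filters)  # ML papers {'author': 'John', 'year': '2024'}
```

Restricting filters to the declared `metadata_fields` guards against the LLM hallucinating filter keys that don't exist in your store.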
### Inspecting extracted filters

Use `retrieve_with_filters()` to see what the LLM extracted:

```python
results, info = await sqr.retrieve_with_filters(
    "Papers by John about ML from 2024", top_k=5
)
print(info["query"])    # "ML papers"
print(info["filters"])  # {"author": "John", "year": "2024"}
```

### Custom prompt

Override the default decomposition prompt:

```python
sqr = SelfQueryRetriever(
    retriever=retriever,
    llm=llm,
    metadata_fields=["source", "year"],
    prompt="Custom prompt with {fields} and {question} placeholders...",
)
```
## Parent Document Retrieval

The `ParentDocumentRetriever` embeds small chunks for precise matching but returns full parent documents for richer context:

```python
from synapsekit import ParentDocumentRetriever

pdr = ParentDocumentRetriever(
    retriever=retriever,
    chunk_size=200,
    chunk_overlap=50,
)

# Add full documents — they're chunked internally
await pdr.add_documents(["Full document one...", "Full document two..."])

# Retrieve — returns full parent documents, not small chunks
results = await pdr.retrieve("query", top_k=3)
```

The process:

- Documents are split into small overlapping chunks (controlled by `chunk_size` and `chunk_overlap`)
- Each chunk is embedded and stored with a reference to its parent document
- At retrieval time, matched chunks are traced back to their parent documents
- Duplicate parents are deduplicated — each parent appears at most once

This is ideal when you need the precision of small-chunk search but want to feed the LLM the full document for context.

### Adding documents with metadata

```python
await pdr.add_documents(
    ["Document one...", "Document two..."],
    metadata=[{"source": "report.pdf"}, {"source": "paper.pdf"}],
)
```

Metadata is propagated to all chunks of a document.
## Cross-Encoder Reranking

The `CrossEncoderReranker` uses a cross-encoder model to rerank retrieval results for higher precision. Cross-encoders score query-document pairs jointly, giving much more accurate relevance scores than bi-encoder similarity alone.

```python
from synapsekit import CrossEncoderReranker

reranker = CrossEncoderReranker(
    retriever=retriever,
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    fetch_k=20,  # Initial candidates to retrieve before reranking
)
results = await reranker.retrieve("What is RAG?", top_k=5)
```

The process:

- `fetch_k` candidates are retrieved using standard vector search
- Each candidate is scored jointly with the query using the cross-encoder
- Results are reranked by cross-encoder score and the top `top_k` are returned

### Getting scores

Use `retrieve_with_scores()` to see the cross-encoder scores:

```python
results = await reranker.retrieve_with_scores("What is RAG?", top_k=5)
for r in results:
    print(r["text"], r["cross_encoder_score"])
```

Requires `sentence-transformers`: `pip install synapsekit[semantic]`
## CRAG (Corrective RAG)

The `CRAGRetriever` implements self-correcting retrieval: it retrieves candidates, grades each for relevance using an LLM, and rewrites the query to retry if too few documents pass the relevance check.

```python
from synapsekit import CRAGRetriever

crag = CRAGRetriever(
    retriever=retriever,
    llm=llm,
    relevance_threshold=0.5,  # Fraction of docs that must be relevant
    max_retries=1,            # Max query rewrites before giving up
)
results = await crag.retrieve("What is quantum computing?", top_k=5)
```

The process:

- Retrieve `top_k` candidates using the base retriever
- The LLM grades each document as "relevant" or "irrelevant" to the query
- If fewer than a `relevance_threshold` fraction pass, the LLM rewrites the query
- Retry retrieval with the rewritten query (up to `max_retries` times)
- Return only the documents that passed relevance grading
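The retry decision reduces to a threshold check on the grade fractions. A standalone sketch (not the library's code) makes the boundary behaviour concrete:

```python
def needs_rewrite(grades, relevance_threshold=0.5):
    """True if the fraction of relevant docs falls below the threshold."""
    if not grades:
        return True  # Nothing retrieved: definitely retry.
    relevant = sum(1 for g in grades if g == "relevant")
    return relevant / len(grades) < relevance_threshold

# 2 of 5 docs relevant (0.4) is below a 0.5 threshold, so rewrite and retry.
assert needs_rewrite(["relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"])
# 2 of 3 (0.67) passes, so the original results are kept.
assert not needs_rewrite(["relevant", "relevant", "irrelevant"])
```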
### Inspecting grades

Use `retrieve_with_grades()` to see grading details:

```python
results, info = await crag.retrieve_with_grades("query", top_k=5)
print(info["relevant_count"])   # Number of relevant docs
print(info["total_count"])      # Total docs retrieved
print(info["query_rewritten"])  # Whether the query was rewritten
print(info["final_query"])      # The (possibly rewritten) query used
```
## Query Decomposition

The `QueryDecompositionRetriever` uses an LLM to break complex queries into simpler sub-queries, retrieves for each, and deduplicates the results:

```python
from synapsekit import QueryDecompositionRetriever

qdr = QueryDecompositionRetriever(
    retriever=retriever,
    llm=llm,
    num_sub_queries=3,  # Number of sub-queries to generate
)
results = await qdr.retrieve("Compare quantum and classical computing for ML", top_k=5)
```

The process:

- The LLM decomposes the query into `num_sub_queries` simpler sub-queries
- Each sub-query is used to retrieve results independently
- Results are deduplicated and returned

### Inspecting sub-queries

```python
results, sub_queries = await qdr.retrieve_with_sub_queries("query", top_k=5)
print(sub_queries)  # ["What is quantum computing?", "What is classical computing?", ...]
```
## Contextual Compression

The `ContextualCompressionRetriever` retrieves documents, then uses an LLM to compress each one to only the content relevant to the query:

```python
from synapsekit import ContextualCompressionRetriever

ccr = ContextualCompressionRetriever(
    retriever=retriever,
    llm=llm,
    fetch_k=10,  # Retrieve this many, then compress
)
results = await ccr.retrieve("What is RAG?", top_k=5)
```

The process:

- Retrieve `fetch_k` candidates using the base retriever
- The LLM compresses each document, extracting only content relevant to the query
- Documents the LLM marks as "NOT_RELEVANT" are filtered out
- The top `top_k` compressed results are returned
## Ensemble Retrieval

The `EnsembleRetriever` fuses results from multiple retrievers using weighted Reciprocal Rank Fusion (RRF):

```python
from synapsekit import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[retriever_a, retriever_b],
    weights=[0.7, 0.3],  # Optional, defaults to equal weights
    rrf_k=60,            # RRF constant
)
results = await ensemble.retrieve("What is RAG?", top_k=5)
```

The process:

- Each retriever independently retrieves candidates
- Results are scored using weighted RRF: `score = weight / (rrf_k + rank + 1)`
- Scores are summed across retrievers for documents appearing in multiple result sets
- Final results are sorted by fused score
## Cohere Reranking

The `CohereReranker` uses Cohere's rerank models to rerank retrieval results for higher precision. Unlike `CrossEncoderReranker` (a local model), this uses the Cohere Rerank API.

```python
from synapsekit import CohereReranker

reranker = CohereReranker(
    retriever=retriever,
    model="rerank-v3.5",
    fetch_k=20,  # Initial candidates to retrieve before reranking
)
results = await reranker.retrieve("What is RAG?", top_k=5)
```

The process:

- `fetch_k` candidates are retrieved using standard vector search
- Candidates are sent to the Cohere Rerank API
- Results are reranked by relevance score and the top `top_k` are returned

### Getting scores

Use `retrieve_with_scores()` to see the Cohere relevance scores:

```python
results = await reranker.retrieve_with_scores("What is RAG?", top_k=5)
for r in results:
    print(r["text"], r["relevance_score"])
```

### API key

The API key is resolved in order:

1. The `api_key` parameter
2. The `CO_API_KEY` environment variable

Requires `cohere`: `pip install synapsekit[cohere]`
## Step-Back Retrieval

The `StepBackRetriever` generates a more abstract "step-back" question using an LLM, retrieves for both the original and step-back queries in parallel, and merges the deduplicated results. This improves retrieval for specific or narrow questions by also searching with a broader perspective.

```python
from synapsekit import StepBackRetriever

step_back = StepBackRetriever(
    retriever=retriever,
    llm=llm,
)
results = await step_back.retrieve("What is the melting point of gold?", top_k=5)
```

The process:

- The LLM generates a step-back (more abstract) question from the original query
- Both the original and step-back queries are used to retrieve results in parallel
- Results are merged and deduplicated, preserving order

### Custom prompt template

Override the default prompt to control how step-back questions are generated:

```python
step_back = StepBackRetriever(
    retriever=retriever,
    llm=llm,
    prompt_template="Given this question, ask a more general version:\n{query}",
)
```

The template must include `{query}` as a placeholder for the user's question.
## FLARE (Forward-Looking Active REtrieval)

The `FLARERetriever` implements an iterative retrieve-generate-retrieve loop. It generates an answer, identifies parts that need more information (marked with `[SEARCH: ...]`), retrieves for those sub-queries, and regenerates — repeating until no more search markers appear or `max_iterations` is reached.

```python
from synapsekit import FLARERetriever

flare = FLARERetriever(
    retriever=retriever,
    llm=llm,
    max_iterations=3,
)
results = await flare.retrieve("Explain the history of quantum computing", top_k=5)
```

The process:

- Initial retrieval for the original query
- The LLM generates an answer, inserting `[SEARCH: sub-query]` markers where it needs more information
- Sub-queries are extracted from the markers
- If no markers are found, the current documents are returned
- New retrieval is performed for each sub-query
- Results are merged, deduplicated, and the process repeats (up to `max_iterations`)
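Marker extraction can be sketched with a regular expression. The exact marker grammar is an assumption based on the `[SEARCH: ...]` format shown above, and `extract_search_queries` is an illustrative helper, not a library function:

```python
import re

def extract_search_queries(answer: str):
    """Pull the sub-queries out of [SEARCH: ...] markers in a draft answer."""
    return re.findall(r"\[SEARCH:\s*(.*?)\]", answer)

draft = (
    "Quantum computing began in the 1980s [SEARCH: who proposed quantum computing]. "
    "Modern machines use superconducting qubits [SEARCH: superconducting qubit vendors]."
)
print(extract_search_queries(draft))
# ['who proposed quantum computing', 'superconducting qubit vendors']
```

An empty result list is the loop's termination signal: no markers means the LLM considered the answer complete.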
## HyDE (Hypothetical Document Embeddings)

The `HyDERetriever` generates a hypothetical answer to the query using an LLM, then uses that hypothetical answer as the search query. This often improves retrieval for complex or abstract questions because the hypothetical answer is closer in embedding space to relevant documents than the original question is.

```python
from synapsekit import HyDERetriever

hyde = HyDERetriever(
    retriever=retriever,
    llm=llm,
)
results = await hyde.retrieve("What is quantum entanglement?", top_k=5)
```

The process:

- The LLM generates a hypothetical passage that would answer the query
- The hypothetical passage is used as the search query instead of the original question
- Results are retrieved using the hypothetical passage, which is often closer to relevant documents in embedding space

### Custom prompt template

Override the default prompt to control how hypothetical answers are generated:

```python
hyde = HyDERetriever(
    retriever=retriever,
    llm=llm,
    prompt_template="Write a short paragraph answering: {query}",
)
```

The template must include `{query}` as a placeholder for the user's question.
## Hybrid Search Retrieval

The `HybridSearchRetriever` combines BM25 keyword matching with vector similarity using Reciprocal Rank Fusion (RRF). This gives you the best of both sparse (keyword) and dense (vector) retrieval.

```python
from synapsekit import HybridSearchRetriever

hybrid = HybridSearchRetriever(
    retriever=retriever,
    bm25_weight=0.5,
    vector_weight=0.5,
    rrf_k=60,
)

# Build the BM25 index from your documents
hybrid.add_documents(["doc one text...", "doc two text...", "doc three text..."])

# Retrieve — fuses BM25 and vector results via RRF
results = await hybrid.retrieve("search query", top_k=5)
```

The process:

- Vector retrieval via the base retriever
- BM25 scoring on the indexed documents
- RRF fusion: `score = weight / (rrf_k + rank + 1)` for both result sets
- Results are sorted by fused score and deduplicated

Uses the existing `rank-bm25` hard dependency — no extra install needed.
## Self-RAG (Self-Reflective RAG)

The `SelfRAGRetriever` implements a self-reflective retrieval loop: retrieve candidates, grade each for relevance, generate an answer, check whether the documents support the answer, and retry with a rewritten query if not.

```python
from synapsekit import SelfRAGRetriever

self_rag = SelfRAGRetriever(
    retriever=retriever,
    llm=llm,
    max_iterations=2,
    relevance_threshold=0.5,
)
results = await self_rag.retrieve("What is quantum computing?", top_k=5)
```

The process:

- Retrieve candidates using the base retriever
- The LLM grades each document as "relevant" or "irrelevant"
- The LLM generates an answer from the relevant documents
- The LLM checks whether the answer is "fully", "partially", or "not" supported
- If not fully supported, the query is rewritten and the process repeats

### Inspecting reflection metadata

```python
results, meta = await self_rag.retrieve_with_reflection("query", top_k=5)
print(meta["iterations"])     # Number of iterations performed
print(meta["support_level"])  # "fully", "partially", or "not"
```
## Adaptive RAG

The `AdaptiveRAGRetriever` uses an LLM to classify query complexity (simple/moderate/complex) and routes each query to a different retrieval strategy accordingly.

```python
from synapsekit import AdaptiveRAGRetriever

adaptive = AdaptiveRAGRetriever(
    llm=llm,
    simple_retriever=basic_retriever,
    moderate_retriever=fusion_retriever,
    complex_retriever=multi_step_retriever,
)
results = await adaptive.retrieve("What is 2+2?")  # → routed to simple
results = await adaptive.retrieve("Compare quantum and classical computing for ML")  # → routed to complex
```

The process:

- The LLM classifies the query as "simple", "moderate", or "complex"
- The query is routed to the corresponding retriever
- Fallback: if `moderate_retriever` is not provided, `simple_retriever` is used; if `complex_retriever` is not provided, `moderate_retriever` is used

### Inspecting classification

```python
results, classification = await adaptive.retrieve_with_classification("query")
print(classification)  # "simple", "moderate", or "complex"
```
## GraphRAG (Knowledge Graph Retrieval)

The `GraphRAGRetriever` combines knowledge graph traversal with vector retrieval. It extracts entities from the query, traverses a knowledge graph to find related documents, and merges those with standard vector retrieval results.

```python
from synapsekit import GraphRAGRetriever, KnowledgeGraph

# Build a knowledge graph
kg = KnowledgeGraph()
kg.add_triple("Python", "is_a", "programming language")
kg.add_triple("Python", "used_for", "machine learning")
kg.add_document_link("Python", "doc_1")
kg.add_document_link("machine learning", "doc_2")

# Or build from documents using an LLM
await kg.build_from_documents(["Python is a programming language used for ML..."], llm)

# Combine with vector retrieval
graphrag = GraphRAGRetriever(
    retriever=retriever,
    llm=llm,
    knowledge_graph=kg,
    max_hops=2,
)
results = await graphrag.retrieve("What is Python used for?", top_k=5)
```

The process:

- The LLM extracts entities from the query
- The knowledge graph is traversed up to `max_hops` from each entity
- Related documents are gathered from the graph
- Standard vector retrieval runs in parallel
- Results are merged and deduplicated
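The traversal step can be sketched as a breadth-first walk over adjacency and document-link maps. These dict structures are assumptions for illustration, not `KnowledgeGraph`'s actual internals:

```python
from collections import deque

def traverse(neighbors, doc_links, entities, max_hops=2):
    """Collect documents linked to any node within max_hops of the seed entities.

    neighbors -- dict mapping a node to the nodes it shares a triple with
    doc_links -- dict mapping a node to its linked document IDs
    """
    seen = set(entities)
    docs = set()
    frontier = deque((entity, 0) for entity in entities)
    while frontier:
        node, depth = frontier.popleft()
        docs.update(doc_links.get(node, []))
        if depth < max_hops:
            for nxt in neighbors.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return docs

# Mirrors the example graph above: "Python" links doc_1 directly, and its
# one-hop neighbour "machine learning" contributes doc_2.
neighbors = {"Python": ["programming language", "machine learning"]}
doc_links = {"Python": ["doc_1"], "machine learning": ["doc_2"]}
print(sorted(traverse(neighbors, doc_links, ["Python"])))  # ['doc_1', 'doc_2']
```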
### Inspecting graph metadata

```python
results, meta = await graphrag.retrieve_with_graph("query", top_k=5)
print(meta["entities_extracted"])  # Entities found in the query
print(meta["graph_docs"])          # Documents from graph traversal
print(meta["traversal_hops"])      # Max hops used
```
## Multi-Step Retrieval

The `MultiStepRetriever` performs iterative retrieval-generation: retrieve documents, generate an answer, identify information gaps, retrieve for those gaps, and repeat until the answer is complete or `max_steps` is reached.

```python
from synapsekit import MultiStepRetriever

ms = MultiStepRetriever(
    retriever=retriever,
    llm=llm,
    max_steps=3,
)
results = await ms.retrieve("What is the history and future of quantum computing?", top_k=5)
```

The process:

- Initial retrieval for the original query
- The LLM generates an answer from the retrieved documents
- The LLM identifies gaps — it returns search queries for missing information, or "COMPLETE" if done
- Gap queries are used for additional retrieval
- New documents are added (deduplicated) and the process repeats

### Inspecting the step trace

```python
results, trace = await ms.retrieve_with_steps("query")
for step in trace:
    print(step["step"], step["query"], step["new_docs"])
# step 0: initial query, N new docs
# step 1: ["gap query 1", "gap query 2"], M new docs
# step 2: None, 0 new docs, complete=True
```
## Parameters

### Retriever

| Parameter | Default | Description |
|---|---|---|
| `top_k` | 4 | Number of chunks to return |
| `use_bm25` | False | Enable BM25 reranking |
| `bm25_weight` | 0.3 | Weight for the BM25 score in hybrid ranking |
| `metadata_filter` | None | Filter by metadata key-value pairs |

### HyDERetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for generating hypothetical answers |
| `prompt_template` | built-in | Custom prompt (must include `{query}`) |

### SelfQueryRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for query decomposition |
| `metadata_fields` | — | List of metadata field names the LLM can filter on |
| `prompt` | built-in | Custom decomposition prompt |

### ParentDocumentRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `chunk_size` | 200 | Characters per chunk |
| `chunk_overlap` | 50 | Overlap between chunks |

### CrossEncoderReranker

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `model` | "cross-encoder/ms-marco-MiniLM-L-6-v2" | Cross-encoder model name |
| `fetch_k` | 20 | Number of initial candidates to retrieve |

### CRAGRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for grading and query rewriting |
| `relevance_threshold` | 0.5 | Minimum fraction of docs that must be relevant |
| `max_retries` | 1 | Max query rewrites before returning what we have |

### QueryDecompositionRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for query decomposition |
| `num_sub_queries` | 3 | Number of sub-queries to generate |

### ContextualCompressionRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for document compression |
| `fetch_k` | 10 | Number of candidates to retrieve before compression |

### EnsembleRetriever

| Parameter | Default | Description |
|---|---|---|
| `retrievers` | — | List of `Retriever` instances |
| `weights` | equal | Weight for each retriever in RRF scoring |
| `rrf_k` | 60 | RRF constant (higher = less aggressive reranking) |

### CohereReranker

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `model` | "rerank-v3.5" | Cohere rerank model name |
| `api_key` | None | Cohere API key (falls back to the `CO_API_KEY` env var) |
| `fetch_k` | 20 | Number of initial candidates to retrieve |

### StepBackRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for generating step-back questions |
| `prompt_template` | built-in | Custom prompt (must include `{query}`) |

### FLARERetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for answer generation |
| `max_iterations` | 3 | Maximum generate-retrieve cycles |
| `generate_prompt` | built-in | Prompt for initial answer generation |
| `regenerate_prompt` | built-in | Prompt for regeneration with new context |

### HybridSearchRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `bm25_weight` | 0.5 | Weight for BM25 scores in RRF fusion |
| `vector_weight` | 0.5 | Weight for vector scores in RRF fusion |
| `rrf_k` | 60 | RRF constant (higher = less aggressive reranking) |

### SelfRAGRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for grading, generation, and support checking |
| `max_iterations` | 2 | Max retrieve-grade-generate-check cycles |
| `relevance_threshold` | 0.5 | Minimum fraction of docs that must be graded relevant |

### AdaptiveRAGRetriever

| Parameter | Default | Description |
|---|---|---|
| `llm` | — | LLM for query classification |
| `simple_retriever` | — | Retriever for simple queries |
| `moderate_retriever` | None | Retriever for moderate queries (falls back to simple) |
| `complex_retriever` | None | Retriever for complex queries (falls back to moderate) |
| `classify_prompt` | built-in | Custom classification prompt |

### MultiStepRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for answer generation and gap identification |
| `max_steps` | 3 | Maximum retrieval-generation iterations |

### GraphRAGRetriever

| Parameter | Default | Description |
|---|---|---|
| `retriever` | — | Base `Retriever` instance |
| `llm` | — | LLM for entity extraction |
| `knowledge_graph` | None | `KnowledgeGraph` instance (falls back to vector-only if None) |
| `max_hops` | 2 | Maximum graph traversal hops from extracted entities |