Evaluation

SynapseKit includes built-in evaluation metrics for measuring the quality of RAG and LLM outputs. Inspired by RAGAS, these metrics help you quantify faithfulness, relevancy, and groundedness.

Metrics

FaithfulnessMetric

Measures whether the generated answer is faithful to the source documents (i.e., does not hallucinate):

from synapsekit import FaithfulnessMetric

# `llm` is an LLM client you have configured earlier
metric = FaithfulnessMetric(llm=llm)

score = await metric.score(
    question="What is Python?",
    answer="Python is a compiled language created in 1991.",
    contexts=["Python is an interpreted language created by Guido van Rossum in 1991."],
)
# score → 0.5 (partially faithful: creation date correct, but "compiled" is wrong)
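The 0.5 above can be understood as a claim-level ratio: RAGAS-style faithfulness typically splits the answer into atomic claims and scores the fraction supported by the context. A minimal sketch in plain Python, with the claim labels hand-assigned for illustration (a real metric would use the LLM to extract and verify claims):

```python
# The answer decomposes into two atomic claims; the context supports one
# and contradicts the other, so the faithfulness ratio is 1/2 = 0.5.
claims = [
    ("Python is a compiled language", False),  # contradicted: context says "interpreted"
    ("Python was created in 1991", True),      # supported by the context
]

supported = sum(1 for _, ok in claims if ok)
score = supported / len(claims)
print(score)  # 0.5
```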

RelevancyMetric

Measures how relevant the answer is to the question asked:

from synapsekit import RelevancyMetric

metric = RelevancyMetric(llm=llm)

score = await metric.score(
    question="What is the capital of France?",
    answer="Paris is the capital and largest city of France.",
)
# score → 1.0 (highly relevant)
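RAGAS-style answer relevancy is commonly computed by asking an LLM to generate questions the answer would address, then comparing those to the original question. A toy sketch of just the comparison step, using bag-of-words cosine similarity (the `generated` question is hand-written here; a real implementation would use the LLM plus dense embeddings):

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector: lowercase, strip trailing punctuation, split."""
    return Counter(text.lower().rstrip("?. ").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

original = "What is the capital of France?"
# A plausible LLM-generated question for the answer above (hard-coded here)
generated = "What is the capital of France"

score = cosine(bow(original), bow(generated))
# score close to 1.0 for near-identical questions
```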

GroundednessMetric

Measures how well the answer is grounded in the retrieved context:

from synapsekit import GroundednessMetric

metric = GroundednessMetric(llm=llm)

score = await metric.score(
    answer="SynapseKit supports 15 LLM providers.",
    contexts=["SynapseKit supports 15 LLM providers including OpenAI, Anthropic, and Gemini."],
)
# score → 1.0 (fully grounded)

EvaluationPipeline

Run multiple metrics over a dataset in one call:

from synapsekit import EvaluationPipeline, FaithfulnessMetric, RelevancyMetric, GroundednessMetric

pipeline = EvaluationPipeline(
    metrics=[
        FaithfulnessMetric(llm=llm),
        RelevancyMetric(llm=llm),
        GroundednessMetric(llm=llm),
    ],
)

results = await pipeline.evaluate(
    questions=["What is RAG?", "How does SynapseKit work?"],
    answers=["RAG is retrieval-augmented generation.", "SynapseKit is a Python framework."],
    contexts=[
        ["RAG combines retrieval with generation for grounded answers."],
        ["SynapseKit is a Python library for building LLM applications."],
    ],
)

for r in results:
    print(r)
    # EvaluationResult(faithfulness=0.95, relevancy=0.90, groundedness=0.88, mean_score=0.91)
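Conceptually, a pipeline like this runs every metric on each (question, answer, contexts) row and collects the scores. A synchronous sketch with stub metrics (the classes and `evaluate` function below are stand-ins for illustration, not SynapseKit's implementation):

```python
class StubMetric:
    """Stand-in metric that returns a fixed score instead of calling an LLM judge."""
    def __init__(self, name: str, fixed_score: float):
        self.name = name
        self.fixed_score = fixed_score

    def score(self, question, answer, contexts) -> float:
        return self.fixed_score

def evaluate(metrics, questions, answers, contexts):
    """Run every metric on each row and attach a mean_score per row."""
    results = []
    for q, a, ctx in zip(questions, answers, contexts):
        row = {m.name: m.score(q, a, ctx) for m in metrics}
        row["mean_score"] = sum(row.values()) / len(row)
        results.append(row)
    return results

metrics = [StubMetric("faithfulness", 0.95), StubMetric("relevancy", 0.90)]
results = evaluate(
    metrics,
    questions=["What is RAG?"],
    answers=["RAG is retrieval-augmented generation."],
    contexts=[["RAG combines retrieval with generation."]],
)
print(results[0]["mean_score"])  # 0.925
```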

EvaluationResult

Each result contains per-metric scores and a convenience mean_score:

result = results[0]

print(result.faithfulness) # 0.95
print(result.relevancy) # 0.90
print(result.groundedness) # 0.88
print(result.mean_score) # 0.91
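For reporting, you often want dataset-level averages rather than per-row scores. A small sketch of that aggregation, using plain dicts as stand-ins for EvaluationResult objects:

```python
# Stand-ins for EvaluationResult objects (plain dicts for illustration)
results = [
    {"faithfulness": 0.95, "relevancy": 0.90, "groundedness": 0.88},
    {"faithfulness": 0.80, "relevancy": 1.00, "groundedness": 0.92},
]

# Average each metric across the whole dataset
averages = {
    metric: sum(r[metric] for r in results) / len(results)
    for metric in results[0]
}
# e.g. averages["faithfulness"] == 0.875
```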