Evaluation
SynapseKit includes built-in evaluation metrics for measuring the quality of RAG and LLM outputs. Inspired by RAGAS-style evaluation, these metrics help you quantify faithfulness, relevancy, and groundedness.
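The metric examples in this section assume an LLM client is already bound to llm, since each metric uses an LLM as a judge. The setup below is only a placeholder; the import path and constructor arguments are assumptions, so substitute whichever provider wrapper you actually use (see LLM Overview).

```python
# Placeholder setup: the import and constructor shown here are assumed, not
# SynapseKit's confirmed API. Use the provider wrapper documented in LLM Overview.
from synapsekit import LLM  # hypothetical import

llm = LLM(provider="openai", model="gpt-4o-mini")  # judge model passed to each metric below
```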
Metrics
FaithfulnessMetric
Measures whether the generated answer is faithful to the source documents (i.e., does not hallucinate):
from synapsekit import FaithfulnessMetric
metric = FaithfulnessMetric(llm=llm)
score = await metric.score(
    question="What is Python?",
    answer="Python is a compiled language created in 1991.",
    contexts=["Python is an interpreted language created by Guido van Rossum in 1991."],
)
# score → 0.5 (partially faithful — creation date correct, but "compiled" is wrong)
RelevancyMetric
Measures how relevant the answer is to the question asked:
from synapsekit import RelevancyMetric
metric = RelevancyMetric(llm=llm)
score = await metric.score(
    question="What is the capital of France?",
    answer="Paris is the capital and largest city of France.",
)
# score → 1.0 (highly relevant)
GroundednessMetric
Measures how well the answer is grounded in the retrieved context:
from synapsekit import GroundednessMetric
metric = GroundednessMetric(llm=llm)
score = await metric.score(
    answer="SynapseKit supports 31 LLM providers.",
    contexts=["SynapseKit supports 31 LLM providers including OpenAI, Anthropic, and Gemini."],
)
# score → 1.0 (fully grounded)
EvaluationPipeline
Run multiple metrics over a dataset in one call:
from synapsekit import EvaluationPipeline, FaithfulnessMetric, RelevancyMetric, GroundednessMetric
pipeline = EvaluationPipeline(
    metrics=[
        FaithfulnessMetric(llm=llm),
        RelevancyMetric(llm=llm),
        GroundednessMetric(llm=llm),
    ],
)
results = await pipeline.evaluate(
    questions=["What is RAG?", "How does SynapseKit work?"],
    answers=["RAG is retrieval-augmented generation.", "SynapseKit is a Python framework."],
    contexts=[
        ["RAG combines retrieval with generation for grounded answers."],
        ["SynapseKit is a Python library for building LLM applications."],
    ],
)
for r in results:
    print(r)
# EvaluationResult(faithfulness=0.95, relevancy=0.90, groundedness=0.88, mean_score=0.91)
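In practice you would build these parallel lists from your own pipeline runs rather than hard-coding them. A minimal sketch, assuming a hypothetical rag.query method that returns the generated answer together with its retrieved context chunks (the method name and return shape are illustrative, not SynapseKit's API):

```python
# Hypothetical glue code: rag.query and its .answer / .contexts attributes are assumptions.
questions = ["What is RAG?", "How does SynapseKit work?"]
answers, contexts = [], []
for q in questions:
    response = await rag.query(q)       # run your RAG pipeline for each question
    answers.append(response.answer)     # generated answer (str)
    contexts.append(response.contexts)  # retrieved chunks for this question (list[str])

results = await pipeline.evaluate(questions=questions, answers=answers, contexts=contexts)
```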
EvaluationResult
Each result contains per-metric scores and a convenience mean_score:
result = results[0]
print(result.faithfulness) # 0.95
print(result.relevancy) # 0.90
print(result.groundedness) # 0.88
print(result.mean_score) # 0.91
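mean_score appears to be the unweighted arithmetic mean of the scores produced by the configured metrics; the quick check below assumes that definition.

```python
# Assumes mean_score is a plain average of the configured metrics' scores.
scores = [result.faithfulness, result.relevancy, result.groundedness]
assert abs(result.mean_score - sum(scores) / len(scores)) < 0.01  # (0.95 + 0.90 + 0.88) / 3 ≈ 0.91
```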
@eval_case decorator
Define evaluation test cases with quality, cost, and latency bounds. Decorated cases work with both the synapsekit test CLI and pytest.
from synapsekit import eval_case
@eval_case(min_score=0.8, max_cost_usd=0.05, max_latency_ms=2000, tags=["qa", "rag"])
async def eval_summarization():
    # Run your pipeline, measure quality
    return {"score": 0.85, "cost_usd": 0.02, "latency_ms": 1200}
@eval_case(min_score=0.9)
def eval_retrieval():
    return {"score": 0.92}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_score | float \| None | None | Minimum acceptable score (0.0-1.0) |
| max_cost_usd | float \| None | None | Maximum acceptable cost in USD |
| max_latency_ms | float \| None | None | Maximum acceptable latency in ms |
| tags | list[str] | [] | Tags for filtering and grouping |
Return value
Decorated functions must return a dict with any of these keys:
| Key | Type | Description |
|---|---|---|
| score | float | Quality score (checked against min_score) |
| cost_usd | float | Cost in USD (checked against max_cost_usd) |
| latency_ms | float | Latency (auto-measured if omitted, checked against max_latency_ms) |
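For intuition, the pass/fail semantics of these bounds are roughly the sketch below. This is our illustration of the documented contract, not SynapseKit's actual runner code:

```python
# Hypothetical bounds check mirroring the documented semantics of @eval_case.
def passes(result: dict, meta) -> bool:
    if meta.min_score is not None and result.get("score", 0.0) < meta.min_score:
        return False  # quality below the floor
    if meta.max_cost_usd is not None and result.get("cost_usd", 0.0) > meta.max_cost_usd:
        return False  # too expensive
    if meta.max_latency_ms is not None and result.get("latency_ms", 0.0) > meta.max_latency_ms:
        return False  # too slow
    return True
```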
EvalCaseMeta
The decorator attaches an EvalCaseMeta dataclass to the function as _eval_case_meta:
meta = eval_summarization._eval_case_meta
print(meta.min_score) # 0.8
print(meta.max_cost_usd) # 0.05
print(meta.tags) # ["qa", "rag"]
Running eval cases
Use the synapsekit test CLI to discover and run eval cases:
synapsekit test tests/evals/ --threshold 0.8
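Because @eval_case also works under pytest, an eval case can be wrapped in an ordinary test. The sketch below assumes the decorator leaves the function directly callable and relies only on the documented _eval_case_meta attribute; the assertion logic is ours, not SynapseKit's:

```python
import pytest

@pytest.mark.asyncio  # async eval cases need pytest-asyncio or an equivalent plugin
async def test_summarization_meets_bounds():
    result = await eval_summarization()        # the @eval_case function defined above
    meta = eval_summarization._eval_case_meta
    assert result["score"] >= meta.min_score
    if meta.max_cost_usd is not None:
        assert result["cost_usd"] <= meta.max_cost_usd
```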
EvalRegression
Snapshot-based regression detection for CI pipelines. Save eval results as named snapshots, then compare against baselines to catch regressions.
from synapsekit import EvalRegression
regression = EvalRegression(store_dir=".synapsekit_evals")
# Save a snapshot after running evals
results = [{"name": "qa_test", "score": 0.85, "cost_usd": 0.02, "latency_ms": 1200}]
regression.save_snapshot("v1.0", results)
# Later, compare against baseline
new_results = [{"name": "qa_test", "score": 0.80, "cost_usd": 0.03, "latency_ms": 1500}]
regression.save_snapshot("v1.1", new_results)
report = regression.compare("v1.0", "v1.1")
print(report.has_regressions) # True — score dropped 5%
for delta in report.deltas:
    if delta.regressed:
        print(f"{delta.case_name}.{delta.metric}: {delta.baseline} → {delta.current}")
Default thresholds
| Metric | Threshold | Description |
|---|---|---|
| score | -0.02 (2% drop) | Quality score regression |
| cost_usd | +0.10 (10% increase) | Cost regression |
| latency_ms | +0.20 (20% increase) | Latency regression |
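These thresholds read as fractional changes relative to the baseline (for example, -0.02 allows the score to fall by at most 2% of its baseline value). The comparison below is a sketch under that interpretation, not a description of SynapseKit's internals:

```python
# Hypothetical regression test under the "relative change" reading of thresholds.
def regressed(baseline: float, current: float, threshold: float) -> bool:
    change = (current - baseline) / baseline   # e.g. (0.80 - 0.85) / 0.85 ≈ -0.059
    if threshold < 0:                          # a drop is bad (score)
        return change < threshold
    return change > threshold                  # an increase is bad (cost_usd, latency_ms)
```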
Custom thresholds
report = regression.compare("v1.0", "v1.1", thresholds={
    "score": -0.05,        # Allow up to 5% score drop
    "cost_usd": 0.20,      # Allow up to 20% cost increase
    "latency_ms": 0.50,    # Allow up to 50% latency increase
})
CLI integration
Use synapsekit test with regression flags for CI gates:
# Save results as a named snapshot
synapsekit test tests/evals/ --save v1.0
# Compare against baseline and fail on regression
synapsekit test tests/evals/ --compare v1.0 --fail-on-regression
# Custom snapshot directory
synapsekit test tests/evals/ --save v1.0 --snapshot-dir ./my_evals
See synapsekit test for the full CLI reference.
Managing snapshots
# List saved snapshots
snapshots = regression.list_snapshots()
# ["v1.0", "v1.1"]
# Load a specific snapshot
snapshot = regression.load_snapshot("v1.0")
print(snapshot.name, snapshot.timestamp, len(snapshot.results))
See also
- synapsekit test — CLI reference for running eval suites
- Cost Tracker — track cost per eval run
- RAG Pipeline — the pipeline you are evaluating
- Retriever — retrieval strategies that affect faithfulness and groundedness
- LLM Overview — choosing models for evaluation