Examples

Real-world EvalCI workflow configurations and eval case patterns.

RAG pipeline quality gate

The most common use case: gate a pull request on relevancy and faithfulness scores for a RAG pipeline.

# .github/workflows/eval.yml
name: EvalCI

on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: SynapseKit/evalci@v1
        with:
          path: tests/evals
          threshold: "0.80"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# tests/evals/test_rag.py
from synapsekit import eval_case, RelevancyMetric, FaithfulnessMetric
from myapp.rag import pipeline, llm

@eval_case(min_score=0.80, max_cost_usd=0.02, tags=["rag"])
async def test_rag_relevancy():
    result = await pipeline.ask("What is the return policy?")
    metric = RelevancyMetric(llm=llm)
    score = await metric.score(
        question="What is the return policy?",
        answer=result.answer,
    )
    return {"score": score, "cost_usd": result.cost_usd, "latency_ms": result.latency_ms}

@eval_case(min_score=0.75, tags=["rag"])
async def test_rag_faithfulness():
    result = await pipeline.ask("What are the shipping options?")
    metric = FaithfulnessMetric(llm=llm)
    score = await metric.score(
        question="What are the shipping options?",
        answer=result.answer,
        contexts=result.source_documents,
    )
    return {"score": score, "cost_usd": result.cost_usd}
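
The thresholds passed to `@eval_case` can be read as a per-case gate. The sketch below shows one plausible reading, assuming a case fails when its score is below `min_score` or when a reported cost or latency exceeds its cap. `case_passes` is a hypothetical helper written for illustration, not part of synapsekit, and EvalCI's real logic may differ.

```python
from typing import Optional


def case_passes(
    result: dict,
    min_score: float,
    max_cost_usd: Optional[float] = None,
    max_latency_ms: Optional[float] = None,
) -> bool:
    """Hypothetical gate check; an assumption, not EvalCI's implementation."""
    if result["score"] < min_score:
        return False
    # A missing measurement is treated as 0, so it never trips a cap.
    if max_cost_usd is not None and result.get("cost_usd", 0.0) > max_cost_usd:
        return False
    if max_latency_ms is not None and result.get("latency_ms", 0.0) > max_latency_ms:
        return False
    return True


# Same thresholds as test_rag_relevancy above; the result values are made up.
within_budget = {"score": 0.84, "cost_usd": 0.011, "latency_ms": 950}
over_budget = {"score": 0.84, "cost_usd": 0.031, "latency_ms": 950}
print(case_passes(within_budget, min_score=0.80, max_cost_usd=0.02))  # True
print(case_passes(over_budget, min_score=0.80, max_cost_usd=0.02))    # False
```

Under this reading, a case with a good score can still fail the gate on cost alone, which is why returning `cost_usd` and `latency_ms` from the eval case matters.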

Multi-provider setup

Run evals across OpenAI and Anthropic in one workflow:

- uses: SynapseKit/evalci@v1
  with:
    path: tests/evals
    threshold: "0.75"
    extras: "openai,anthropic"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

# tests/evals/test_multi_provider.py
import os
from synapsekit import eval_case, OpenAILLM, AnthropicLLM, RelevancyMetric

question = "Summarise the benefits of retrieval-augmented generation."

@eval_case(min_score=0.80, tags=["openai"])
async def test_openai_summary():
    llm = OpenAILLM(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
    answer = await llm.generate(question)
    metric = RelevancyMetric(llm=llm)
    score = await metric.score(question=question, answer=answer)
    return {"score": score}

@eval_case(min_score=0.80, tags=["anthropic"])
async def test_anthropic_summary():
    llm = AnthropicLLM(model="claude-haiku-4-5-20251001", api_key=os.environ["ANTHROPIC_API_KEY"])
    answer = await llm.generate(question)
    metric = RelevancyMetric(llm=llm)
    score = await metric.score(question=question, answer=answer)
    return {"score": score}
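
With several providers, a missing secret surfaces as a bare `KeyError` from `os.environ[...]`, which can be hard to spot in CI logs. A small wrapper can make the failure explicit. `require_env` is a hypothetical helper for illustration, not a synapsekit function.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Empty or whitespace-only values are treated the same as unset.
        raise RuntimeError(
            f"{name} is not set; add it to the workflow's env block or export it locally."
        )
    return value
```

In the eval cases above it would replace the direct lookup, for example `api_key=require_env("ANTHROPIC_API_KEY")`.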

Using outputs in downstream steps

Fail the job with a custom message, or use scores to trigger notifications:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: SynapseKit/evalci@v1
        id: eval
        with:
          path: tests/evals
          threshold: "0.80"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Fail with summary
        if: failure()
        run: |
          echo "EvalCI failed: ${{ steps.eval.outputs.passed }}/${{ steps.eval.outputs.total }} passed"
          echo "Mean score: ${{ steps.eval.outputs.mean-score }}"
          exit 1

      - name: Post to Slack on pass
        if: success()
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d "{\"text\":\"✅ EvalCI passed: ${{ steps.eval.outputs.passed }}/${{ steps.eval.outputs.total }} cases, mean score ${{ steps.eval.outputs.mean-score }}\"}"
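
Hand-escaped JSON inside `curl -d` is easy to break if a value ever contains a quote or backslash. As an alternative, the payload could be built with Python's `json` module. This is a minimal sketch: `build_slack_payload` is a hypothetical helper, and it assumes the three step outputs are handed to the script as strings (for instance through environment variables).

```python
import json


def build_slack_payload(passed: str, total: str, mean_score: str) -> str:
    """Serialize the Slack message; json.dumps handles all quoting and escaping."""
    text = f"EvalCI passed: {passed}/{total} cases, mean score {mean_score}"
    return json.dumps({"text": text})


# Illustrative values only; in CI these would come from the eval step's outputs.
print(build_slack_payload("12", "12", "0.87"))
```

The resulting string can be posted to the webhook with any HTTP client, including the same `curl` call with the payload read from standard input.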

Pinned version + strict threshold

For production branches, pin the synapsekit version and tighten the threshold:

name: EvalCI (strict)

on:
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: SynapseKit/evalci@v1
        with:
          path: tests/evals
          threshold: "0.90"
          synapsekit-version: "1.5.2"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Local run before pushing

Test your eval cases locally before opening a PR:

# Install synapsekit with the OpenAI extra (quoted so shells like zsh don't expand the brackets)
pip install "synapsekit[openai]"

# Run all eval cases
synapsekit test tests/evals/ --threshold 0.80

# Run only RAG evals
synapsekit test tests/evals/ --tag rag --threshold 0.80

# Output as JSON
synapsekit test tests/evals/ --format json --threshold 0.80

Agent evaluation

Score an agent's responses with a simple keyword-coverage heuristic, and cap latency with max_latency_ms:

- uses: SynapseKit/evalci@v1
  with:
    path: tests/evals/agents
    threshold: "0.75"
    extras: "openai"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# tests/evals/agents/test_support_agent.py
from synapsekit import eval_case
from myapp.agents import support_agent

@eval_case(min_score=0.85, max_latency_ms=8000, tags=["agent"])
async def test_password_reset_response():
    response = await support_agent.run("How do I reset my password?")
    keywords = ["reset", "password", "email", "link", "account"]
    score = sum(1 for k in keywords if k in response.content.lower()) / len(keywords)
    return {"score": score, "latency_ms": response.latency_ms}

@eval_case(min_score=0.80, tags=["agent"])
async def test_refund_policy_response():
    response = await support_agent.run("What is your refund policy?")
    keywords = ["refund", "days", "policy", "contact", "eligible"]
    score = sum(1 for k in keywords if k in response.content.lower()) / len(keywords)
    return {"score": score, "cost_usd": response.cost_usd}
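
Both agent cases compute the same keyword-coverage score inline. The calculation could be factored into a shared helper, as sketched below. `keyword_coverage` is a hypothetical name for illustration and keeps the same case-insensitive substring matching as the cases above.

```python
from typing import Iterable


def keyword_coverage(text: str, keywords: Iterable[str]) -> float:
    """Fraction of keywords that occur as substrings of text, case-insensitively."""
    wanted = [k.lower() for k in keywords]
    if not wanted:
        raise ValueError("keywords must not be empty")
    haystack = text.lower()
    return sum(1 for k in wanted if k in haystack) / len(wanted)


# Made-up agent reply: "password" is missing, the other four keywords match.
reply = "We emailed a reset link to the address on your account."
print(keyword_coverage(reply, ["reset", "password", "email", "link", "account"]))  # 0.8
```

Because matching is by substring, "email" also matches "emailed". That is lenient by design, but it means a short keyword can match inside an unrelated word, so keywords are best kept specific.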
