
EvalCI — LLM Quality Gates

EvalCI is a GitHub Action that runs your @eval_case suites on every pull request and blocks merge if quality drops below threshold.

No infrastructure. No backend. 2-minute setup.

```yaml
- uses: SynapseKit/evalci@v1
  with:
    path: tests/evals
    threshold: "0.80"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

That's it. EvalCI installs SynapseKit into the runner, discovers your eval cases, runs them, posts a results table as a PR comment, and fails the check if any case falls below threshold.
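For context, here is what that step looks like inside a complete workflow file. The filename, trigger, and checkout step are assumptions for illustration; only the `SynapseKit/evalci@v1` step itself comes from the snippet above.

```yaml
# .github/workflows/evalci.yml — minimal sketch; job layout is an assumption.
name: EvalCI
on: pull_request

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: SynapseKit/evalci@v1
        with:
          path: tests/evals
          threshold: "0.80"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```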


Why EvalCI

LLM applications degrade silently. A prompt change, a model update, a retrieval tweak — any of these can drop quality by 10–20% without a single test failure. EvalCI gives you a quality gate that catches this before it ships.

| Without EvalCI | With EvalCI |
| --- | --- |
| Quality regressions ship to production | Blocked at PR review |
| Manual eval runs, inconsistent | Automatic on every PR |
| No visibility into cost/latency trends | Score, cost, latency per case on every PR |
| Requires external tooling (LangSmith, etc.) | Works in your existing GitHub Actions |

How it works

```
Your PR opens
      │
      ▼
EvalCI Action runs
  ├─ pip install synapsekit[{extras}]
  ├─ synapsekit test {path} --format json --threshold {threshold}
  │    ├─ Discovers all @eval_case functions
  │    ├─ Runs each case, measures score / cost / latency
  │    └─ Outputs JSON results
  ├─ Parses results
  ├─ Posts PR comment with results table
  ├─ Sets Action outputs: passed, failed, total, mean-score
  └─ Exit 0 (all pass) or 1 (any failure)
```
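The parse-and-gate steps at the end of this flow can be sketched in a few lines of Python. The JSON field names below are assumptions for illustration, not SynapseKit's documented output schema; the gating rule (exit 0 only when every case clears the threshold) mirrors the diagram.

```python
import json

# Hypothetical results blob; the real `synapsekit test --format json`
# schema may differ — field names here are assumptions.
RESULTS_JSON = """
{
  "cases": [
    {"name": "test_rag_relevancy",    "score": 0.850, "cost": 0.0050, "latency_ms": 1200},
    {"name": "test_rag_faithfulness", "score": 0.650, "cost": 0.0120, "latency_ms": 2500}
  ]
}
"""

def gate(results_json: str, threshold: float):
    """Return (passed, failed, total, mean_score, exit_code) for a results blob."""
    cases = json.loads(results_json)["cases"]
    passed = sum(1 for c in cases if c["score"] >= threshold)
    failed = len(cases) - passed
    mean_score = sum(c["score"] for c in cases) / len(cases)
    # Exit 0 only when every case clears the threshold, as in the diagram.
    exit_code = 0 if failed == 0 else 1
    return passed, failed, len(cases), mean_score, exit_code

print(gate(RESULTS_JSON, 0.80))
```

With the sample data above, one case passes and one fails, so the check exits non-zero and the merge is blocked.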

PR comment

On every PR, EvalCI posts a comment like this:

EvalCI Results

| Test | Score | Cost | Latency |
| --- | --- | --- | --- |
| test_rag_relevancy | 0.850 | $0.0050 | 1200ms |
| test_rag_faithfulness | 0.650 | $0.0120 | 2500ms |

1/2 passed · Threshold: 0.80 · SynapseKit EvalCI
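A comment body like the one above is straightforward to derive from per-case results. This is a sketch of one way to render it; the result fields and formatting choices are assumptions, not EvalCI's internals.

```python
def render_comment(results, threshold):
    """Render a markdown PR-comment table from per-case results (assumed shape)."""
    lines = [
        "EvalCI Results",
        "",
        "| Test | Score | Cost | Latency |",
        "| --- | --- | --- | --- |",
    ]
    for r in results:
        lines.append(
            f"| {r['name']} | {r['score']:.3f} | ${r['cost']:.4f} | {r['latency_ms']}ms |"
        )
    passed = sum(1 for r in results if r["score"] >= threshold)
    lines += ["", f"{passed}/{len(results)} passed · Threshold: {threshold:.2f} · SynapseKit EvalCI"]
    return "\n".join(lines)

sample = [
    {"name": "test_rag_relevancy", "score": 0.850, "cost": 0.0050, "latency_ms": 1200},
    {"name": "test_rag_faithfulness", "score": 0.650, "cost": 0.0120, "latency_ms": 2500},
]
print(render_comment(sample, 0.80))
```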


Next steps