cairn-prod/Evaluations · 14 suites, 2,802 cases

Evaluations

Write evals once. Run them on every PR, every nightly, and as a deploy gate.

helpfulness

support-triage · llm-judge
98.2%
1,240 cases · last run 6m ago · +0.4 pts

json-format

invoice-extract · assertion
100%
840 cases · last run 12m ago · steady

regression-q2

all-agents · golden-set
95.5%
312 cases · last run 1h ago · −1.1 pts

brand-tone

docs-router · llm-judge
95.6%
410 cases · last run 3h ago · +1.8 pts

refund-policy-§3

support-triage · rubric
97.0%
208 cases · last run 5h ago · +0.9 pts

refusal-rate

all-agents · assertion
0.8%
3,218 cases · last run 6m ago · −0.3 pts