14 suites, 2,802 cases
Evaluations
Write evals once. Run them on every PR, every nightly, and as a deploy gate.
helpfulness
support-triage · llm-judge
98.2%
1,240 cases · last run 6m ago · +0.4 pts
json-format
invoice-extract · assertion
100%
840 cases · last run 12m ago · steady
regression-q2
all-agents · golden-set
95.5%
312 cases · last run 1h ago · −1.1 pts
brand-tone
docs-router · llm-judge
95.6%
410 cases · last run 3h ago · +1.8 pts
refund-policy-§3
support-triage · rubric
97.0%
208 cases · last run 5h ago · +0.9 pts
refusal-rate
all-agents · assertion
0.8%
3,218 cases · last run 6m ago · −0.3 pts