Real metrics from real AI executions. Not marketing claims.
Each agent runs against curated test scenarios with known ground truth, using real AI API calls (Anthropic Claude) that incur real cost and real latency. AI agents are non-deterministic, so results vary across runs.
We evaluate on five dimensions: Cost, Latency, Efficacy, Assurance, and Reliability. Pass@1 = first-attempt success rate. Pass@8 = success on 8 consecutive runs.
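A minimal sketch of how Pass@1 and Pass@8 could be computed from per-scenario run outcomes, following the definitions above. The data shape and function names here are illustrative assumptions, not the project's actual scoring API.

```python
# Illustrative sketch: pass@1 and pass@8 over per-scenario run outcomes.
# The dict shape and function names are assumptions for this example.

def pass_at_1(runs_by_scenario: dict[str, list[bool]]) -> float:
    """Fraction of scenarios whose first attempt succeeded."""
    return sum(runs[0] for runs in runs_by_scenario.values()) / len(runs_by_scenario)

def pass_at_8(runs_by_scenario: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where 8 consecutive runs all succeeded."""
    return sum(all(runs[:8]) for runs in runs_by_scenario.values()) / len(runs_by_scenario)

# Example: two hypothetical scenarios, eight runs each (True = success).
results = {
    "scenario-a": [True] * 8,
    "scenario-b": [True, False, True, True, True, True, True, True],
}
print(pass_at_1(results))  # 1.0 -> both first attempts succeeded
print(pass_at_8(results))  # 0.5 -> only one scenario passed all 8 runs
```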
Benchmark datasets and scoring functions are open source. Trigger a run yourself via POST /api/v1/benchmarks/run.
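A hedged sketch of triggering a run over HTTP. Only the POST /api/v1/benchmarks/run path comes from the docs above; the base URL, auth header, and request-body fields are assumptions for illustration and may differ from the real API.

```python
# Sketch of triggering a benchmark run. Base URL, auth scheme, and payload
# fields are assumptions; only the endpoint path is taken from the docs.
import requests

BASE_URL = "https://example.com"   # assumption: replace with the real host
API_KEY = "YOUR_API_KEY"           # assumption: auth scheme may differ

resp = requests.post(
    f"{BASE_URL}/api/v1/benchmarks/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"agent": "my-agent", "suite": "default"},  # assumed payload shape
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```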