Real metrics from real AI executions. Not marketing claims.
Each agent runs against curated test scenarios with known ground truth, using real AI API calls (Anthropic Claude) that incur real cost and real latency. AI agents are non-deterministic, so results vary across runs.
We evaluate on five dimensions: Cost, Latency, Efficacy, Assurance, and Reliability. Pass@1 = first-attempt success rate. Pass@8 = success on 8 consecutive runs.
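A minimal sketch of how Pass@1 and Pass@8 could be computed from per-scenario run outcomes, following the definitions above. The data shape and function names here are illustrative assumptions, not the project's actual scoring API.

```python
# Illustrative sketch: pass@1 and pass@8 over per-scenario run outcomes.
# The dict shape and function names are assumptions for this example.

def pass_at_1(runs_by_scenario: dict[str, list[bool]]) -> float:
    """Fraction of scenarios whose first attempt succeeded."""
    return sum(runs[0] for runs in runs_by_scenario.values()) / len(runs_by_scenario)

def pass_at_8(runs_by_scenario: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where 8 consecutive runs all succeeded."""
    return sum(all(runs[:8]) for runs in runs_by_scenario.values()) / len(runs_by_scenario)

# Example: two hypothetical scenarios, eight runs each (True = success).
results = {
    "scenario-a": [True] * 8,
    "scenario-b": [True, False, True, True, True, True, True, True],
}
print(pass_at_1(results))  # 1.0 -> both first attempts succeeded
print(pass_at_8(results))  # 0.5 -> only one scenario passed all 8 runs
```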
Benchmark datasets and scoring functions are open source. Trigger a run yourself via POST /api/v1/benchmarks/run.
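A hedged sketch of triggering a run over HTTP. Only the POST /api/v1/benchmarks/run path comes from the docs above; the base URL, auth header, and request-body fields are assumptions for illustration and may differ from the real API.

```python
# Sketch of triggering a benchmark run. Base URL, auth scheme, and payload
# fields are assumptions; only the endpoint path is taken from the docs.
import requests

BASE_URL = "https://example.com"   # assumption: replace with the real host
API_KEY = "YOUR_API_KEY"           # assumption: auth scheme may differ

resp = requests.post(
    f"{BASE_URL}/api/v1/benchmarks/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"agent": "my-agent", "suite": "default"},  # assumed payload shape
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```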