Agent Studio Benchmark

Agentic Search Leaderboard

How do shopping agents actually perform? We measured quality, accuracy, safety, speed, and cost across major LLMs. Open methodology, open results.

Models · Test cases · 6 Metrics · 4 Providers
Quality scores with 95% confidence bands

Six quality evals, two operational measures

Relevance

Right products, right answers. Does the response address the actual question with useful information?

Eval

Faithfulness

No hallucinations. Every claim is checked against real search results. Made-up facts get caught.

Eval

Safety

No harmful claims. Medical advice, unsafe assurances, age-restricted content: all caught before they reach customers.

Eval

Compliance

Follows instructions. Format, style, tone, required elements. What you configure is what the agent does.

Eval

Disambiguation

Asks, doesn't guess. Vague query? The agent should clarify before searching, not dump products blindly.

Eval

Language

Correct language, no mixing. Tested across 12 languages. Wrong-language replies get flagged instantly.

Eval
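To make the eval cards above concrete, here is a minimal sketch of how a single dimension (faithfulness) could be posed to an LLM judge. The rubric wording, the JSON output shape, and `call_judge_model` are placeholders, not the benchmark's actual prompts or judge API.

```python
import json

# Illustrative rubric only; the real benchmark's judge prompts are not shown on this page.
FAITHFULNESS_RUBRIC = """You are grading a shopping agent's answer.
Search results (ground truth):
{results}

Agent answer:
{answer}

For every factual claim in the answer, check whether it is supported by the
search results. Respond as JSON: {{"unsupported_claims": [...], "score": 0.0-1.0}}."""

def judge_faithfulness(results, answer, call_judge_model):
    """Check an answer's claims against the retrieved results via an LLM judge.

    `call_judge_model` is a placeholder callable: prompt in, raw JSON string out.
    """
    prompt = FAITHFULNESS_RUBRIC.format(
        results=json.dumps(results, indent=2),
        answer=answer,
    )
    raw = call_judge_model(prompt)   # placeholder LLM call
    return json.loads(raw)           # e.g. {"unsupported_claims": [...], "score": 0.9}
```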

Speed

End-to-end latency. Time-to-first-byte, time-to-first-token, and total. Real orchestration, not synthetic.

Metric

Cost

Actual cost per query. Real token consumption, real provider pricing. No estimates.

Metric
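For the two operational measures above, a minimal sketch of how per-query latency and cost can be captured. The `agent.stream()` call, the `last_usage` attribute, and the prices are stand-ins, not Agent Studio's actual API or any provider's real rates.

```python
import time

# Illustrative per-million-token prices; a real run uses each provider's published pricing.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def measure_query(agent, query):
    """Time one query end-to-end and price it from token usage.

    `agent` is a hypothetical streaming client: `stream()` is assumed to yield
    text chunks, and `last_usage` to expose prompt/completion token counts.
    """
    start = time.perf_counter()
    first_token_at = None
    chunks = []

    for chunk in agent.stream(query):             # hypothetical streaming call
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        chunks.append(chunk)

    total = time.perf_counter() - start
    usage = agent.last_usage                      # hypothetical: {"prompt_tokens": ..., "completion_tokens": ...}
    cost = (usage["prompt_tokens"] / 1e6) * PRICE_PER_M_INPUT \
         + (usage["completion_tokens"] / 1e6) * PRICE_PER_M_OUTPUT

    return {
        "time_to_first_token_s": first_token_at - start,
        "total_latency_s": total,
        "cost_usd": cost,
        "response": "".join(chunks),
    }
```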

How it works

From catalog to confidence

Generate from your index

Real products, real queries. Datasets generated directly from your catalog with natural shopping intent and constraints.

Run through the agent

Live search, real orchestration. Every case runs end-to-end through Agent Studio with tool execution and full context.

Judge automatically

Calibrated LLM judges. 6 dimensions, inter-rater reliability ≥ 0.95. No human bottleneck, no subjectivity.

Compare and decide

Statistical leaderboards with confidence intervals. See which model delivers the best quality-cost-speed trade-off for your use case.
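Putting the four steps together, a rough sketch of the benchmark loop. `generate_cases`, `run_agent`, and `judge` are placeholder callables standing in for the dataset generator, the live Agent Studio run, and the LLM judge; they are not the real interfaces.

```python
from statistics import mean

def run_benchmark(catalog, models, generate_cases, run_agent, judge, n_cases=200):
    """Sketch of the generate -> run -> judge -> compare loop.

    The three callables are placeholders:
      generate_cases(catalog, n) -> list of test cases built from the index
      run_agent(model, query)    -> the agent's full response for one case
      judge(case, response)      -> a verdict dict with a numeric "score"
    """
    cases = generate_cases(catalog, n_cases)             # step 1: queries from your index
    leaderboard = {}

    for model in models:
        scores = []
        for case in cases:
            response = run_agent(model, case["query"])   # step 2: live, end-to-end run
            verdict = judge(case, response)              # step 3: calibrated LLM judge
            scores.append(verdict["score"])
        leaderboard[model] = mean(scores)                # step 4: aggregate per model

    # Highest mean quality first; pair with cost and latency to pick a trade-off.
    return dict(sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True))
```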

Compare every model, every dimension

Click any column to sort. Bands show 95% confidence intervals: if we reran the benchmark 100 times, roughly 95 of the resulting bands would contain the model's true score.
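The page does not say how the bands are computed; one common way to produce them is a percentile bootstrap over per-case scores, sketched here under that assumption with made-up example scores.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap band for a model's mean quality score.

    Resample the per-case scores with replacement, recompute the mean each
    time, and take the central `level` share of those means as the band.
    """
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

# Example: illustrative per-case judge scores for one model on one dimension
scores = [0.92, 0.88, 1.00, 0.75, 0.95, 0.90, 0.85, 1.00, 0.80, 0.93]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}  95% band=({low:.3f}, {high:.3f})")
```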

Same question, different quality

Pick a metric. See a real query from our dataset, sent to every model through Agent Studio. We show the best, the mediocre, and the worst — because the gap is the story.


How our judge scores a response

Pick a model. See the full completion with tool calls. Then reveal the judge's verdict — the criteria, the reasoning, the score.

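As one way to picture what a revealed verdict contains, here is a hypothetical record holding the criteria, reasoning, and score described above; the field names and values are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Illustrative shape of one judge verdict; fields are hypothetical."""
    dimension: str       # e.g. "faithfulness"
    criteria: list[str]  # rubric items the judge checked
    reasoning: str       # the judge's written justification
    score: float         # 0.0-1.0, aggregated into the leaderboard

verdict = JudgeVerdict(
    dimension="faithfulness",
    criteria=[
        "Every product attribute in the answer appears in the search results",
        "No prices, specs, or availability are invented",
    ],
    reasoning="Both recommended products exist in the returned results; "
              "the quoted battery life matches the catalog attribute.",
    score=1.0,
)
print(verdict)
```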

Build your agent and measure how it performs. Same benchmarks, your data.

Try Agent Studio