Agent Studio Benchmark

Agentic Search Leaderboard

How do shopping agents actually perform? We measured quality, accuracy, safety, speed, and cost across major LLMs. Open methodology, open results.

Models · Test cases · 6 Metrics · 4 Providers
Quality scores with 95% confidence bands

Six quality evals, two operational measures

Relevance

Right products, right answers. Does the response address the actual question with useful information?

Eval

Faithfulness

No hallucinations. Every claim is checked against real search results. Made-up facts get caught.

Eval

Safety

No harmful claims. Medical advice, unsafe assurances, age-restricted content: all caught before they reach customers.

Eval

Compliance

Follows instructions. Format, style, tone, required elements. What you configure is what the agent does.

Eval

Disambiguation

Asks, doesn't guess. Vague query? The agent should clarify before searching, not dump products blindly.

Eval

Language

Correct language, no mixing. Tested across 12 languages. Wrong-language replies get flagged instantly.

Eval
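To make the eval cards above concrete, here is a minimal sketch of how a single dimension (faithfulness) could be posed to an LLM judge. The rubric wording, the JSON output shape, and `call_judge_model` are placeholders, not the benchmark's actual prompts or judge API.

```python
import json

# Illustrative rubric only; the real benchmark's judge prompts are not shown on this page.
FAITHFULNESS_RUBRIC = """You are grading a shopping agent's answer.
Search results (ground truth):
{results}

Agent answer:
{answer}

For every factual claim in the answer, check whether it is supported by the
search results. Respond as JSON: {{"unsupported_claims": [...], "score": 0.0-1.0}}."""

def judge_faithfulness(results, answer, call_judge_model):
    """Check an answer's claims against the retrieved results via an LLM judge.

    `call_judge_model` is a placeholder callable: prompt in, raw JSON string out.
    """
    prompt = FAITHFULNESS_RUBRIC.format(
        results=json.dumps(results, indent=2),
        answer=answer,
    )
    raw = call_judge_model(prompt)   # placeholder LLM call
    return json.loads(raw)           # e.g. {"unsupported_claims": [...], "score": 0.9}
```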

Speed

End-to-end latency. Time-to-first-byte, time-to-first-token, and total. Real orchestration, not synthetic.

Metric

Cost

Actual cost per query. Real token consumption, real provider pricing. No estimates.

Metric
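For the two operational measures above, a minimal sketch of how per-query latency and cost can be captured. The `agent.stream()` call, the `last_usage` attribute, and the prices are stand-ins, not Agent Studio's actual API or any provider's real rates.

```python
import time

# Illustrative per-million-token prices; a real run uses each provider's published pricing.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def measure_query(agent, query):
    """Time one query end-to-end and price it from token usage.

    `agent` is a hypothetical streaming client: `stream()` is assumed to yield
    text chunks, and `last_usage` to expose prompt/completion token counts.
    """
    start = time.perf_counter()
    first_token_at = None
    chunks = []

    for chunk in agent.stream(query):             # hypothetical streaming call
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        chunks.append(chunk)

    total = time.perf_counter() - start
    usage = agent.last_usage                      # hypothetical: {"prompt_tokens": ..., "completion_tokens": ...}
    cost = (usage["prompt_tokens"] / 1e6) * PRICE_PER_M_INPUT \
         + (usage["completion_tokens"] / 1e6) * PRICE_PER_M_OUTPUT

    return {
        "time_to_first_token_s": first_token_at - start,
        "total_latency_s": total,
        "cost_usd": cost,
        "response": "".join(chunks),
    }
```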

How it works

From catalog to confidence

Generate from your index

Real products, real queries. Datasets generated directly from your catalog with natural shopping intent and constraints.

Run through the agent

Live search, real orchestration. Every case runs end-to-end through Agent Studio with tool execution and full context.

Judge automatically

Calibrated LLM judges. 6 dimensions, inter-rater reliability ≥ 0.95. No human bottleneck, no subjectivity.

Compare and decide

Statistical leaderboards with confidence intervals. See which model delivers the best quality-cost-speed trade-off for your use case.
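Putting the four steps together, a rough sketch of the benchmark loop. `generate_cases`, `run_agent`, and `judge` are placeholder callables standing in for the dataset generator, the live Agent Studio run, and the LLM judge; they are not the real interfaces.

```python
from statistics import mean

def run_benchmark(catalog, models, generate_cases, run_agent, judge, n_cases=200):
    """Sketch of the generate -> run -> judge -> compare loop.

    The three callables are placeholders:
      generate_cases(catalog, n) -> list of test cases built from the index
      run_agent(model, query)    -> the agent's full response for one case
      judge(case, response)      -> a verdict dict with a numeric "score"
    """
    cases = generate_cases(catalog, n_cases)             # step 1: queries from your index
    leaderboard = {}

    for model in models:
        scores = []
        for case in cases:
            response = run_agent(model, case["query"])   # step 2: live, end-to-end run
            verdict = judge(case, response)              # step 3: calibrated LLM judge
            scores.append(verdict["score"])
        leaderboard[model] = mean(scores)                # step 4: aggregate per model

    # Highest mean quality first; pair with cost and latency to pick a trade-off.
    return dict(sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True))
```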

Compare every model, every dimension

Click any column to sort. Bands show 95% confidence intervals: if we reran the benchmark 100 times, roughly 95 of the resulting bands would contain the model's true score.
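The page does not say how the bands are computed; one common way to produce them is a percentile bootstrap over per-case scores, sketched here under that assumption with made-up example scores.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap band for a model's mean quality score.

    Resample the per-case scores with replacement, recompute the mean each
    time, and take the central `level` share of those means as the band.
    """
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

# Example: illustrative per-case judge scores for one model on one dimension
scores = [0.92, 0.88, 1.00, 0.75, 0.95, 0.90, 0.85, 1.00, 0.80, 0.93]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}  95% band=({low:.3f}, {high:.3f})")
```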

Same question, different quality

Pick a metric. See a real query from our dataset, sent to every model through Agent Studio. We show the best, the mediocre, and the worst — because the gap is the story.


How our judge scores a response

Pick a model. See the full completion with tool calls. Then reveal the judge's verdict — the criteria, the reasoning, the score.

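As one way to picture what a revealed verdict contains, here is a hypothetical record holding the criteria, reasoning, and score described above; the field names and values are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Illustrative shape of one judge verdict; fields are hypothetical."""
    dimension: str       # e.g. "faithfulness"
    criteria: list[str]  # rubric items the judge checked
    reasoning: str       # the judge's written justification
    score: float         # 0.0-1.0, aggregated into the leaderboard

verdict = JudgeVerdict(
    dimension="faithfulness",
    criteria=[
        "Every product attribute in the answer appears in the search results",
        "No prices, specs, or availability are invented",
    ],
    reasoning="Both recommended products exist in the returned results; "
              "the quoted battery life matches the catalog attribute.",
    score=1.0,
)
print(verdict)
```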

Build your agent and measure how it performs. Same benchmarks, your data.

Try Agent Studio