How do shopping agents actually perform? We measured quality, accuracy, safety, speed, and cost across major LLMs. Open methodology, open results.
What we measure
Right products, right answers. Does the response address the actual question with useful information? (Eval)
No hallucinations. Every claim is checked against real search results. Made-up facts get caught. (Eval)
No harmful claims. Medical advice, unsafe assurances, age-restricted content. Tested before it reaches customers. (Eval)
Follows instructions. Format, style, tone, required elements. What you configure is what the agent does. (Eval)
Asks, doesn't guess. Vague query? The agent should clarify before searching, not dump products blindly. (Eval)
Correct language, no mixing. Tested across 12 languages. Wrong-language replies get flagged instantly. (Eval)
End-to-end latency. Time-to-first-byte, time-to-first-token, and total. Real orchestration, not synthetic. (Metric)
Actual cost per query. Real token consumption, real provider pricing. No estimates. (Metric)
How it works
Real products, real queries. Datasets generated directly from your catalog with natural shopping intent and constraints.
Live search, real orchestration. Every case runs end-to-end through Agent Studio with tool execution and full context.
Calibrated LLM judges. 6 dimensions, inter-rater reliability ≥ 0.95. No human bottleneck, no subjectivity.
Statistical leaderboards with confidence intervals. See which model delivers the best quality-cost-speed trade-off for your use case.
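As an illustration of what calibrating a judge can involve, here is a minimal sketch that computes Cohen's kappa between an LLM judge's verdicts and a set of reference labels. The page does not say which reliability statistic sits behind the 0.95 threshold, so the choice of kappa, the function name `cohens_kappa`, and the pass/fail labels are all assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Share of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts, for illustration only (not benchmark data).
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
reference = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(judge, reference):.2f}")  # kappa = 0.75
```

A judge would count as calibrated under this sketch only once its kappa against the reference labels clears the chosen threshold on every dimension.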
Full leaderboard
Click any column to sort. Bands show 95% confidence intervals: if we reran the benchmark many times, about 95% of the intervals computed this way would contain the model's true score.
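One common way to obtain bands like these is a percentile bootstrap over per-case scores. The sketch below assumes that approach. The page does not state how its intervals are computed, and the function name `bootstrap_ci`, its parameters, and the sample scores are hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the cases with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


# Hypothetical per-case judge scores for one model (illustrative values only).
example_scores = [0.9, 0.8, 1.0, 0.7, 0.95, 0.85, 0.6, 1.0, 0.9, 0.75]
low, high = bootstrap_ci(example_scores)
print(f"mean={statistics.fmean(example_scores):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

When two models' bands overlap heavily, their ranking on that column should be read as a near tie, not a clear win.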
How models compare
Pick a metric. See a real query from our dataset, sent to every model through Agent Studio. We show the best, the mediocre, and the worst — because the gap is the story.
Inside the verdict
Pick a model. See the full completion with tool calls. Then reveal the judge's verdict — the criteria, the reasoning, the score.
Build your agent and measure how it performs. Same benchmarks, your data.