When ExoBench Gets It Wrong
Catching and fixing AI assumptions in benchmarks
ExoBench is AI-driven. The AI makes assumptions. Those assumptions are often wrong. Here's how to catch and fix that.
Step 1: Question the Schema the AI Constructed
After ExoBench runs a benchmark, ask:
"Show me the exact schema you used."
Read it. Does it match your actual schema? Common mistakes:
- AI added indexes you don't have (making the benchmark look optimistically fast)
- AI omitted indexes you DO have (making the benchmark look pessimistically slow)
- AI created a table when you actually have a view
- AI used wrong column types (TEXT vs VARCHAR, INT vs BIGINT)
- AI used different column names than your actual schema
If anything is wrong, tell it: "That's wrong. Here's my actual schema: [paste it]. Re-run the benchmark."
You don't have to paste your entire database schema. Just correct the parts that are wrong. If the AI assumed 3 columns but your table has 20, paste the real CREATE TABLE. If it's missing an index, add it. Iterate.
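Why a single wrong schema assumption matters this much: one missing index flips the whole plan. Here's a minimal sketch using Python's built-in sqlite3 as a stand-in for whatever engine you're actually benchmarking; the `items` table and index name are hypothetical:

```python
import sqlite3

# Hypothetical items table; sqlite3 stands in for the real target engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status INTEGER, created_at TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the step description in the last column.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM items WHERE status = 1"

before = plan(query)  # no index: full table scan
conn.execute("CREATE INDEX idx_items_status ON items (status)")
after = plan(query)   # same query, different plan once the index exists

print(before)  # e.g. ['SCAN items']
print(after)   # e.g. ['SEARCH items USING INDEX idx_items_status (status=?)']
```

Same query, two different plans, purely because of one index the AI may or may not have assumed. That's the gap Step 1 exists to close.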
Step 2: Question the Data Generation
This is where most benchmarks go sideways. The AI generated data, but is it realistic?
Ask ExoBench to verify its own data:
"Run a SELECT COUNT(*) grouped by status on the generated data. What's the distribution?"
"What's the cardinality of category_id in the generated item_categories table?"
"Show me the min/max/avg of created_at in the items table."
"How many orders does the average customer have? What about the max?"
ExoBench can run ad-hoc queries on its generated data using the runSql tool. Use this. If the distribution is wrong, the benchmark is wrong.
Why this matters: As the data distribution example showed, a status column that's 90% 'completed' produces a completely different query plan than one evenly split across 5 values. The data distribution directly controls which plan the optimizer picks.
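You can sanity-check a distribution yourself before trusting the AI's summary of it. A minimal sketch in plain Python, using hypothetical generated statuses, comparing an even five-way split to the 90/10 skew production data often has:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical generated data: an even split across 5 statuses
# versus the 90/10 skew you might actually see in production.
even = [random.randint(1, 5) for _ in range(100_000)]
skewed = random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1], k=100_000)

def distribution(rows):
    counts = Counter(rows)
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in sorted(counts.items())}

print(distribution(even))    # roughly 20% per status
print(distribution(skewed))  # status 1 dominates at roughly 90%
```

Either shape is "valid data"; only one of them matches your production reality, and only one of them exercises the plan you actually care about.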
Common data issues to check for:
- "90% of items have status=1" but the AI generated an even split across 5 statuses
- "We have 50 categories for 2M products" but the AI scaled categories linearly alongside products
- "created_at is clustered in the last 2 years" but the AI spread timestamps uniformly over 10 years
- "customer_id follows a power-law distribution" but the AI assigned customers uniformly
If the distribution is wrong, tell the AI: "The distribution is wrong. 90% of items should have status=1, and created_at should be concentrated in the last 18 months, not spread over 10 years. Fix the data generation and re-run."
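If you want to see what a corrected generator might look like, here's a hedged sketch in Python. The weights, decay rate, and customer-id cap are illustrative assumptions, not ExoBench internals:

```python
import random
from datetime import datetime, timedelta

random.seed(42)
NOW = datetime(2025, 1, 1)

def generate_item():
    # 90% of items get status=1, matching the correction above.
    status = random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1])[0]
    # Exponential decay (assumed mean of 120 days) concentrates timestamps
    # near NOW; almost all land inside the last 18 months.
    days_ago = min(random.expovariate(1 / 120), 10 * 365)
    created_at = NOW - timedelta(days=days_ago)
    # paretovariate gives a heavy tail: a few customers own many orders.
    customer_id = min(int(random.paretovariate(1.2)), 50_000)
    return status, created_at, customer_id

items = [generate_item() for _ in range(50_000)]
share_status_1 = sum(1 for s, _, _ in items if s == 1) / len(items)
print(f"status=1 share: {share_status_1:.0%}")
```

The point isn't these exact parameters; it's that skew is cheap to generate once you've told the AI which distributions your production data actually follows.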
Step 3: Question the AI's Conclusions
The AI will look at EXPLAIN ANALYZE output and tell you what it thinks. Challenge it:
- "You said the index helps, but the runtime only dropped 10%. Is that within noise?"
- "You're comparing 10K rows to 50K rows. My production table has 5M. Can you extrapolate whether this index strategy holds at that scale?"
- "The plan shows Index Only Scan, but did you run VACUUM ANALYZE? Is the visibility map correct?"
- "You said this is a seq scan, but the estimated rows are way off from actual rows. Doesn't that mean the statistics are stale?"
- "You're recommending a covering index, but won't that slow down my writes? What's the trade-off?"
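On the "is 10% within noise" question specifically, a quick back-of-envelope check is to compare the improvement against the run-to-run spread. The timings below are hypothetical, and the 2x-spread rule of thumb is a rough heuristic, not a formal significance test:

```python
import statistics

# Hypothetical repeated timings (ms) from before/after benchmark runs.
baseline = [101.0, 98.5, 103.2, 99.8, 100.4]
with_index = [91.2, 89.7, 92.5, 90.1, 90.8]

def summarize(runs):
    return statistics.mean(runs), statistics.stdev(runs)

base_mean, base_sd = summarize(baseline)
idx_mean, idx_sd = summarize(with_index)

improvement = (base_mean - idx_mean) / base_mean
# Crude rule of thumb: the gap should sit well beyond the combined spread
# of the two run sets before you trust it.
noise_floor = 2 * (base_sd + idx_sd) / base_mean

print(f"improvement: {improvement:.1%}, noise floor: {noise_floor:.1%}")
```

If the improvement doesn't clear the noise floor, ask the AI for more runs before accepting its conclusion.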
Step 4: Iterate
ExoBench benchmarks are cheap. Run many of them. Don't accept the first result.
A good workflow:
- Baseline. Benchmark your current query with whatever schema info you have.
- Hypothesis. "I think a composite index on (status, created_at DESC) will help."
- Test. Benchmark with the new index added.
- Validate. Check the plan changed the way you expected.
- Stress. Test at multiple data volumes in a single benchmark run. ExoBench supports up to 5 scale points per request, so you can test 50K, 100K, 200K, and 500K all at once. You don't need millions of rows. Most query plans fully stabilize by 300K-500K. ExoBench enforces a 2M row limit per scale point, but you'll rarely need anywhere near that.
- Edge case. Test with skewed data distributions that match your production reality.
If the results surprise you, go back to Steps 1-2 and check whether the schema and data match what you actually have.
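The baseline-hypothesis-test loop above can be sketched in a few lines. Here sqlite3 again stands in for the real engine, and the table, query, and scale points are illustrative, not a description of how ExoBench runs things internally:

```python
import random
import sqlite3
import time

random.seed(1)

def benchmark(scale, with_index):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status INTEGER, created_at INTEGER)")
    rows = [(random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1])[0],
             random.randrange(10**6)) for _ in range(scale)]
    conn.executemany("INSERT INTO items (status, created_at) VALUES (?, ?)", rows)
    if with_index:
        # The hypothesis from the workflow: a composite index on (status, created_at DESC).
        conn.execute("CREATE INDEX idx_status_created ON items (status, created_at DESC)")
    conn.execute("ANALYZE")
    start = time.perf_counter()
    conn.execute("SELECT * FROM items WHERE status = 2 "
                 "ORDER BY created_at DESC LIMIT 50").fetchall()
    return time.perf_counter() - start

# Multiple scale points in one pass, mirroring ExoBench's multi-scale runs.
results = {scale: (benchmark(scale, False), benchmark(scale, True))
           for scale in (50_000, 100_000)}
for scale, (base, indexed) in results.items():
    print(f"{scale} rows: baseline {base * 1000:.1f} ms, indexed {indexed * 1000:.1f} ms")
```

A loop like this makes the surprise-check concrete: if the indexed timing doesn't improve the way the plan suggested it should, that's your cue to revisit the schema and data assumptions from Steps 1-2.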