When ExoBench Gets It Wrong
Catching and fixing AI assumptions in benchmarks
ExoBench is AI-driven. The AI makes assumptions. Those assumptions are often wrong. Here's how to catch and fix that.
Step 1: Question the Schema the AI Constructed
After ExoBench runs a benchmark, ask:
"Show me the exact schema you used."
Read it. Does it match your actual schema? Common mistakes:
- AI added indexes you don't have (making the benchmark look optimistically fast)
- AI omitted indexes you DO have (making the benchmark look pessimistically slow)
- AI created a table when you actually have a view
- AI used wrong column types (TEXT vs VARCHAR, INT vs BIGINT)
- AI used different column names than your actual schema
If anything is wrong, tell it: "That's wrong. Here's my actual schema: [paste it]. Re-run the benchmark."
You don't have to paste your entire database schema. Just correct the parts that are wrong. If the AI assumed 3 columns but your table has 20, paste the real CREATE TABLE. If it's missing an index, add it. Iterate.
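Why a single wrong schema assumption matters this much: one missing index flips the whole plan. Here's a minimal sketch using Python's built-in sqlite3 as a stand-in for whatever engine you're actually benchmarking; the `items` table and index name are hypothetical:

```python
import sqlite3

# Hypothetical items table; sqlite3 stands in for the real target engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status INTEGER, created_at TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the step description in the last column.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM items WHERE status = 1"

before = plan(query)  # no index: full table scan
conn.execute("CREATE INDEX idx_items_status ON items (status)")
after = plan(query)   # same query, different plan once the index exists

print(before)  # e.g. ['SCAN items']
print(after)   # e.g. ['SEARCH items USING INDEX idx_items_status (status=?)']
```

Same query, two different plans, purely because of one index the AI may or may not have assumed. That's the gap Step 1 exists to close.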
Step 2: Question the Data Generation
This is where most benchmarks go sideways. The AI generated data, but is it realistic?
Ask ExoBench to verify its own data:
"Run a SELECT COUNT(*) grouped by status on the generated data. What's the distribution?"
"What's the cardinality of category_id in the generated item_categories table?"
"Show me the min/max/avg of created_at in the items table."
"How many orders does the average customer have? What about the max?"
ExoBench can run ad-hoc queries on its generated data using the runSql tool. Use this. If the distribution is wrong, the benchmark is wrong.
Why this matters: As the data distribution example showed, a status column that's 90% 'completed' produces a completely different query plan than one evenly split across 5 values. The data distribution directly controls which plan the optimizer picks.
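You can sanity-check a distribution yourself before trusting the AI's summary of it. A minimal sketch in plain Python, using hypothetical generated statuses, comparing an even five-way split to the 90/10 skew production data often has:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical generated data: an even split across 5 statuses
# versus the 90/10 skew you might actually see in production.
even = [random.randint(1, 5) for _ in range(100_000)]
skewed = random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1], k=100_000)

def distribution(rows):
    counts = Counter(rows)
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in sorted(counts.items())}

print(distribution(even))    # roughly 20% per status
print(distribution(skewed))  # status 1 dominates at roughly 90%
```

Either shape is "valid data"; only one of them matches your production reality, and only one of them exercises the plan you actually care about.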
Common data issues to check for:
- "90% of items have status=1" but the AI generated an even split across 5 statuses
- "We have 50 categories for 2M products" but the AI scaled categories linearly alongside products
- "created_at is clustered in the last 2 years" but the AI spread timestamps uniformly over 10 years
- "customer_id follows a power-law distribution" but the AI assigned customers uniformly
If the distribution is wrong, tell the AI: "The distribution is wrong. 90% of items should have status=1, and created_at should be concentrated in the last 18 months, not spread over 10 years. Fix the data generation and re-run."
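If you want to see what a corrected generator might look like, here's a hedged sketch in Python. The weights, decay rate, and customer-id cap are illustrative assumptions, not ExoBench internals:

```python
import random
from datetime import datetime, timedelta

random.seed(42)
NOW = datetime(2025, 1, 1)

def generate_item():
    # 90% of items get status=1, matching the correction above.
    status = random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1])[0]
    # Exponential decay (assumed mean of 120 days) concentrates timestamps
    # near NOW; almost all land inside the last 18 months.
    days_ago = min(random.expovariate(1 / 120), 10 * 365)
    created_at = NOW - timedelta(days=days_ago)
    # paretovariate gives a heavy tail: a few customers own many orders.
    customer_id = min(int(random.paretovariate(1.2)), 50_000)
    return status, created_at, customer_id

items = [generate_item() for _ in range(50_000)]
share_status_1 = sum(1 for s, _, _ in items if s == 1) / len(items)
print(f"status=1 share: {share_status_1:.0%}")
```

The point isn't these exact parameters; it's that skew is cheap to generate once you've told the AI which distributions your production data actually follows.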
Step 3: Question the AI's Conclusions
The AI will look at EXPLAIN ANALYZE output and tell you what it thinks. Challenge it:
- "You said the index helps, but the runtime only dropped 10%. Is that within noise?"
- "You're comparing 10K rows to 50K rows. My production table has 5M. Can you extrapolate whether this index strategy holds at that scale?"
- "The plan shows Index Only Scan, but did you run VACUUM ANALYZE? Is the visibility map correct?"
- "You said this is a seq scan, but the estimated rows are way off from actual rows. Doesn't that mean the statistics are stale?"
- "You're recommending a covering index, but won't that slow down my writes? What's the trade-off?"
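On the "is 10% within noise" question specifically, a quick back-of-envelope check is to compare the improvement against the run-to-run spread. The timings below are hypothetical, and the 2x-spread rule of thumb is a rough heuristic, not a formal significance test:

```python
import statistics

# Hypothetical repeated timings (ms) from before/after benchmark runs.
baseline = [101.0, 98.5, 103.2, 99.8, 100.4]
with_index = [91.2, 89.7, 92.5, 90.1, 90.8]

def summarize(runs):
    return statistics.mean(runs), statistics.stdev(runs)

base_mean, base_sd = summarize(baseline)
idx_mean, idx_sd = summarize(with_index)

improvement = (base_mean - idx_mean) / base_mean
# Crude rule of thumb: the gap should sit well beyond the combined spread
# of the two run sets before you trust it.
noise_floor = 2 * (base_sd + idx_sd) / base_mean

print(f"improvement: {improvement:.1%}, noise floor: {noise_floor:.1%}")
```

If the improvement doesn't clear the noise floor, ask the AI for more runs before accepting its conclusion.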
Step 4: Iterate
ExoBench benchmarks are cheap. Run many of them. Don't accept the first result.
A good workflow:
- Baseline. Benchmark your current query with whatever schema info you have.
- Hypothesis. "I think a composite index on (status, created_at DESC) will help."
- Test. Benchmark with the new index added.
- Validate. Check the plan changed the way you expected.
- Stress. Test at multiple data volumes in a single benchmark run. ExoBench supports up to 5 scale points per request, so you can test 50K, 100K, 200K, and 500K all at once. You don't need millions of rows. Most query plans fully stabilize by 300K-500K. ExoBench enforces a 2M row limit per scale point, but you'll rarely need anywhere near that.
- Edge case. Test with skewed data distributions that match your production reality.
If the results surprise you, go back to Steps 1-2 and check whether the schema and data match what you actually have.
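The baseline-hypothesis-test loop above can be sketched in a few lines. Here sqlite3 again stands in for the real engine, and the table, query, and scale points are illustrative, not a description of how ExoBench runs things internally:

```python
import random
import sqlite3
import time

random.seed(1)

def benchmark(scale, with_index):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status INTEGER, created_at INTEGER)")
    rows = [(random.choices([1, 2, 3, 4, 5], weights=[90, 4, 3, 2, 1])[0],
             random.randrange(10**6)) for _ in range(scale)]
    conn.executemany("INSERT INTO items (status, created_at) VALUES (?, ?)", rows)
    if with_index:
        # The hypothesis from the workflow: a composite index on (status, created_at DESC).
        conn.execute("CREATE INDEX idx_status_created ON items (status, created_at DESC)")
    conn.execute("ANALYZE")
    start = time.perf_counter()
    conn.execute("SELECT * FROM items WHERE status = 2 "
                 "ORDER BY created_at DESC LIMIT 50").fetchall()
    return time.perf_counter() - start

# Multiple scale points in one pass, mirroring ExoBench's multi-scale runs.
results = {scale: (benchmark(scale, False), benchmark(scale, True))
           for scale in (50_000, 100_000)}
for scale, (base, indexed) in results.items():
    print(f"{scale} rows: baseline {base * 1000:.1f} ms, indexed {indexed * 1000:.1f} ms")
```

A loop like this makes the surprise-check concrete: if the indexed timing doesn't improve the way the plan suggested it should, that's your cue to revisit the schema and data assumptions from Steps 1-2.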