ExoBench Now Runs in Your Browser

No install, no setup. Paste a slow query at exobench.ai and watch it spin up a real Postgres instance at 100K, 1M, and 3M rows and find a good index.

Summary for the Impatient

ExoBench now runs as a chatbot at exobench.ai — no install, no setup. Paste a slow SQL query and an AI agent calls ExoBench, which spins up a real PostgreSQL database at multiple scales (100K, 1M, 3M rows), runs EXPLAIN ANALYZE, and tries indexes until the plan stops improving. The lesson: query plans change with scale, so the "obvious" index can win at small scale and lose at large scale — in the demo, a partial index sped the query up at 100K rows but made it slower at 3M, while a covering index stayed 4–7x faster everywhere. A naive LLM guesses an index from the SQL; ExoBench measures it at production scale.

Nobody Follows the Query Planner Plot

There's a Christopher Nolan movie rule: nobody really follows the plot. You nod along for the whole film and agree that some kind of ineffable intelligence strung together the scenes. Databases run on the same rule.

Nobody really understands how a query plan scales. We poke it with indexes until it gets fast.

Say it at a standup and watch every senior engineer quietly nod. The planner is a black box that changes its mind as your tables grow, and the workflow is: add an index, run the query, check whether the number dropped, and repeat until something works. The problem is that you cannot poke it at production scale. So you play around with it on your laptop, double-check that it doesn't break in staging, and quietly ship it, hoping and praying that production data lives in the same reality.

ExoBench does the poking, does it at scale, and even understands the plot well enough to hand you back the right index. It now runs as a chatbot at exobench.ai. You paste in a slow query, an AI agent calls ExoBench, and ExoBench spins up a real database at multiple scales. At each scale, it runs EXPLAIN ANALYZE and tries one index after another until the plan stops improving. You watch, or you don't. The numbers below are the default demo, on PostgreSQL across 100K, 1M, and 3M rows.

The query is the kind we have all shipped:

SELECT id, customer_name, total, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC;

A pending-orders list, newest first. It looks innocent, but it gets slower as the table grows, and it gets slower in a different way at each size. Nobody with sprint deliverables is going to track this in their head.

The Baseline Is Three Plans

The agent measures the query as written, primary key index only, across three scales at once. Here's what shows up in the chatbot:

Benchmark SQL #1 of 3 (benchmarkSql complete, MCP Tool)
postgres, 21.25s compute, warm ×5

Scale | Time     | Plan                                    | Compute
100K  | 10.2 ms  | Seq Scan + Sort                         | 3.09s
1M    | 62.9 ms  | Parallel Seq Scan + Sort + Gather Merge | 4.83s
3M    | 185.3 ms | Parallel Seq Scan + Sort + Gather Merge | 13.34s

Here comes the plan-reading. You can skim it. Seriously, skim it, I'll meet you at the number.

At 100K it's a sequential scan and an in-memory sort. At 1M Postgres decides to go parallel, because sure, why not? At 3M it stays parallel and the sort runs out of memory and spills to disk, which is the database sighing heavily. Three plans, one query. The planner changes its mind twice without telling you why.

Follow all of that? Doesn't matter. Here's the number: 10.2 ms, then 62.9 ms, then 185.3 ms. The line goes the wrong way. An LLM reading the SQL sees one query and one plan. ExoBench saw three, because it ran three.

If you want the why (nerds, welcome): ten percent of the rows are pending, the scan reads every row in the table anyway, and the sort is the expensive part. The fix keeps just the pending rows in date order. If you don't want the why: an index goes here. Onward.
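If you'd rather poke the baseline yourself, here is a minimal sketch of a synthetic table at the smallest scale. Only the column names come from the query above; the column types, the modulo trick that makes exactly 10% of rows pending, and the shipped/delivered proportions are assumptions for illustration, not ExoBench's actual generator.

```sql
-- Hypothetical schema: column names are from the demo query,
-- the types and the data distribution are assumptions.
CREATE TABLE orders (
    id            bigint PRIMARY KEY,
    customer_name text NOT NULL,
    total         numeric(10, 2) NOT NULL,
    status        text NOT NULL,
    created_at    timestamptz NOT NULL
);

-- 100K synthetic rows; every tenth row is 'pending' (10%).
INSERT INTO orders (id, customer_name, total, status, created_at)
SELECT g,
       'customer_' || g,
       round((random() * 500)::numeric, 2),
       CASE WHEN g % 10 = 0 THEN 'pending'
            WHEN g % 10 < 5 THEN 'shipped'
            ELSE 'delivered'
       END,
       now() - random() * interval '365 days'
FROM generate_series(1, 100000) AS g;

-- Refresh planner statistics before measuring.
ANALYZE orders;

-- Baseline: primary key index only.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_name, total, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC;
```

Change the upper bound of generate_series to 1000000 or 3000000 to approximate the larger scales. Your timings will differ from the demo's, since hardware and configuration differ; the plan shape is the part worth comparing.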

The Obvious Index Traps at Scale

The agent adds the index any of us would add: a partial index. The one you'd swear by.

CREATE INDEX idx_pending_created
ON orders (created_at DESC)
WHERE status = 'pending';

Only here's what happens:

Benchmark SQL #2 of 3 (benchmarkSql complete, MCP Tool)
postgres, 19.26s compute, warm ×5

Scale | Time     | Plan                                       | Compute
100K  | 6.7 ms   | Bitmap Heap Scan + Bitmap Index Scan + Sort | 0.91s
1M    | 83.0 ms  | Bitmap Heap Scan + Bitmap Index Scan + Sort | 4.76s
3M    | 245.0 ms | Bitmap Heap Scan + Bitmap Index Scan + Sort | 13.59s

At 100K it worked. 10.2 down to 6.7. If your staging box holds a hundred thousand rows, you ship this, close the ticket, and call it a day.

At 3M it made things worse: 245.0 ms, slower than the 185.3 ms you started with. The flip has already happened at 1M, where 83.0 ms loses to the 62.9 ms baseline. The obvious index, the one we were all so sure about, loses to doing nothing.

Cool. Cool cool cool.

The skippable reason: the planner picked a Bitmap Index Scan, which throws away the ordering the index existed to provide, so it sorts anyway and spills anyway. The unskippable reason it matters: you would have shipped this. It looked great at staging-scale. Nobody sees the flip coming, because the planner is ineffable... except for ExoBench.
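For anyone who wants to see the skippable reason with their own eyes, one way to check it is sketched below. It assumes an orders table with idx_pending_created already built, and it uses the standard enable_bitmapscan planner setting purely as a diagnostic toggle inside a transaction. Whether the ordered index scan actually beats the bitmap plan depends on your data, so treat the result as a measurement, not a recommendation.

```sql
-- Assumes the orders table and idx_pending_created from the demo exist.
-- Step 1: the plan as the planner chooses it. In the output, look for
-- "Sort Method: external merge  Disk: ..." (the spill) versus
-- "Sort Method: quicksort  Memory: ..." (the sort fit in memory).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_name, total, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC;

-- Step 2: discourage bitmap scans for this transaction only, to see
-- what the planner does when it cannot take that route.
BEGIN;
SET LOCAL enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_name, total, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC;

ROLLBACK;  -- SET LOCAL ends with the transaction either way
```

If the second plan walks the index in order and drops the Sort node, you are looking at the ordering the bitmap plan threw away. This is for diagnosis only; leaving planner toggles off in production is a different kind of trap.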

Presto! The Plan Goes Flat

Now here's where ExoBench starts earning its keep. It feeds real stats back to your agent, and the agent shrugs and keeps thinking. The next thing it tries is making the index covering, so the database never touches the table.

CREATE INDEX idx_pending_cov
ON orders (created_at DESC)
INCLUDE (id, customer_name, total)
WHERE status = 'pending';

Here's what happens:

Benchmark SQL #3 of 3 (benchmarkSql complete, MCP Tool)
postgres, 17.84s compute, warm ×5

Scale | Time    | Plan            | Compute
100K  | 1.5 ms  | Index Only Scan | 0.77s
1M    | 14.9 ms | Index Only Scan | 4.33s
3M    | 44.8 ms | Index Only Scan | 12.74s

Index Only Scan at every scale. 1.5 ms, 14.9, 44.8. Heap Fetches: 0, no sort, no spill, the same flat green plan whether you hold a hundred thousand rows or three million.

Why does covering the index fix it? Skim away: the database walks the index in order and reads the selected columns straight out of it, so nothing is left to sort and nothing is left to fetch. Or the version you came for: it's fast now. You're done.
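One thing worth checking when you reproduce this: "Heap Fetches: 0" depends on PostgreSQL's visibility map, which VACUUM maintains. On a table with recent writes, an Index Only Scan may still visit the heap for some rows. The sketch below assumes the orders table and idx_pending_cov from the demo exist; it is a way to verify the plan on your own data, not a description of what ExoBench runs internally.

```sql
-- Assumes the orders table and idx_pending_cov from the demo exist.
-- Update the visibility map and planner statistics first.
VACUUM (ANALYZE) orders;

-- Expect "Index Only Scan using idx_pending_cov" and a low
-- "Heap Fetches" count in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_name, total, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC;

-- The INCLUDE columns are stored in the index, so check what it costs on disk.
SELECT pg_size_pretty(pg_relation_size('idx_pending_cov')) AS index_size;
```

If Heap Fetches climbs on a busy table, the covering index is still doing its job; the table just needs vacuuming often enough to keep pages marked all-visible.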

The Plot Synopsis

One query. Three index candidates. Three scales. Nine benchmarks. Here's the entire movie, and you can nod along.

Figure: one query, three index candidates, three scales. The baseline climbs from 10.2 ms to 185.3 ms. The obvious partial index is the trap, slower than baseline at 1M and 3M (83.0 ms, 245.0 ms). The covering index is fastest at every scale (1.5 ms, 14.9 ms, 44.8 ms).

The final results, with the speedup over baseline:

Scale | Baseline (PK only) | Partial index | Covering index
100K  | 10.2 ms            | 6.7 ms        | 1.5 ms (6.8x)
1M    | 62.9 ms            | 83.0 ms       | 14.9 ms (4.2x)
3M    | 185.3 ms           | 245.0 ms      | 44.8 ms (4.1x)
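The speedup figures are just baseline time divided by indexed time. A small sketch that recomputes them from the measured numbers above, with a second ratio for the plain partial index added for comparison (the column aliases are made up for this example):

```sql
-- Timings copied from the results table; speedup = baseline / candidate.
SELECT scale,
       round(baseline_ms / partial_ms, 1)  AS partial_speedup,
       round(baseline_ms / covering_ms, 1) AS covering_speedup
FROM (VALUES
        ('100K',  10.2,   6.7,  1.5),
        ('1M',    62.9,  83.0, 14.9),
        ('3M',   185.3, 245.0, 44.8)
     ) AS t(scale, baseline_ms, partial_ms, covering_ms);
```

The covering column comes out to 6.8, 4.2, and 4.1, matching the table. The partial column drops below 1 at 1M and 3M, which is the trap expressed as a single number.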

The winning index, no change to your query:

CREATE INDEX idx_pending_cov
ON orders (created_at DESC)
INCLUDE (id, customer_name, total)
WHERE status = 'pending';

The LLM Gets the Plot Wrong

A naive LLM would read your SQL and recommend an index with total confidence, and it would recommend the partial one, the obvious one. It's the same one I'd have grabbed before stopping there. It's the guy at your party explaining the Nolan plot in full detail and getting it completely backwards, because everything he knows about it comes from a Reddit post.

ExoBench watched the movie three times over, watched the Bitmap Index Scan drop the ordering, watched the sort spill, found the trap, and landed on the index whose plan stays flat.

An LLM guesses. ExoBench measures.

The Limits (read this one)

Everything else here is skippable. The limits are not, because here's where I tell you what ExoBench won't do.

The data is synthetic. ExoBench generated the pending/shipped/delivered split the demo describes, and a benchmark is only as honest as the distribution you give it. The agent guesses your schema when you don't hand it one, and it guesses wrong sometimes, so double check what it used! It benchmarks the query you give it, so a fast version of a bad query is still a bad query. Two runs can take different routes and both be fine. ExoBench leaves your infrastructure alone, so connection pools, memory pressure, and network latency stay your problem. The instances cap at a handful of scales and a few million rows each.
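Since the benchmark is only as honest as its distribution, it helps to know your real one before trusting a synthetic one. A minimal sketch, assuming your production table is also called orders and has a status column; run it against a replica if the table is large:

```sql
-- Share of rows per status value, to compare against the synthetic split.
SELECT status,
       count(*) AS row_count,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM orders
GROUP BY status
ORDER BY row_count DESC;
```

If pending is closer to 50% than 10% in your data, a partial index on it filters out far less, and the demo's numbers will not carry over.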

You still have to verify before you ship. The difference is you now have a number at production scale to verify against, in minutes, while sipping coffee.

Run Your Slowest Query

Postgres users, your candidate is one query away:

SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

Top row is the one. Go to exobench.ai, watch the demo poke this query until it's fast, then paste your own into the same box. Sign in with GitHub to run yours: one click, with the same identity the connector uses. No install, no setup.
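One caveat on the pg_stat_statements query above: the view only exists once the extension is set up, and the timing columns were renamed in PostgreSQL 13. A short sketch of the checks, assuming you have the privileges to create extensions in your database:

```sql
-- The module must be preloaded at server start; expect to see
-- pg_stat_statements in this list. If it is missing, add it to
-- shared_preload_libraries in postgresql.conf and restart the server.
SHOW shared_preload_libraries;

-- Create the view in the current database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- On PostgreSQL 12 and older, the columns are total_time and mean_time
-- instead of total_exec_time and mean_exec_time. Equivalent query:
--   SELECT query, calls, total_time, mean_time
--   FROM pg_stat_statements
--   ORDER BY total_time DESC
--   LIMIT 5;
```

Statistics only accumulate after the extension is loaded, so give it some real traffic before picking your candidate.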

You don't have to understand how your query plan scales. ExoBench makes it fast. You can skim the rest.