The Query Too Simple to Fix

There's a query in every codebase that everyone knows about and nobody fixes. We benchmarked it at five data scales and found a 1,007x improvement hiding behind one composite index.

There's a query in every codebase that everyone knows about and nobody fixes. It's the K-Pop earworm of queries.

You've seen this query. I've seen this query. We've all seen a hundred like it. User activity feed. "Show me this user's last 10 actions." Filter by user, sort by date, limit 10. It's on every profile page, behind every activity log, inside every audit trail.

SELECT id, action, created_at
FROM events
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 10

About 1,000 distinct users. No indexes besides the primary key. At 100K rows it takes 5 milliseconds. Not terrible. At 500K rows it takes 35 milliseconds. At 1M rows, 66 milliseconds. That doesn't sound catastrophic until you multiply it by every user hitting their profile page, every audit trail loading, every dashboard refreshing. You're burning hours of compute on a query that returns 10 rows.

The EXPLAIN ANALYZE output showed a Parallel Seq Scan. Postgres scans the entire table, finds the ~1,000 rows matching user 42, sorts them, and returns 10. The full table, scanned and filtered, for 10 rows.


What Everyone Tries

Rewrite it. The query is already clean. There's nothing to restructure. Filter, sort, limit. The problem isn't the SQL. It's the access path.

Add an index. On what? user_id? That helps the filter but not the sort. created_at? That helps the sort but not the filter. Both? Which order? You're guessing based on intuition and an EXPLAIN output you're 60% sure you're reading correctly.

Ask the DBA. The DBA has 200 other queries to worry about. Their suggestion: "Have you tried adding an index?"

Paste it into ChatGPT. It suggests a composite index on (user_id, created_at DESC). That's probably right. But you won't know until you create the index on staging, load representative data, run EXPLAIN ANALYZE yourself, read the output, and confirm the plan changed the way you expected. The chatbot can't do any of that. It read the SQL syntax and made a reasonable guess. It has no way to prove the guess works, test it at different data volumes, or tell you what happens when your table grows from 100K to 1M rows.

Every one of these approaches has the same flaw: none of them close the loop between "I think this will help" and "here's proof that it did."

ExoBench closes the loop.


What ExoBench Actually Is

You describe your problem in plain English to your AI assistant, Claude, Cursor, ChatGPT, whatever supports MCP. The AI calls ExoBench behind the scenes. ExoBench spins up a real Postgres instance, generates synthetic data at the scales you specified, runs your query with EXPLAIN ANALYZE, and returns real execution plans with real timings. Then the AI reads the results, forms a hypothesis, calls ExoBench again, and iterates.

You don't interact with ExoBench directly. You talk to your AI assistant. ExoBench is the benchmarking engine that gives the AI the ability to prove things instead of guess. If you want the full architecture, here's how it works.

The difference between ExoBench and pasting your query into a chatbot is the difference between suggesting that a bridge can hold weight and actually loading it until it breaks. One is an opinion. The other is an engineering test. Here's the detailed comparison.

I described the events query to my AI assistant and said: "This is slow at scale. About 1,000 distinct users. No indexes besides the primary key. Find me the right index. Test at 10K, 50K, 100K, 250K, and 500K rows."

Here's what happened.


First Benchmark: The Composite Index

The AI called ExoBench twice. Once without any indexes (baseline), once with a composite index on (user_id, created_at DESC).

RowsWithout IndexWith IndexSpeedup
10K0.54 ms0.04 ms14x
50K2.50 ms0.07 ms36x
100K5.12 ms0.04 ms128x
250K25.76 ms0.04 ms644x
500K40.26 ms0.04 ms1,007x

Without the index, Postgres scans the entire table. The cost grows linearly. At 250K rows it even spins up parallel workers to compensate, but it still takes 25ms because it's scanning everything to find ~50 matching rows.

With the index on (user_id, created_at DESC), Postgres jumps directly to user 42's section in the B-tree, reads 10 entries in descending order, and stops. The execution time is flat at 0.04ms regardless of table size. At 500K rows that's a 1,007x improvement.

One benchmark. Real EXPLAIN ANALYZE. Real Postgres. Five scale points in a single request. The AI didn't guess that this index would help. It proved it, and it proved it holds at every data volume.

That scaling table is the thing a chatbot can never give you. Not "this index should help." This index reduced execution time from 40ms to 0.04ms at 500K rows, and here are the plans that prove it.

Problem solved. Ship it.


Second Benchmark: When They Ask for More

Six months later, someone on the product team says: "Can we show the full activity feed? Not just the last 10. The last 1,000 events. With infinite scroll."

LIMIT goes from 10 to 1,000. Your composite index is still there. It still helps. But something changed.

I asked ExoBench: "Same query, same composite index, but LIMIT 1000 instead of 10. Test at 100K, 500K, and 1M rows."

RowsNo IndexComposite IndexImprovement
100K4.65 ms0.24 ms19x
500K35.6 ms2.69 ms13x
1M66.0 ms1.75 ms38x

The composite index is still helping. 66ms down to 1.75ms at 1M rows. That's a 38x improvement. Not bad. But look at what happened to the plan.

At LIMIT 10, Postgres did an Index Scan, read 10 entries, and stopped. Flat. At LIMIT 1000, it switched to a Bitmap Heap Scan. It uses the index to find user 42's rows, but then it goes back to the table to fetch id and action for each of those ~1,000 rows. Each fetch is a random I/O to the heap. The plan shows 939 heap blocks read at 1M rows. That's the bottleneck.

At LIMIT 10, those heap fetches were invisible, 10 random I/Os cost microseconds. At LIMIT 1,000, they're the dominant cost.

The fix: a covering index. Add the columns the query needs to the index with INCLUDE, so Postgres never touches the table at all.

CREATE INDEX idx_events_user_created
ON events (user_id, created_at DESC) INCLUDE (id, action);

I asked ExoBench to benchmark all three: no index, composite, and covering. Same query, LIMIT 1000.

RowsNo IndexComposite IndexCovering Index
100K4.65 ms0.24 ms0.09 ms
500K35.6 ms2.69 ms0.18 ms
1M66.0 ms1.75 ms0.36 ms

Index Only Scan. Zero heap fetches. Postgres reads everything from the B-tree, walks to user 42's section, reads 1,000 entries in order, and never touches the table. At 1M rows the covering index is 5x faster than the composite index and 183x faster than no index.

The composite index was good enough at LIMIT 10. At LIMIT 1,000, it hits a ceiling. The covering index breaks through it.

That's the kind of thing you only discover by benchmarking at the actual LIMIT your users are requesting, not the LIMIT you launched with six months ago. A chatbot would have given you the composite index and stopped. ExoBench showed you where it stops being enough and what to do about it.


Third Benchmark: The Distribution Trap

This is the one that surprised me.

I asked: "Different query, same idea. We have a transactions table. 90% of rows are status='completed', 5% pending, 5% failed. We have an index on status. Why is the same aggregation query so much slower when filtering by 'completed'?"

ExoBench generated data matching that distribution: 500K transactions, 90% completed, 5% pending, 5% failed. Then ran the same query twice, once filtering by 'failed', once by 'completed'.

FilterMatching rowsPlanTime
status = 'failed'~25,000 (5%)Bitmap Heap Scan (index used)22.0 ms
status = 'completed'~450,000 (90%)Parallel Seq Scan (index ignored)253.3 ms

Same query. Same index. 11.5x performance difference. The planner knows the statistics. When only 5% of rows match, the index is selective enough that Postgres uses it. When 90% match, the index is nearly useless, so Postgres ignores it and scans the entire table sequentially.

A chatbot would have said "add an index on status." ExoBench shows you that the index only helps for the minority values. For the majority value, you need a different strategy entirely.

I never would have tested this manually. Setting up two different data distributions, loading them, running the query, capturing both plans, comparing, that's an afternoon of work per distribution. ExoBench did it in one benchmark request because I told it "90% completed, 5% pending, 5% failed" and it generated the data to match.

This is what I mean by "proving, not guessing." The chatbot would have given me an index that works for 5% of my queries and doesn't help for the other 95%. ExoBench showed me that distinction before I deployed anything.


Where It Doesn't Help

Now here's where I tell you what it can't do. This part matters.

Synthetic data is not production data. ExoBench generates data matching the distributions you specify. But it can't capture every correlation, every seasonal pattern, every edge case in your real dataset. If you say "90% completed," it generates 90% completed. If your real data is 90% completed in Q4 but 60% in Q1, and you don't mention that, the benchmark won't reflect it. The benchmark is as realistic as the information you give it.

The AI infers things. It can be wrong. If you give ExoBench just a query without a schema, the AI will infer a plausible schema. It might add indexes you don't have, omit ones you do, or use different column types. The less you tell it, the more it guesses. Always check: "Show me the exact schema you used." If it's wrong, correct it and re-run. ExoBench benchmarks are cheap. Run many of them.

ExoBench optimizes what you give it. That 66ms events query? Maybe it shouldn't exist. Maybe the activity feed should be pre-computed and updated incrementally instead of queried live on every page load. ExoBench won't tell you that. It doesn't know your business logic. It takes the query you give it and finds the fastest way to run it. If the query is solving the wrong problem, you get a very fast wrong answer.

It's probabilistic. This is an LLM calling a benchmarking service. Run it twice and it might explore a different optimization path. Both paths might be good. Neither is guaranteed to be globally optimal. ExoBench finds a better point in the optimization space, not necessarily the best point.

It doesn't fix your infrastructure. If the bottleneck is network latency between your application server and the database, or insufficient memory for Postgres to hold its working set, or your connection pool maxing out under load, ExoBench can't help. It benchmarks query execution on an ephemeral Postgres instance. Not your deployment topology.

Ephemeral databases have scale limits. ExoBench enforces a maximum of 5 scale points per request and 2 million rows per scale point. Most query plans fully stabilize by 300K-500K rows, so this rarely matters. But if your production table has 50 million rows with complex partitioning, ExoBench can show you the trend, not the exact production behavior.

These are real limits. Some are fundamental. Some are engineering problems we're actively working on. I'd rather you know them before you start than discover them after you've formed expectations.


Try It

You already have Claude, or Cursor, or whatever MCP-compatible client you use.

ExoBench plugs in as an MCP server. You don't install a new application. You don't learn a new interface. You connect ExoBench, paste your query, describe your schema, tell the AI your row counts and any data distribution that matters. The AI does the rest.

Not sure which query to start with? Run this:

SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

The top result is your candidate. That's the query ExoBench would start with. You probably already know which one it is. Now you have the number to prove it.

Paste that query into your AI assistant. Tell it your schema. Say "benchmark this at 10K, 100K, and 500K rows." See what happens.

I managed thousands of queries like this at a Fortune 500. If ExoBench had existed then, I'd have spent a lot less time staring at EXPLAIN ANALYZE output at 2AM and a lot more time doing literally anything else. The suffering built character, sure. But I'd trade the character for the sleep.