Why is my Whale Query 1000x Slower in Prod?

Ten enterprise customers own 80% of the orders. On uniform test data the query runs in ~1ms; on that real skew it runs ~1,000x slower. The fix doesn't transfer between engines: Postgres is rescued by one free session setting, SQL Server can't be touched by it and needs a completely different access-path fix. Measured on Neon and Azure SQL Database using ExoBench.

May 26, 2026

Summary

A skewed foreign key, where 10 enterprise "whale" customers own 80% of 500,000 orders, makes the same three-table join run ~1,000x slower than the identical query on uniform test data: ~1ms becomes 1,572ms on Postgres and 2,340ms on SQL Server, measured on Neon and Azure SQL Database using ExoBench. Both planners underestimate the top join by ~3,000x for the identical reason: a single-column histogram can't see join-crossing correlation. But the fix does not transfer. Postgres drops to ~310ms with one free session setting (SET enable_nestloop = off) and no schema change. On SQL Server that entire class of join hints can't break a ~2,000ms floor, because its bottleneck is a 720K-row key lookup in the access path, not the join algorithm. SQL Server only reaches ~330ms by combining three things at once: a covering index, batch mode, and a hash join. Even the textbook structural fix, splitting the skewed table so the optimizer's statistics are honest, lands at 193ms on SQL Server but only 391ms on Postgres, where it loses to the free one-liner. Every database hates skew, and each hates it differently.

The harmless whale query that Will Eat You

Here's the entire setup. one prompt to my AI assistant:

I have an e-commerce schema where a small number of "enterprise" customers generate most of the order volume. I want to benchmark how Postgres and SQL Server handle this query at scale.

-- Schema
CREATE TABLE customers (id INT PRIMARY KEY, name TEXT, tier TEXT);
CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT, total NUMERIC);
CREATE TABLE order_items (id INT PRIMARY KEY, order_id INT, quantity INT, unit_price NUMERIC);

Distribution: 10 customers are enterprise, rest split among free/starter/pro. 80% of all orders belong to the 10 enterprise customers. Order items spread uniformly across orders.

Query:

SELECT c.tier, COUNT(*), SUM(oi.quantity * oi.unit_price)
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE c.tier = 'enterprise'
GROUP BY c.tier

Benchmark this on both Postgres and SQL Server at 50K customers, 500K orders, 900K items. Compare the estimated vs actual rows at every plan node. I want to understand how the planners handle skewed FK fan-out.

That's the whole input. Three table declarations, one query, one sentence describing the skew, one row-count target. No INSERT scripts, no choice of indexes, no DDL for either engine, no plan-hint guidance, no "try a hash join first." Everything else, generating 500,000 orders at exactly an 80/20 fan-out, picking indexes, materializing the schema on Postgres and SQL Server, capturing the plans, was filled in by the AI calling ExoBench behind the scenes.

Nothing exotic about the query. Three tables, two joins, one GROUP BY. The kind of query you'd write without thinking and expect to be boring. On uniform test data it runs in about a millisecond, so you ship it.

Then it runs ~1,000x slower in production, because production has whales. Those 10 enterprise customers own 80% of the orders and the other 49,990 customers split the remaining 20%. That single fact, that a handful of rows in customers get joined-to forty thousand times each while everyone else is joined-to a handful of times, is one of the hardest things in all of database performance for a planner to get right. Every database hates this. And here's the part that should bother you: each one hates it differently, so the fix that rescues one engine does literally nothing on the other.

On Postgres, the whale query runs at 1,572ms and one free session setting drops it to ~310ms, a 5x win with zero schema change. On SQL Server, that exact setting, and a dozen of its cousins, can't move the needle off a 2,000ms floor at all. SQL Server needs a totally separate fix involving three different levers stacked together to reach ~330ms. The AI confidently predicted symmetry, one fix for both engines. The AI was wrong. I was wrong! ExoBench sorted out both of our egos!

(Already sold and want to run this on your own skewed table? Get started here. Otherwise, here's what I found.)

I didn't write a line of benchmark code for any of this. ExoBench is an MCP server I built which spins up real Postgres and Azure SQL instances, generates the synthetic data at the distribution you ask for, runs the query, and hands back the actual execution plans. Every number below is QueryTimeStats.ElapsedTime from the SQL Server plan XML or the root-operator actual time from a Postgres EXPLAIN ANALYZE. Nothing here is guessed from documentation. I'll quote my follow-up prompts inline, in boxes like the one above. they are verbatim, typos preserved, so you can see the exact shape of each request that drove the next run.

First, the control: what if the data were uniform?

Before blaming skew, you have to rule out the query and the schema. So the first thing I did after seeing the baseline numbers was run the boring counterfactual on my own. take the same schema, same indexes, same row counts, but generate orders.customer_id uniformly across all 50,000 customers instead of piling 80% onto ten of them. Then run the identical query. Same ExoBench call, just a different data generator clause.

Distribution	Enterprise share of orders	Rows through top join	Postgres	SQL Server
Uniform (naive test data)	~0.02%	~178	1.4 ms	1 ms
Skewed (production reality)	80%	~720,000	1,572 ms	2,340 ms

Read that top row. With uniform data the query is instant on both engines, the plan is a clean cascading nested loop, and the optimizer's cardinality estimate is accurate to within ~2x. Everything looks fine. If you benchmarked this on Faker-generated test data, you'd ship the nested-loop plan, close the ticket, and never know there was a problem. The naive benchmark doesn't just miss the issue, it produces a confident all-clear and then production runs the identical plan at 2,340ms.

The plan is the same. The query is the same. The indexes are the same. The row counts in every table are the same. The only thing that changed is which customers the orders point at, and that single change inverts the conclusion by a factor of over a thousand. This is why test-data distribution is not a detail you can hand-wave. It's the whole experiment.

So the skew is the villain. Confirmed. Now, why does it wreck both planners, and why does the cure split?

Why both planners walk into the same wall

Here's the diagnosis, and it's the same on both engines.

Both planners look at the histogram on orders.customer_id and see 500,000 orders spread across ~50,000 distinct customers. Average: ~10 orders per customer. So when the query feeds in the 10 enterprise customer IDs, the planner reasons "10 customers times ~10-15 orders each, call it 150 rows" and picks a cascading nested loop, which is genuinely the right plan for 150 rows.

Reality: those 10 specific customer IDs aren't average. They own 400,000 orders between them, ~40,000 each. The planner used the global average density where the actual joining keys sit in the extreme tail of the distribution.

Plan node	Postgres est → actual	SQL Server est → actual
Customer filter (`tier = 'enterprise'`)	13 → 10 (fine)	10 → 10 (exact)
Orders join (per enterprise customer)	75 → 39,993	16.5 → 399,669
Top join into `order_items`	234 → 720,127 (~3,000x under)	298 → 719,417 (~2,400x under)

The cardinality-estimate gap widens with scale

Top-join row count at each scale point. Actual rows track the workload; estimated rows stay flat. Log scale.

Postgres

Actual rows Estimated rows

SQL Server

Actual rows Estimated rows

Same skew (80% on 10 customers), three scale points, two engines. The red lines track the workload, growing 10× as ORDERS grows 10×. The gray lines are what each planner thought would happen, and they barely move. On Postgres the estimate actually shrinks at the largest scale (18 vs 144 at the middle scale), because more customers in the histogram dilutes the per-key density average further. The gap is not a fixed bug — it grows monotonically with dataset size.

The leaf scans are accurate. The joins explode. That signature, accurate leaves and exploding joins, is the fingerprint of join-crossing correlation, and it's distinct from a stale-statistics problem where the leaf would be wrong too. Neither engine's statistics model the correlation between a customer's tier and its order count, because that correlation lives across a join boundary. A single-column histogram on orders.customer_id is averaged over all 50,000 customers, so the 10 whales get smeared into the same per-key density as the 49,990 minnows. The estimator isn't buggy. It's faithfully extrapolating from a statistic that physically cannot express the thing you need it to know.

This is not a tuning problem. It's structural, and it's present in essentially every cost-based optimizer that relies on per-table column statistics. Postgres extended statistics (CREATE STATISTICS) won't save you because the correlation is across the join, not within one table. SQL Server filtered statistics have the same ceiling. Both engines miss by roughly the same 2,400–3,000x and pick the same fragile plan.

So far, perfectly symmetric. Same query, same data, same blind spot, same wrong plan, nearly the same slowdown.

Postgres: one free line, 5x faster

One more thing before we go further. The AI is about to confidently predict the wrong outcome, and the only reason you'll see the correction is that a real engine produced it. That's the difference between this and pasting your query into a chatbot: a chatbot predicts what the planner will do, ExoBench shows you what it actually did. Guessing versus measuring. If you want the full architecture, here's how it works, and here's how it compares to chatbots and other tools.

After the baseline runs above, the AI gave me a tidy diagnosis: same blind spot in both planners, same fix on both. force a hash join on each engine and the nested-loop catastrophe goes away. It offered to run the counterfactual and prove it. I didn't buy it. You can't just poke sanctioned join-plans with a stick and get real improvements... can you!?

I find this hard to believe! Run the couterfactual to confirm what you're saying is actually true! (BOTH Postgres and SQL Server)

On Postgres, the AI was right, and it's decisive:

Postgres	Nested loop (default)	Hash join (`SET enable_nestloop = off`)
Execution time	1,572 ms	309 ms
Buffer hits	~1,945,000	~8,500
Speedup	baseline	5.1x

One session-level setting. No schema change, no new index, no rewrite. The planner still estimated ~90 rows out of a join that produced 720,000, so the estimate was just as wrong, but the hash plan doesn't care. It scans order_items once sequentially across a few parallel workers and builds a hash table, instead of doing 720,000 random index probes. The nested-loop plan's cost scales with the number of probes, so a 3,000x underestimate is catastrophic. The hash plan's cost is dominated by a single scan, so the outrageous row-estimate doesn't do anything. Buffer traffic collapsed 230x. The lever was the join strategy, and forbidding the bad algorithm cost exactly one line.

Postgres has seven index types and you didn't need any of them here. You needed to revoke one bad decision.

Okay, neat. Same fix on SQL Server and we're done, right?

SQL Server: The Quagmire where every Hint Hits a Wall

No. SQL Server did the opposite of what the AI predicted, my skepticism turned out to be earned. I'm going to summarize what happened because if I don't, it will literally hit somebody's character limit!

SQL Server	Nested loop (default)	Hash join (forced, naive)
Elapsed time	2,340 ms	5,266 ms
Result	the floor	2.25x slower

The forced hash plan was more than twice as bad as the original disaster. Two things sank it. First, the hash join spilled to tempdb, because SQL Server granted the build only ~4.3MB of memory, sized off that same 179-row underestimate, while the build side actually carried ~400K rows. The bad estimate didn't just mislead the plan choice, it poisoned the memory grant. Second, the forced hash plan still contained the 720K-iteration key lookup anyway, because the hint forced the joins to hash but the optimizer still chose to fetch quantity and unit_price through a per-row lookup off the non-covering index. The hint forced away the nested-loop joins and left the single most expensive nested-loop-shaped operation in place, then stacked a tempdb spill on top. Worst of both worlds.

The AI's mental model was Postgres-shaped and it had jammed that shape onto a different engine. Now I had my proof: the diagnosis was wrong, at least for SQL Server. So we went scorched-earth. Across roughly eight rounds I threw the entire no-index hint arsenal at it, and the prompts got progressively less polite. Verbatim:

Hold on! You're saying the only way to fix it in SQL Server is a giant covering index? Can't you just force the right plan like you did with Postgres?

That's not good enough! Try the OTHER hints!

Just try ALL of them and see what works!

Is there NOTHING else to do for SQL Server other than a covering index???

Try Batch mode on rowstore

What can I try WITHOUT CHANGING INDEX STRUCTURE??

Try: Rewrite the query to pre-aggregate

Try FORCE_LEGACY_CARDINALITY_ESTIMATION / OPTION (FORCE ORDER) / OPTION (HASH JOIN, MERGE JOIN). Leave no stone unturned!!!!

This is the part that would have eaten three days by hand: spin up a 1.4-million-row dataset, build the indexes, run the query, capture the plan XML, read it, form the next hypothesis, tear it down, rebuild. Per attempt. Times ten. Through ExoBench it was ten prompts in an afternoon, each one a real Azure SQL instance with a real plan.

Lever tried	Elapsed	Verdict
Baseline (nested loop)	2,317 ms	the floor to beat
Forced HASH JOIN (naive)	~5,266 ms	worse, spilled to tempdb
HASH JOIN + MIN_GRANT_PERCENT=25	2,121 ms	best hint, ~8%
MERGE JOIN	2,099 ms	tie, swapped lookup for a spilling sort
DISABLE_OPTIMIZER_ROWGOAL	2,317 ms	identical to baseline
ALLOW_BATCH_MODE + HASH + MIN_GRANT	2,007 ms	best no-index result, still > 2,000
FORCE_LEGACY_CARDINALITY_ESTIMATION	2,327 ms	lookup remains
FORCE ORDER	2,316 ms	lookup remains
Pre-aggregation rewrite	2,651 ms	worse, optimizer timed out, went serial

SQL Server: every no-index hint hits the same wall

Elapsed time (ms) per attempt. Red line marks the 2,000ms floor no hint could cross.

The entire no-index option space spans 2,007ms to 2,651ms. The best result that left the access path alone was 2,007ms, a 13% improvement that still couldn't cross the 2,000ms line. Every single lever hit the same wall, because every one of them still had to perform the same ~720,000-row, ~2-million-logical-read fetch to retrieve quantity and unit_price. No join hint, no estimator override, no join order, no memory grant tweak, no batch-mode toggle, and no query rewrite removes that fetch. It's baked into the access path.

That's the asymmetry in one sentence: Postgres's bottleneck was a join-algorithm choice you can revoke for free, and SQL Server's was an access-path choice no hint can reach. Same skew. Same bad estimate. Completely different binding constraint.

SQL Server: the breakthrough needs three levers at once

After Prompt 10 we had exhausted everything that left the access path alone. So I gave up on the hint arsenal and let the optimizer have what it had been quietly asking for the whole time:

Okay. Let's use the covering indexes and see where we end up.

The fix is one index. Carry the two summed columns on the index leaf so the key lookup simply doesn't exist:

CREATE INDEX idx_items_order_covering
  ON order_items(order_id) INCLUDE (quantity, unit_price);

No join hints. Let the planner choose freely.

SQL Server approach	Elapsed	What dominates
Baseline (nested loop)	2,340 ms	the 720S key lookup
Best no-index hint (batch + grant hash)	2,007 ms	still the 720S key lookup
Covering index alone (serial nested loop)	1,153 ms	serial nested-loop probes
Covering + batch mode + hash join	330 ms	nothing, single covered scan + batch hash, ~3K reads

The covering index alone got it to 1,153ms, half the baseline, by killing the lookup. But it left the engine on a serial nested-loop plan doing 1.2 million reads through the index. The real result needed all three levers stacked:

... OPTION (USE HINT('ALLOW_BATCH_MODE'), HASH JOIN, MIN_GRANT_PERCENT = 15)

That's the interesting part of the SQL Server story, and it's a genuine contrast with Postgres. Postgres needed exactly one lever and the second-best fix (the covering index) was worse for it. SQL Server needed three levers that unlock each other: the covering index makes a batch-mode scan viable, batch mode makes the covering index pay off fully, and the hash join consumes the result without a per-row probe. Neither alone got there. Covering-alone stalled at 1,153ms on a serial plan. Batch-mode-alone stalled at ~2,041ms on the lookup. Together they collapse the ~2-million-read fetch into a single covered batch-mode scan at ~3,022 reads, a 685x reduction in logical reads, and the whole query lands at 330ms. Without ExoBench, I shudder at how much time it would have taken to reach this absurdly specific combination of levers.

So the two engines reach roughly the same finish line, ~310ms versus ~330ms, by entirely different routes. Postgres: change the join strategy, one free setting. SQL Server: change the access path, an index plus two hints working in concert. The thing that helped SQL Server most (the covering index) was the weakest fix for Postgres, which only got 1.25x from it because the planner stubbornly kept the nested loop. "Bad estimate" does not imply one universal fix. The right lever depends on the engine's default plan shape and what its alternative actually costs.

Even the "proper" fix Does Not Transfer

There's a textbook answer to skew that I haven't mentioned yet, and a good engineer is already shouting it: stop lying to the optimizer. Don't hint, don't index around the problem, fix the cause. Split the high-volume rows into their own table so the statistics are honest. The enterprise customers get enterprise_orders, everyone else gets regular_orders, and now the histogram on enterprise_orders.customer_id describes only the ten whales, so its average really is ~40,000 orders per customer instead of being drowned by 49,990 minnows.

This is the deepest and most correct fix in the whole investigation. It attacks the cause instead of a symptom. And it also fails to transfer cleanly, which is the point I want to leave you with.

So I ran it. Both engines. Split orders into enterprise_orders and regular_orders so the histogram on each table describes one population, then UNION ALL the two halves back for the full result. Same query semantics, honest statistics. Is it actually better than the hints, or just different?

One thing to get out of the way first: splitting customers alone does nothing on either engine (2,302ms SQL Server, 1,613ms Postgres). The skew never lived in customers. The WHERE tier = 'enterprise' filter already estimated those 10 rows correctly in the baseline. The skew lives in the fan-out of orders.customer_id, so only splitting orders can fix it. Aim at the table where the disease actually is, not the one it appears to describe.

Splitting orders works, but here is the full scoreboard, and the two columns do not agree:

Approach	SQL Server	Postgres	Schema change?
Baseline (single tables, skew)	2,340 ms	1,572 ms	—
Free one-liner (`enable_nestloop = off`)	2,041 ms (floor)	321 ms	none
Best access-path fix (covering + batch + hash)	330 ms	n/a	covering index
The "Proper Solution" (split tables + UNION ALL)	193 ms	391 ms	two tables + union

Same four fixes, opposite winners

Elapsed time (ms) for each fix, on each engine. Best result per engine in green.

Postgres

SQL Server

The rank order of these four bars is different on each engine. On Postgres the free one-liner wins by 70ms over the textbook table split. On SQL Server the table split wins by 137ms over the best access-path fix, and the free one-liner doesn't even crack the 2,000ms floor.

Read each column separately, because forcing them into one verdict is exactly the mistake.

On SQL Server the split wins decisively. 193ms, faster than the 330ms covering-index result, and it needs no covering index. The reason is the same one that defined the SQL Server quagmire: its bottleneck was the 720K-row key lookup, and honest per-table statistics let the optimizer cost a plain batch-mode scan correctly and drop the lookup entirely. When the top-join estimate goes from 3,000x wrong to 900,001-vs-900,000 (essentially perfect), the engine derives a beautiful parallel batch-mode hash plan on its own, with no hints at all.

On Postgres the split loses (by a bit). The fair UNION ALL reconstruction is 391ms, which is 22% slower than the free SET enable_nestloop = off (321ms) and it costs you a full schema redesign: two order tables, a UNION ALL on every cross-tier query, write routing for new orders, and physically moving rows when a customer changes tier. Postgres never had the deep problem. Its only mistake was picking a nested loop over a hash join, and you can revoke that for free. Reaching for a table split here is cargo-culting a SQL-Server-shaped solution onto an engine that didn't need it.

The mechanism is identical on both engines. Honest statistics produce an honest estimate, and an honest estimate produces a good plan. (And the UNION ALL doesn't reintroduce the blowup, because the optimizer estimates each branch separately and both branches now have clean single-class statistics.) But "fix the cause" is only the best fix when the bad estimate was the binding constraint. On SQL Server it was, so the cause-fix is supreme. On Postgres the bad estimate only caused a recoverable algorithm choice, so the cheap symptom-fix dominates the expensive cause-fix. Same skew, same correct theory, opposite economics.

The honest wall: where this stops

Skew is genuinely one of the hardest things a database has to get right, and I'd rather you know the edges of what I measured than discover them in production.

Synthetic data is not your data. ExoBench generated exactly 10 enterprise customers owning exactly 80% of orders. If your real skew is 95% on 3 accounts, or 60% on 200, the multipliers move and possibly the winning plan with them. The shape of the asymmetry (Postgres wants a join-strategy fix, SQL Server wants an access-path fix) is robust, but the exact crossover is mine, not yours. Hand the AI your real distribution and re-run it.
The AI fills in everything you didn't specify, and it can fill in wrong. I gave it the schema and the 80/20 skew. it picked the row counts, the indexes, the data generators, the distributions for name and unit_price, the order of INSERT statements, and which EXPLAIN mode to use. Always ask it "show me the schema and the data shape you actually used." If the shape is wrong, the benchmark is confidently wrong, and a confidently wrong benchmark is worse than no benchmark.
These are two different managed platforms, not one box. Postgres ran on Neon, SQL Server on Azure SQL Database, comparable hardware but not identical. So read the within-engine before-and-after as solid, and treat the cross-engine absolute times as a reference point, not a head-to-head verdict. The asymmetry in which fix works is the finding, not "engine A beats engine B by Xms."
It's probabilistic and it optimizes what you give it. Two runs may explore different hint orders. And if the real answer is "this query shouldn't run on every page load," ExoBench will happily handraulic it down to 330ms and never tell you to cache it instead. It measures the query you brought. It doesn't ask whether the query should exist.
It doesn't fix infrastructure. Memory grants, connection pools, concurrent load, network latency, deployment topology. ExoBench measures the query and the plan, not your cluster.

What it does fix is the part where you take the optimization advice that rescued your Postgres database, apply it to your SQL Server database, and quietly get nothing, because the two engines failed for the same reason and recover for completely different ones, and nobody told you that because nobody measured both.

Try this on your own skewed table

Find your candidate. The aggregate over a tiered or fanned-out table that's slower than it has any right to be:

-- Postgres
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;

If the top result joins across a table where a few keys dominate the volume, customers and orders, accounts and events, tenants and rows, it's a skew question, and now you can settle it on each engine instead of guessing on both.

This is the exact question I'd point at your table. Head to exobench.ai to connect it to whatever you already talk to, Claude, Cursor, whatever, and paste a prompt shaped like the ones I used:

Here's a slow query and its schema [paste yours]. Around 20% of the keys own 80% of the volume. Use ExoBench to benchmark it at scale on both Postgres and SQL Server, show me the estimated-vs-actual rows at every join node, then find the fastest fix on each engine separately. They could be different.

That last sentence is the one the folklore can't answer and you now can. The AI translates the whole thing into ExoBench tool calls, spins up both engines, and you read the plans side by side. Get started with ExoBench and your first benchmarks are free. No preview application, no waiting for approval.

That whale query that's been quietly costing you two seconds and a different headache on each database? You don't have to guess which lever fixes which engine. You can know in about 90 seconds, twice.

Appendix: The full per-engine scoreboards

Every figure below is QueryTimeStats.ElapsedTime from the SQL Server plan XML or the root-operator actual time from a Postgres EXPLAIN ANALYZE. runtimeDurationMs (which includes compile and network) is not used for any comparison. All runs at 50,000 customers, 500,000 orders (400K enterprise / 100K regular), 900,000 order_items, executed live via ExoBench on Neon (Postgres) and Azure SQL Database (SQL Server).

Postgres

Approach	Actual time	vs baseline	What the planner did
Baseline (nested loop)	1,572 ms	—	cascading nested loops, ~1.94M buffer hits, top join est 234 vs actual 720,127
Covering index `INCLUDE`	1,261 ms	1.25x	kept the nested loop, made each probe an Index-Only Scan (`Heap Fetches: 0`)
Forced hash (`enable_nestloop=off`)	309–321 ms	~5.1x	single hash build over `order_items`, one scan, no per-row probe
Split orders, full UNION ALL	391 ms	~4x	Parallel Append + Parallel Hash Join, one scan of `order_items`

The Postgres lesson: the winning lever was the join strategy, not the access path. The covering index, which was the breakthrough for SQL Server, was the weakest fix here, and the structural split is dominated by a free one-line setting.

SQL Server

Approach	Elapsed	What dominates
Baseline (nested loop)	2,340 ms	the 720K-row key lookup, ~2.07M logical reads
Best no-index hint (ALLOW_BATCH_MODE + HASH + MIN_GRANT)	2,007 ms	the same key lookup
Columnstore, no hints	1,479 ms	scan cheap (415ms CPU), top hash spilled to tempdb
Covering index alone (serial nested loop)	1,153 ms	serial nested-loop probes, 1.2M reads
Covering + batch mode + hash	330 ms	nothing, single covered batch scan (~3,022 reads) + batch hash
Split orders, full UNION ALL	193 ms	nothing, batch-mode Concatenation + Hash Match, no lookup, honest estimate

The SQL Server lesson: the bottleneck was always the access-path fetch, never the join algorithm. Only a structural change (covering index or honest statistics) removed it. The covering index needs batch mode and a hash join stacked on top of it to reach its best result, and even then the honest-statistics table split beats it by 1.7x.

Appendix: Why the uniform-data control matters

The naive uniform run is the single most important control in this whole investigation, because it proves the skew (not the query, not the schema, not the indexes) causes the catastrophe.

Metric	Uniform	Skewed	Ratio
Enterprise share of orders	~0.02%	80%	—
Rows through top join	~178	~720,127	~4,000x
Postgres exec time	1.4 ms	1,572 ms	~1,120x
SQL Server elapsed	1 ms	2,340 ms	~2,340x
SQL Server fetch logical reads	457	~2,071,690	~4,500x
Top-join estimate error	~2x	~3,000x	—

Same plan, same query, same indexes, same row counts in every table. The only difference is which customers the orders point at. Under uniform data the optimizer's estimate is correct (~2x), so nothing looks wrong and the nested-loop plan is genuinely optimal for the 178 rows it produces. The plan choice isn't a bug. It's the right plan for the data the planner thinks exists. Skew is what turns the right-looking plan into a multi-second disaster, and a benchmark built on Faker-style uniform data would have told you everything was fine while actively hiding the problem and making every fix in this post look pointless. That's the trap, and it's why ExoBench generates data at the distribution you specify rather than the distribution that happens to be easy.