The Curious Columnstore of SQL Server

Clustered columnstore beats rowstore on inserts and bulk updates. The penalty is dead-row bloat that wrecks reads until you REORGANIZE.

May 21, 2026

Summary

SQL Server clustered columnstore does not have bad write performance, measured on Azure SQL Database using ExoBench. Bulk inserting 1M rows ran faster on columnstore than rowstore (3.3s vs 6.1s). Trickle-inserting 1,000 single rows tied (944ms vs 1038ms). A single bulk UPDATE of 100K rows also beat rowstore. The real penalty is deferred: UPDATEs and DELETEs leave dead rows that disable batch-mode aggregate pushdown, turning a sub-millisecond GROUP BY into 768ms until you run ALTER INDEX ... REORGANIZE. Measured at 1M-3M rows on Azure SQL Database.

ExoBench Speaks SQL Server

ExoBench can speak SQL Server now. It already measured Postgres, as of this week it measures the DB your enterprise actually runs.

I never wrote a line of benchmark code for this post. I talked to my AI assistant, it called ExoBench behind the scenes, and ExoBench spun up real Azure SQL instances and measured. So the honest way to show you what this does is to show you the prompts. I'll quote my actual asks inline, in boxes like the ones below, so you can see the exact shape of the request that drove each piece.

And the first thing I asked broke my brain a little. Not with some exotic query. with the most boring query in the world.

*I have an SQL Server table that looks like this:*

SELECT category_id, COUNT(*), AVG(amount), SUM(amount), MAX(amount) FROM sales GROUP BY category_id;

The schema looks roughly like this:

CREATE TABLE sales (
    id          BIGINT IDENTITY(1,1) PRIMARY KEY,
    category_id INT NOT NULL,
    amount      DECIMAL(12,2) NOT NULL,
    created_at  DATETIME2 NOT NULL
);

Use ExoBench to figure out why my query is slow and optimize it.

That's it. That's the whole ask. A GROUP BY. count, average, sum, max, grouped by category, over a sales table. The kind of query every billing system on earth has a version of.

Fifty categories, three million rows. On a normal table with a normal clustered primary key, SQL Server scans the whole thing and aggregates it. ExoBench measured the read at 433ms. Fine. Expected. That's a full scan of a wide table feeding a hash aggregate, and there's no WHERE clause, so every row gets touched no matter what you do.

Then, on its own, the AI tried the thing the query was begging for. it put a clustered columnstore index on the table and ran the exact same query again.

Rows	Rowstore (clustered PK)	Clustered columnstore	Postgres
1M	~50 ms	sub-millisecond	109 ms
3M	433 ms	~14 ms	312 ms

Wait, what?

That's a 30x gap at 3M rows, and it gets wider as the table grows. The rowstore read scales with the data. Postgres, doing its level best with a parallel scan, scales with it too. The columnstore read barely moves. Same query. Same data shape. The columnstore is in a different league, and it isn't close.

And notice I've got Postgres numbers sitting right there too. that took exactly one prompt:

*That's a big difference. How does Postgres do for this kind of query and index?*

That's the whole point of measuring through an AI that can call ExoBench. one sentence and it spun up a Postgres instance, ran the same query shape, and handed back numbers. Postgres has no columnstore cell because it has no columnstore. it's pure rowstore in core, so that single column is the only story it has to tell here. (Small print: near-identical data, different hardware, Neon vs Azure SQL. A reference point, not a head-to-head. For the exact differences, see Appendix: Postgres vs SQL Server Reference Data.)

Here's the query plan that produced the 14ms, the part that matters:

Clustered Index Scan (ColumnStore)
  ActualLocallyAggregatedRows = 3000000
  SegmentReads = 1, SegmentSkips = 0
  ActualExecutionMode = Batch

Read that middle line again. ActualLocallyAggregatedRows = 3000000. All three million rows were aggregated inside the scan operator, in batch mode, before a single row bubbled up to the aggregate above it. The aggregate node got handed a nearly-finished answer. The work happened during the read. That's the whole trick, and it's why the time barely changes between 1M and 3M rows.

A quick note on the numbers in this post. Every one of them came out of ExoBench: I described a schema, it spun up a real Azure SQL instance, generated synthetic data at the scale I asked for, ran the query, and handed back the actual execution plan. The timings are pulled from the plan XML (QueryTimeStats ElapsedTime), not wall-clock, so compile time and network don't pollute them. Nothing here is from documentation. If a number looks surprising, it's because I ran it, not because I read it somewhere.

So, briefly, what is this thing. A columnstore index stores a table column-by-column in compressed segments instead of row-by-row in pages. For an analytic query that reads two columns out of four and aggregates millions of rows, that's exactly the right shape: SQL Server reads only the columns you touched, decompresses them in bulk, and crunches them in batch mode. In SQL Server there are two flavors. A clustered columnstore is the table. the whole thing is stored columnar, no separate heap. A nonclustered columnstore sits alongside a normal rowstore table as a secondary index, so you keep your B-tree for point lookups and add columnar storage for analytics on top. Same engine underneath. The clustered version is the one people have opinions about.

If you've never internalized the row-vs-column thing and you want it in plain English with diagrams that actually help, Nikola Ilic's Rows or Columns. where should I put my index on? is the best explainer I've found. it's SQL-Server-specific, walks through real queries on a real table, and shows you the memory-footprint and segment-skipping diagrams that make the whole thing click.

Here's my one-line version, and I'll own it as an opinion: columnstore is the one that actually scales. When you think columnstore, think Snowflake, Redshift, Parquet, Delta. all the things built to chew through a billion rows. When you think rowstore, think Ye Olde Database, perfect for fetching one customer's order and quietly miserable at aggregating a hundred million of them. I'm sure some RDBMS priest is already drafting an email with three academic papers proving the distinction is more nuanced than that. It is more nuanced than that. I'm still not budging.

The thing people say without measuring anything

Here's what you hear the moment you mention clustered columnstore: "Yeah, but the write performance is horrible."

Okay. At what scale? Inserting how many rows? In what shape, bulk load or single-row trickle? Updating or just inserting? Concurrent or serial? On what hardware? A statement about performance without these facts is no statement at all!

ExoBench's whole reason for existence is to give you an answer to these questions for a near-zero amount of effort. ExoBench shows you the planner's behavior changes at every scale, because this is not something you can reason about it by just staring at the query. So let's point it at the folklore and see where, if anywhere, the folklore holds up.

What ExoBench does

Here's the setup. ExoBench is an MCP server (MCP is just a protocol that lets AI tools call external servers), and your AI assistant is the client. So you describe your problem in plain English. "this query is slow, figure out why" and the AI does all the API wrangling for you. It writes the schema, picks the scale points, fires the benchmark, reads the execution plan that comes back, forms a hypothesis, and runs the next benchmark to test it. You never touch the tool calls. You watch the AI drive and read the results.

ExoBench then spins up a real, ephemeral database instance. For SQL Server that's Azure, for Postgres it's AWS. It generates synthetic data at the scale points you specify, runs your actual query, captures the real execution plan, and returns it. It does not connect to your database. Your production data never leaves your systems. You hand it a schema and a shape, it builds a disposable world matching that shape and measures what happens in it.

The difference between this and pasting your query into a chatbot is the difference between guessing and measuring. A chatbot reads your SQL and predicts what the planner will do. ExoBench runs the SQL on a real engine and shows you what the planner actually did. The chatbot cannot close the loop between "I think this will help" and "here is proof that it did." ExoBench closes the loop, and now it closes it on SQL Server too.

If you want the full architecture, here's how it works. And here's how it compares to chatbots and other tools, the other ways people try to do this.

First stop: the inserts, where the folklore is just wrong

So I pushed back at it with the thing everyone says:

Big difference. But I've heard clustered columnstore has horrible write performance. Use ExoBench to actually test that. inserts and updates, bulk and trickle, against a rowstore baseline. I want to know where the folklore is right and where it's wrong.

That one prompt drove everything from here down. Start with the simplest write there is. Bulk-load a million rows into each storage type, time it server-side with SET STATISTICS TIME ON, and compare. Here's the schema ExoBench built. ordinary sales table, the kind every billing system on earth has a version of.

CREATE TABLE sales (
    id          BIGINT IDENTITY(1,1) PRIMARY KEY,
    category_id INT NOT NULL,
    amount      DECIMAL(12,2) NOT NULL,
    created_at  DATETIME2 NOT NULL
);
-- example row:
-- id: 81923, category_id: 37, amount: 412.55, created_at: 2026-04-30 09:14:22

Bulk insert of 1M rows:

Storage	Bulk insert (1M rows)
Rowstore (clustered PK)	6083 ms
Clustered columnstore	3339 ms

The columnstore was nearly twice as fast. Not slower. Faster. Because a bulk insert above 102,400 rows skips the staging buffer entirely and compresses straight into columnar segments, and it writes far less to the transaction log doing it. The folklore predicted a bloodbath. The measurement says the opposite, in the exact case ("loading a lot of data") where people most confidently expect columnstore to fall apart.

Okay, but bulk is the easy case for columnar. The folklore is really about the small writes, the death-by-a-thousand-cuts trickle of single-row inserts. So I ran 1,000 separate single-row inserts into each:

Storage	1,000 single-row inserts
Rowstore (clustered PK)	1038 ms
Clustered columnstore	944 ms

Tied. If anything columnstore edged it, but that's noise. No penalty.

Why no penalty? Because those 1,000 rows never became columnar at all. They landed in the delta store. SQL Server's name for the little rowstore staging area that catches small writes before they're big enough to compress. The rest of us would have called it "the inbox," the place new rows sit until there are enough of them to bother packing. Below 102,400 rows it's literally a B-tree, the same structure rowstore uses. So of course the timings match: at this scale, the two are doing nearly identical physical work. The folklore measured nothing, and the thing it imagined doesn't even kick in until you're a hundred times past where most people think it does.

Okay, that was meaningful. But inserts are the friendly case. Let's see what happens when we update.

Second stop: the updates, where I had to throw out my first attempt

After a few false starts trying to make a loop do 10,000 auto-committed updates until the sun burns out, my AI agent realized one clean set-based UPDATE was the way to go. The loop wasn't measuring the storage engine, it was measuring 10,000 separate commits to a throttled transaction log. A single statement isolates the thing we actually care about. one statement, one transaction, touching 100,000 rows (every 10th row of a million).

UPDATE sales SET amount = amount + 1
WHERE id % 10 = 0;   /* 100,000 rows, single statement */

Storage	UPDATE 100K rows (elapsed)
Rowstore (clustered PK)	2587 ms
Clustered columnstore	989 ms

Columnstore won again. (Honest caveat: the rowstore run ate a pile of parallelism-coordination wait that inflated its number, so treat this as "columnstore was not slower" rather than a hard 2.6x. The direction is solid; the multiple is soft.)

The plan shows how columnstore does an update, and it's the key to the whole story. The update operator carries a Split. SQL Server turns every UPDATE into a delete-plus-insert: it marks the old row dead in a deletion bitmap and writes a fresh copy into the delta store. It never reclaims the dead row's space at update time. Hold onto that, because it's where the folklore finally finds its grain of truth.

Where the folklore is actually right: the dead rows come due

The dead rows come due

The write itself is fine. We've now measured that four different ways. So where does "columnstore is bad at writes" come from? It comes from what the writes leave behind, and it shows up not on the write but on the next read. That Split operator was the clue, so I chased it:

The write is fine but those dead rows have to go somewhere. Attack the bloat case. churn the table with repeated updates, then re-run my GROUP BY read and show me what the dead rows do to it.

Watch. Take the clean columnstore table. read the GROUP BY, it's sub-millisecond, we established that. Now churn it. ten full-table updates, every row rewritten ten times over:

UPDATE sales SET amount = amount + 1;
UPDATE sales SET amount = amount + 1;
-- ...eight more times. 10 passes, 10M row-versions touched.

That's the whole reproduction. No exotic workload, no pathological data. just an UPDATE with no WHERE, run ten times, the kind of thing a nightly batch job does without anyone thinking twice. Then run the exact same read again. Here's what ExoBench found in the row-group physical stats after the churn:

SELECT SUM(total_rows), SUM(deleted_rows), COUNT(*) AS row_groups
FROM sys.dm_db_column_store_row_group_physical_stats ...
-- total_rows: 2,000,000   deleted_rows: 1,000,000   row_groups: 2

The table holds 1M live rows. It's physically storing 2M, half of them dead. tombstoned in the deletion bitmap, not reclaimed. (Note it's 2M, not 11M. the engine consolidates as it goes, so the bloat plateaus rather than growing without bound. The folklore's "it just keeps getting worse forever" is also wrong. it's self-limiting.)

And now the same read that was sub-millisecond on the clean table:

SELECT category_id, COUNT(*), AVG(amount), SUM(amount), MAX(amount)
FROM sales GROUP BY category_id;   /* 768 ms */

768ms. From sub-millisecond to 768ms, on a query we didn't touch, because of writes we did earlier.

Wait, what. Why would 50% dead rows cause a thousand-fold read slowdown rather than a 2x one? The plan tells you, and it's brutal:

Clean:    Clustered Index Scan (ColumnStore)
            ActualLocallyAggregatedRows = 1000000   EstimateIO = 0.003

Churned:  Clustered Index Scan (ColumnStore)
            ActualLocallyAggregatedRows = (gone)     EstimateIO = 1.716
            Batches = 1112  →  separate Hash Match aggregate

ActualLocallyAggregatedRows is gone in the churned plan. The deletion bitmap disabled aggregate pushdown entirely. The query reverted from "aggregate during the scan" to "scan everything, filter the dead rows through the bitmap, then aggregate separately." It didn't get 2x more work. It lost the optimization that made it fast in the first place. The penalty is a plan-shape collapse, not a linear slowdown. That's why it's nonlinear and nasty.

This is the real thing the folklore is groping at, and notice how badly it mangled it. The damage isn't slow writes. The writes were fine. The damage is dead rows that accumulate from UPDATEs and DELETEs and quietly wreck your reads until you do maintenance. The cost is real, it's just been hung on the wrong operation for years.

The full, honest picture across the whole investigation

Operation	Columnstore vs rowstore	Verdict
Analytic read, clean table	~30x faster	Columnstore wins big
Bulk insert (1M rows)	Faster (3.3s vs 6.1s)	Folklore wrong
Trickle insert (1,000 rows)	Tied (~944 vs 1038ms)	Folklore wrong
Bulk UPDATE (100K rows, one statement)	Faster (soft multiple)	Folklore wrong
Analytic read after update churn	sub-ms → 768ms	Folklore right, finally

Five tests. The conventional wisdom got four of them backwards and stated the fifth so imprecisely that it pointed you at the wrong operation. If you had made an architecture decision based on "clustered columnstore has bad write performance", you'd have ruled out columnstore for an append-heavy analytics table that would have run 30x faster on it, to avoid a write penalty that doesn't exist, while never being warned about the update-churn penalty that actually does!

This is the entire reason ExoBench exists. Not because optimization is hard. because the received wisdom about optimization is mostly unmeasured, frequently backwards, and stated without the one thing that would make it actionable: a scale, a shape, and an operation. "Clustered columnstore has horrible write performance" should have been "clustered columnstore makes you schedule a REORGANIZE if you update or delete a lot, or your reads rot." Same concern. Completely different sentence. One of them you can act on.

You don't need to take my word for any of this. That's the whole idea. Run it yourself.

(Note, if you want to see what happens once REORGANIZE see the first Appendix: How to fix it with REORGANIZE)

The honest wall: where this stops working

Measuring SQL Server is new, and I'd rather you know the edges than discover them.

Synthetic data is not your data. ExoBench generated 50 evenly-distributed categories with random amounts. If your real category_id is wildly skewed, or you have 5 million distinct categories instead of 50, the read numbers shift. The columnstore advantage holds, but the exact figures are mine, not yours. Hand the AI your real distribution and re-run.
The AI infers the schema and the data shape. It can be wrong. Always ask it: "show me the schema and the data distribution you used." If the shape is wrong, the benchmark is confidently wrong.
My update-churn test is more extreme than most workloads. Ten full-table updates is a stress test, not a Tuesday. The 768ms is the mechanism landing hard, not a prediction of your production latency. The takeaway is "dead rows disable pushdown," not "expect exactly 768ms."
The cloud instance throttles the transaction log. Several of my write numbers were capped by Azure's log rate governor, equally across storage types, so the ratios are trustworthy but the absolute write times would shrink on dedicated hardware.
It doesn't fix infrastructure. Connection pools, memory grants, concurrent load, deployment topology. none of that is in scope. ExoBench measures the query and the plan, not your cluster.

What it does fix, is the part where you make a storage-engine decision based on a sentence somebody said in a standup three years ago.

Try this on your own slow query

Find your candidate. The slowest aggregation you've got:

SELECT TOP 5
    qs.total_worker_time / qs.execution_count AS avg_cpu,
    qs.execution_count,
    SUBSTRING(qt.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY avg_cpu DESC;

The top result is your candidate. If it's a GROUP BY over a wide table with no selective WHERE, it's a columnstore question, and now you can settle it instead of guessing.

You already have Claude, or Cursor, or whatever you talk to. Head to exobench.ai to connect it, then paste a prompt shaped like the one I opened with, the same shape, pointed at your table:

Here's a slow SQL Server query and its schema [paste yours]. Use ExoBench to benchmark it at 1M and 3M rows, rowstore versus clustered columnstore, and show me the plans. Then update 20% of the rows and re-run the read. I want to see what the dead rows do to it.

That last sentence is the one the folklore can't answer and you now can. The AI translates the whole thing into ExoBench tool calls, spins up Azure SQL, and you read the output.

That GROUP BY that takes 433ms three thousand times a day? You don't have to wonder whether columnstore would fix it. You can know in about 90 seconds.

Appendix: How to fix it with REORGANIZE

The 768ms read collapse isn't permanent. It's a maintenance debt, and you settle it with one command. After the same 10x churn that produced the 50%-dead-row table, I ran:

ALTER INDEX cci_sales ON sales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

Then the same GROUP BY again. Here's the full cycle, all three states measured:

State	Read elapsed	Scan EstimateIO	Aggregate pushdown?
Clean (no churn)	sub-ms	0.003	Yes (1,000,000 locally aggregated)
After 10x UPDATE (50% dead)	768 ms	1.716	No (pushdown disabled)
After REORGANIZE	sub-ms	0.003	Yes (1,000,000 locally aggregated)

The plan came back byte-for-byte to the fast path. EstimateIO dropped from 1.716 back to 0.003. ActualLocallyAggregatedRows = 1000000 reappeared on the scan. Batch-mode aggregate pushdown restored. The penalty is fully reversible.

What REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON) does, in plain terms: it rebuilds the compressed segments and physically drops the dead rows the deletion bitmap was tracking. Once the dead rows are gone, the bitmap is empty, and an empty bitmap is what aggregate pushdown requires to engage. So the read goes right back to crunching everything inside the scan.

The practical rule this gives you, fully measured:

Append-only or insert-mostly (no meaningful UPDATE/DELETE): clustered columnstore, no reservations. No dead rows ever accumulate, so there's nothing to maintain, and you keep the ~30x read forever.
Real UPDATE/DELETE traffic: clustered columnstore is still viable, but you must schedule REORGANIZE on a cadence matched to your churn. Skip it and reads degrade silently. there's no error, no warning, just a GROUP BY that used to be instant and now isn't. That silent rot is the actual trap, far more than the writes ever were.
Heavy churn with no maintenance window possible: this is the genuine case for staying on rowstore, or keeping a rowstore table with a nonclustered columnstore alongside it so the B-tree absorbs the writes and point lookups while the columnar index serves analytics.

One caveat on the recovery number. my REORGANIZE ran on an idle ephemeral instance. In production it runs online but competes with live traffic for CPU and I/O, so the operation itself has a cost I didn't measure here. What's proven is that the read penalty is fully recoverable and the fast plan returns. not that maintenance is free. Nothing is free. But "schedule a REORGANIZE" is a very different, very manageable problem than "columnstore has horrible write performance," which was never true in the first place.

Appendix: Postgres vs SQL Server Reference Data

The Postgres column in the opening table is a contrast point, not a controlled head-to-head, and I'd rather be explicit about why than let you assume the two numbers were produced under identical lab conditions. They weren't. Here's exactly how the two runs differed.

Different host, comparable hardware. SQL Server ran on Azure SQL Database Serverless, Postgres on Neon. two different managed platforms, but provisioned to roughly the same class: approximately 4 vCPU on each side, both backed by SSD storage. So this isn't a case of one engine getting a fatter machine. The compute and storage are close enough that the order-of-magnitude gap in the opening table is the storage format talking, not the tin underneath it. It's still two different managed services rather than one box running both engines, so treat it as a strong reference point, not a lab-controlled benchmark.

Different data generators, same shape. Both tables were 50 categories, uniform distribution, but the synthetic data was generated with each engine's native idiom:

	SQL Server	Postgres
`category_id`	`ABS(CHECKSUM(NEWID())) % 50 + 1`	`floor(random() * 50)::int + 1`
`amount`	fixed `123.45` in the read runs	`(floor(random() * 100000) / 100.0)` random
`created_at`	`DATEADD(MINUTE, -n, GETDATE())`	`NOW() - random() * 365 days`
Row generator	`GENERATE_SERIES`	`generate_series`

The category counts and the row counts match. The amount column doesn't. SQL Server's read benchmarks used a constant, Postgres used a random value. For a GROUP BY category_id that only aggregates amount, this barely moves the read, the grouping key is identical and uniform on both sides, but it's a real difference and you should know it's there.

What the Postgres number actually is. The 109ms / 312ms figures are the Postgres baseline: a parallel sequential scan with 2 workers feeding a partial HashAggregate. That's Postgres at its honest best for this query, no index helped (a covering B-tree was ignored by the planner, and forcing an index-only scan made it slower). The only native lever that moved it was raising the parallel worker count, which bought roughly 25%. Postgres has no columnstore in core, so there is no columnar number to report. that's the whole point of the column.

Bottom line. Read the opening table for the shape: rowstore scales with the data, Postgres scales with the data, columnstore doesn't. That conclusion is robust to every difference listed above, because it's an order-of-magnitude effect and these are second-decimal-place quibbles. If you want a defensible engine-vs-engine number rather than a shape, that's a different benchmark: identical hardware, identical data, and a research question that isn't actually what this post is about.