I was reading The Part of PostgreSQL We Hate the Most, by Bohan Zhang & Andy Pavlo. The article attributes the problems to a "relic of the 1980s". Let's compare with YugabyteDB which uses Postgres to process the SQL but with a distributed storage built for modern infrastructure.
Basically, the OtterTune article is explaining some cons of the MVCC implementation in PostgreSQL, with good explanations, but no nuances. In this post, I'll also mention the pros of it.
Note, I've also updated some wording, and added some precisions, to follow up with the comments from twitter followers. The goal of publishing this and get feedback is not to miss some points or be too biased by the tech I use the most today (Im Developer Advocate for YugabyteDB). All comments highly welcome🙏
Problems
There are 4 problems described in the article
Problem #1: Version Copying
When you update a single byte in a PostgreSQL row, the whole row is copied rather than only the change. This is easy to experiment:
drop table demo;
create table demo as
select generate_series(1,10000000) n, 'N' flag, lpad('x',1000,'x') filler;
vacuum demo;
select pg_size_pretty(pg_table_size('demo'));
explain (analyze, wal, buffers, costs off) update demo set flag='Y';
select pg_size_pretty(pg_table_size('demo'));
The interesting part is when I update the flag, check the WAL generated, and the increase of the table size:
postgresql=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
11 GB
(1 row)
postgresql=> explain (analyze, wal, buffers, costs off)
update demo set flag='Y';
QUERY PLAN
-----------------------------------------------------------------------------
Update on demo (actual time=103872.095..103872.097 rows=0 loops=1)
Buffers: shared hit=34285343 read=1428953 dirtied=2857543 written=1493088
I/O Timings: read=8671.563 write=259.356
WAL: records=20000354 fpi=1428940 bytes=11758071248
-> Seq Scan on demo (actual time=0.540..13265.632 rows=10000000 loops=1)
Buffers: shared read=1428572 written=32324
I/O Timings: read=8669.064 write=127.201
Planning:
Buffers: shared hit=7
Planning Time: 0.061 ms
Execution Time: 103872.487 ms
(11 rows)
select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
22 GB
(1 row)
The table size has doubled (from 11GB to 22GB) and the WAL generated is to full additional size (11758071248 bytes). Basically, all rows have been copied to the newer version with one byte changed, and all blocks involved were logged in the WAL for recovery purposes.
I'm running the same in a YugabyteDB cluster. The WAL size is included in pg_table_size
(to account for the real size as the first level of the LSM-Tree is in memory):
drop table demo;
create table demo as
select generate_series(1,10000000) n
, 'N' flag, lpad('x',1000,'x') filler;
select pg_size_pretty(pg_table_size('demo'));
explain (analyze, buffers, costs off) update demo set flag='Y';
Here is the result:
yugabyte=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
1572 MB
(1 row)
yugabyte=> explain (analyze, buffers, costs off)
update demo set flag='Y';
QUERY PLAN
-------------------------------------------------------------------------------
Update on demo (actual time=539216.462..539216.462 rows=0 loops=1)
-> Seq Scan on demo (actual time=19.515..514476.322 rows=10000000 loops=1)
Planning Time: 0.697 ms
Execution Time: 539264.892 ms
Peak Memory Usage: 27 kB
(5 rows)
yugabyte=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
1762 MB
(1 row)
The table size didn't increase a lot, which is what you can expect when you update only one byte for each row.
Here is a detailed description of the storage format:
Problem #2: Table Bloat
The PostgreSQL autovacuum doesn't reclaim the space from the filesystem. This needs a full reorg (VACUUM FULL) during a maintenance window.
postgres=> vacuum demo;
VACUUM
postgres=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
22 GB
(1 row)
postgres=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
11 GB
(1 row)
In YugabyteDB, there are no holes in the files. To reclaim space from the intermediate versions above the retention (15 minutes by default), the current files are merged to new files by the background compaction. It has the same effect as a VACUUM FULL in the sense that intermediate versions above the retention are removed and the files are packed. But it works at a lower layer with no conflict with ongoing transactions: the SSD files are immutable, holding table rows and index entries. There's no index rebuild needed as all is already sorted in each files. Compaction is just merging multiple SST Files one new files. This runs in the background and, even if it doesn't conflict with ongoing transactions, it consumes some CPU/RAM/IO. If compaction didn't happen for a while, the performance may be impacted by the size and read amplification. In this situation (CPU, RAM or IO) pressure, you can add more nodes to the cluster thanks to horizontal scalability.
I could reduce the table back to its original size (1572 MB) with full compaction, but that's not needed. The compaction happens at some thresholds that I've explained in a previous post:
Testing LSM-Tree merge for Size Amplification in YugabyteDB
Franck Pachot for YugabyteDB Distributed PostgreSQL Database ・ Jan 31 '23
Problem #3: Secondary Index Maintenance
PostgreSQL adds index entries for the new versions without removing the previous entries, until autovacuum is able to do it. This competes with the application activity, and is postponed when long transactions are running. Indexes do not have the visibility information. A consequence of it, not mentioned in the OtterTune article, is that even in case of Index Only Scan the table must be read to get the MVCC visibility. This can be fast on a freshly vacuumed table (because a visibility bitmap is updated) but long if not. Note that the index maintenance can be reduced by reserving enough free space (set with FILLFACTOR) so that the new version of the row fits in the same block (HOT optimization).
PostgreSQL tables are Heap Tables and the primary key is a secondary index. Adding a primary key to my table creates the index, and an access by primary key requires many reads:
postgres=> alter table demo add primary key (n);
ALTER TABLE
postgres=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
11 GB
(1 row)
postgres=> select pg_size_pretty(pg_table_size('demo_pkey'));
pg_size_pretty
----------------
214 MB
(1 row)
postgres=> explain (analyze, buffers, costs off) select * from demo where n=42;
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (actual time=0.861..0.862 rows=1 loops=1)
Index Cond: (n = 42)
Buffers: shared read=4
I/O Timings: read=0.843
Planning:
Buffers: shared hit=13 read=1 dirtied=3
I/O Timings: read=1.390
Planning Time: 2.056 ms
Execution Time: 0.889 ms
(9 rows)
For one row, 4 buffers have been read from the index (the B-Tree levels) and the table.
YugabyteDB stores the table in its primary key, like many other databases. All (table and indexes) are stored as LSM-Tree, with MVCC information included. There is no need for FILLFACTOR, no duplication of rows when updated. Even when a table row moves physically (during automatic re-sharding for example) the index doesn't add any maintenance overhead because it references the primary key and not a physical location.
An Index Scan from a secondary index has to go to the table though its primary key, which is fast in a LSM-Tree, and this can also be skipped with Index Only Scan that never have to read the table.
I've explained covering indexes here:
When I add a primary key in YugabyteDB (but you should better declare the primary key in the CREATE TABLE - I'm doing this for the demo only), the size is the same (there's no additional index) and the access by primary key is only one read:
yugabyte=> alter table demo add primary key (n);
ALTER TABLE
yugabyte=> select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
1314 MB
(1 row)
yugabyte=> select pg_size_pretty(pg_table_size('demo_pkey'));
pg_size_pretty
----------------
(1 row)
yugabyte=> explain (analyze, dist, costs off) select * from demo where n=42;
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (actual time=1.206..1.208 rows=1 loops=1)
Index Cond: (n = 42)
Storage Index Read Requests: 1
Storage Index Execution Time: 2.000 ms
Planning Time: 0.054 ms
Execution Time: 1.242 ms
Storage Read Requests: 1
Storage Write Requests: 0
Storage Execution Time: 2.000 ms
Peak Memory Usage: 0 kB
(10 rows)
The size is actually even smaller because the table has been rewritte to a new one, fully compacted. The access is displayed as Index Scan
but, on the primary key, this is the same ans an Index Only Scan
.
Problem #4: Vacuum Management
Because of bloat and also because of PostgreSQL transaction ID wraparound, vacuum must run frequently, and not be blocked by long transactions. This can be tricky with high a high rate of DML and the main point of the OtterTune article is to mention their tool that makes it easier to monitor.
YugabyteDB doesn't have this bloat problem. The space amplification of SST Files is resolved by the background compaction which doesn't conflict with on ongoing transactions as those read and write only the old SST Files until the newly compacted SST files are available. There is also no transaction ID wraparound as the transactions are sequenced with the cluster Hybrid Logical Clock which is always increasing.
In short, VACUUM is a no-op in YugabyteDB and is there to be compatible with scripts made for PostgreSQL:
yugabyte=# vacuum demo;
WARNING: VACUUM will be ignored
VACUUM
yugabyte=#
Advantages
The OtterTune article doesn't mention any advantages of PostgreSQL MVCC implementation. There are always some tradeoffs.
fast rollback
I have been working a lot with Oracle Database, that I've always considered as the best implementation of MVCC for Heap Tables and B-Trees:
MVCC in Oracle vs. PostgreSQL, and a little no-bloat beauty | by Franck Pachot | Medium
Franck Pachot ・ ・
franckpachot.Medium
However, there's one case where PostgreSQL is better. A rollback in PostgreSQL takes no time:
postgres=> \timing on
Timing is on.
postgres=> begin transaction;
BEGIN
Time: 31.719 ms
postgres=*> delete from demo;
DELETE 10000000
Time: 88588.840 ms (01:28.589)
postgres=*> rollback;
ROLLBACK
Time: 31.856 ms
Even if YugabyteDB has a different implementation of MVCC, the transaction provisional records go to the IntentsDB (another LSM-Tree) to be merge, in the background, to the RegularDB on commit.
YugabyteDB benefits from the same behavior: fast rollback:
yugabyte=> begin transaction;
BEGIN
Time: 31.178 ms
yugabyte=*> delete from demo;
DELETE 10000000
Time: 376758.463 ms (06:16.758)
yugabyte=*> rollback;
ROLLBACK
Time: 31.061 ms
This is not only useful for user rollbacks, but is also critical for the Recovery Time Objective. In case of recovery, the ongoing transaction have to be rolled back before the tables are available.
To do the same, Oracle (or MySQL InnoDB) has to go though the chain of rollback segments to undo all changes one by one before the table is available again. I've seen that also with transactions that never ends and the user asking to kill it. If you kill it, and even if you can restart the database, the rollback has to be done. All Oracle DBAs remember monitoring V$TRANSACTION.USED_UREC to see how it decreases and estimate when it will be available again.
DEMO@o21c_tp> set timing on
DEMO@o21c_tp> create table demo compress as select rownum n, 'N' flag, lpad('x',1000,'x') filler from xmltable('1 to 10000000');
Table DEMO created.
Elapsed: 00:00:24.971
DEMO@o21c_tp> alter table demo add primary key (n);
Table DEMO altered.
Elapsed: 00:00:07.714
DEMO@o21c_tp> delete from demo;
10,000,000 rows deleted.
Elapsed: 00:01:30.074
DEMO@o21c_tp> rollback;
Rollback complete.
Elapsed: 00:02:53.114
DEMO@o21c_tp>
The rollback was longer than the operation itself in this Oracle Autonomous Database.
Note that the impact of it is limited because the rollback happens in the background, so the database is already available for other operations, and long transactions without intermediate commits are usually user errors or application or release bugs.
index types
Another advantage of PostgreSQL MVCC implementation is that it doesn't push the transaction management complexity to the indexes. This is good for extensibility. There are 6 index types with PostgreSQL to be optimized for many use-cases, and it is extensible.
YugabyteDB, because of sharding, and the many capabilities of LSM-Trees, like the Hybrid Scan, may not need so many index types. Modern hardware and the ability to scale out makes Seq Scan efficient for more cases, especially with the pushed down Remote Filter. There's currently two types of indexes (LSM and GIN). GiST is in the roadmap to fully support PostGIS. LSM indexes also provide additional features that doesn't exists with PostgreSQL: loose index scan, scaling out,...
In summary
There are always trade-offs in IT. Some choices were made in different context and traditional databases may still use the same implementation on modern hardware. That doesn't mean that it didn't improve, and still can do. New databases have the possibility of different choices from the initial design. I'm very enthusiastic with any attempts to improve the storage of PostgreSQL: the abandoned zHeap project, the modern Oriole engine, the branching Neon, and of course the distributed YugabyteDB.
It is also important to understand that PostgreSQL is also good enough for many cases and @ryanbooz has written a nice article to counterbalance the many negative ones about PostgreSQL MVCC:
I have presented YugabyteDB as solving everything about MVCC here but there are also some tradeoffs. I have run the PostgreSQL and YugabyteDB on similar instance sizes (4 vCPU 16 GiB RAM on AWS). You have seen some operations being faster on PostgreSQL. That's because PostgreSQL is monolithic and works in shared memory. Fast, but stops in case of failure or maintenance. The YugabyteDB cluster has one instance on each Availability Zone, and the application continues if one AZ is down for planned ur unplanned reasons. This provides High Availability but adds some latency. The most important, before comparing different implementations, is to understand how it works and have the freedom of choice. Stay with Open Source, PostgreSQL or PostgreSQL-compatible, read multiple articles from different point of view. And test everything.