Fast PITR and MVCC reads with Key-Value LSM Tree

Franck Pachot - Sep 9 - - Dev Community

YugabyteDB uses PostgreSQL code to handle SQL queries without the challenges associated with PostgreSQL storage, such as bloat, vacuum, and transaction ID wraparound. Additionally, it avoids the complexities of a separate undo segment, which can lead to lengthy rollbacks and recovery. The storage system in YugabyteDB, using RocksDB LSM tree, is designed for distribution, replication, and auto-sharding. It leverages modern shared-nothing infrastructure with Solid-State Disks (SSD) commonly found in public and private clouds, where random reads are quick, but sequential writes are better than random writes.

Unlike traditional databases that store tuples in fixed blocks, LSM Trees store them in a key-value format, allowing new versions to be added to a key. To version the new values, YugabyteDB includes the commit time's Hybrid Logical Clock timestamp in the key. This timestamp determines the visibility of Multi-Version Concurrency Control (MVCC) for other transactions. The background compaction process removes intermediate MVCC versions after the specified retention period. When querying, the MVCC snapshot can be built without additional random reads to rollback segments or new tuple copies.


Consider this extreme scenario: I execute a query on the same row twice within a repeatable read transaction while multiple concurrent transactions carry out updates on this row a hundred thousand times. Block-based databases must review all the intermediate versions with random reads to locate the correct one. However, YugabyteDB can quickly access any version since the commit timestamp is part of the key.

I create a one-row table:

create extension orafce;
create table speaking_clock (id bigint, primary key(id asc), message text);
insert into speaking_clock values ( 42 );
Enter fullscreen mode Exit fullscreen mode

I use the following script run with pgbench -t10000 -c 10 to update that same row a hundred thousand times from concurrent transactions:

update speaking_clock set message = (
 select to_char(now(),'"At the third stroke, it will be "HH24SP hours MISP "minutes and" SSSP "seconds"')
);
Enter fullscreen mode Exit fullscreen mode

I use explain analyze to show the time for two selects:

yugabyte=# begin transaction isolation level repeatable read;
BEGIN

yugabyte=*# explain (analyze, dist, debug, costs off, summary off)
yugabyte-*# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=1.163..1.165 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.019 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 2.000
   Metric rocksdb_number_db_next: 3.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000
   Metric rocksdb_iter_bytes_read: 354.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 65.000, count: 1.000
(12 rows)

yugabyte=*# \! pgbench -nt10000 -c10 -f update.sql --max-tries=10

scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
number of transactions retried: 5 (0.005%)
total number of retries: 5
latency average = 105.273 ms
initial connection time = 380.216 ms
tps = 94.991185 (without initial connection time)

yugabyte=*# explain (analyze, dist, debug, costs off, summary off)
yugabyte-*# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=3.648..3.650 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.913 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 3.000
   Metric rocksdb_number_db_next: 6.000
   Metric rocksdb_number_db_seek_found: 2.000
   Metric rocksdb_number_db_next_found: 6.000
   Metric rocksdb_iter_bytes_read: 782.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 118.000, count: 1.000
(12 rows)

yugabyte=*# commit;
COMMIT

Enter fullscreen mode Exit fullscreen mode

The first statement, which reads the table's current value, takes 1.165 milliseconds (2 seeks and 3 nexts in the LSM tree). It immediately finds the key's current version.
The second statement, after 100000 concurrent updates, has to find the version before those updates to read at the same time as the first statement of the repeatable read transaction. It took 3.650 milliseconds (3 seeks and 6 nexts in the LSM tree).
The impact of 100000 MVCC versions in between did only increase the response time from 1ms to 3ms.

After the COMMIT and before any garbage collection, I run the same query again that gets the current state:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
yugabyte-# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=1.005..1.018 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.885 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 2.000
   Metric rocksdb_number_db_next: 3.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000
   Metric rocksdb_iter_bytes_read: 354.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 53.000, count: 1.000
(12 rows)
Enter fullscreen mode Exit fullscreen mode

Each key is sorted with the MVCC timestamp in descending order so that the latest version is immediately read.


Another benefit of this storage method is that RocksDB stores the data in immutable SST files. This means that taking a snapshot for cloning or point-in-time recovery is a fast operation because the SST files serve as the snapshot. Moreover, incremental backups come for free, as the SST files are already incremental.
When combined with MVCC, which allows for fast flashback query to any point within the retention period, obtaining a consistent database state with snapshots taken on different nodes becomes easy. The Hybrid Logical Clock defines the cluster-wide point in time for the clone or recovery process.

Point-in-time recovery | YugabyteDB Docs

Restore data to a specific point in time in YugabyteDB

favicon docs.yugabyte.com
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .