Flashback query in YugabyteDB

Franck Pachot - Nov 15 '22 - - Dev Community

I've been working a lot with Oracle Database where many Flashback features were introduced to help recovering human errors. YugabyteDB is storing all changes into the LSM-Tree for a guaranteed retention of timestamp_history_retention_interval_sec (logically similar to Oracle's Undo Retention) so flashback is possible. I've explained in a previous post how to do the equivalent of Flashback Database. We will see now an equivalent of Flashback Query when you simply want to look at a previous state without changing anything.

In PostgreSQL, there's a possibility only when you can pro-actively define the reading point. I've explained that in a previous post. But YugabyteDB can do better thanks to a feature initially implemented to avoid the latency for read-only operations in geo-distributed deployement: Follower Reads. Note that I don't really like this name, because enabling it set yb_read_from_followers=on allows to read from any replica (tablet peer) and not only the Raft Leader.

This works in a read-only transaction (set default_transaction_read_only=on) not to mix-up two states of the databases: we read from the past but cannot write on anything else than the current state.

Default retention: 15 minutes, default flashback query: 30 seconds

I have a cluster where the retention is set to 15 minutes, which is the default:
timestamp_history_retention_interval_sec=900

I've created a demo table and update the timestamp every second:

create table demo (id int not null primary key, ts timestamptz);
insert into demo values (1, now());
update demo set ts=now() returning *;
\watch 1
Enter fullscreen mode Exit fullscreen mode

I query the current value and measure the gap with my current time:

yugabyte=> \timing on
Timing is on.
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:51:53.264774+00 | 00:00:00.177141
(1 row)

Time: 13.586 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:51:54.283723+00 | 00:00:00.648903
(1 row)
Time: 13.530 ms
Enter fullscreen mode Exit fullscreen mode

When I set the transaction read only and allow to read from followers, I read from a state from 30 seconds ago (which is the default):

yugabyte=> set yb_read_from_followers=on;
SET
Time: 12.489 ms
yugabyte=> set default_transaction_read_only=on;
SET
Time: 12.340 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:52:25.869116+00 | 00:00:30.102699
(1 row)

Time: 13.219 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:52:26.887153+00 | 00:00:30.633921
(1 row)

Time: 13.391 ms

yugabyte=> show yb_follower_read_staleness_ms;

 yb_follower_read_staleness_ms
-------------------------------
 30000
(1 row)
Enter fullscreen mode Exit fullscreen mode

You can see that the time it takes is nearly the same. The reason is that the versions are stored in the LSM-Tree adjacent to the current value. It can even be faster that querying the current value because we can read from any tablet peer.

Flashback query on the past 30 minutes

I can query from any point in time within the retention. Here is an example with 1 minute, 5 minutes, 15 minutes and even 30 minutes ago:

yugabyte=> set yb_follower_read_staleness_ms=60000;
SET
Time: 12.022 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:54:39.262261+00 | 00:01:00.010521
(1 row)

Time: 13.377 ms
yugabyte=> set yb_follower_read_staleness_ms=300000;
SET
Time: 12.041 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:50:40.971803+00 | 00:05:01.021684
(1 row)

Time: 13.337 ms
yugabyte=> set yb_follower_read_staleness_ms=900000;
SET
Time: 12.020 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:40:43.532422+00 | 00:15:00.742104
(1 row)

Time: 13.351 ms
yugabyte=> set yb_follower_read_staleness_ms=1800000;
SET
Time: 12.021 ms
yugabyte=> select *, now()-ts gap from demo;
 id |              ts               |       gap
----+-------------------------------+-----------------
  1 | 2022-11-15 09:25:46.484078+00 | 00:30:00.521903
(1 row)

Time: 15.601 ms
Enter fullscreen mode Exit fullscreen mode

The time to get that slightly increases, but it is minimal because all versions are stored with the key. There's nothing like long rollback segments here. This is a native feature of the LSM-Tree.

I was able to read from 30 minutes ago, even with timestamp_history_retention_interval_sec set to 15 minutes in my cluster, because I was lucky. The versions above the retention are cleared by the background compaction to save space, which can happen later.

If I try to go too far I get an error:

yugabyte=> set yb_follower_read_staleness_ms=3600000;
ERROR:  cannot enable yb_read_from_followers with a staleness of less than 2 * (max_clock_skew = 500000 usec)
Time: 1018.724 ms (00:01.019)
Enter fullscreen mode Exit fullscreen mode

The error message is not correct. I've opened a github issue for that: [Allow yb_follower_read_staleness_ms higher than 35 minutes]. This feature was initially designed for short staleness for performance only. The message here is the one when trying flashback query on a time that is too short:

yugabyte=> set yb_follower_read_staleness_ms=1;
ERROR:  cannot enable yb_read_from_followers with a staleness of less than 2 * (max_clock_skew = 500000 usec)
Time: 1018.962 ms (00:01.019)
Enter fullscreen mode Exit fullscreen mode

The reason for this is simple: YugabyteDB is distributed, and each node may have a time (synchronized by NTP) that is slightly different. We define the maximum clock skew (max_clock_skew_usec) to know when we read a state that cannot be guaranteed to be the latest. We rely on this to know how stale is a Follower. For Flashback Query, this is not a problem as you can wait 15 seconds and try your query again.

Flashback query as of timestamp

What if you don't want to set a stale value in milliseconds but an absolute time?

Here is an example:

update pg_settings set setting=round(1000*extract(epoch from 
 now()-timestamp '2022-11-15 10:10:10'
)) where name='yb_follower_read_staleness_ms';

update pg_settings set setting=true where name='yb_read_from_followers';

update pg_settings set setting=true where name='default_transaction_read_only';

Enter fullscreen mode Exit fullscreen mode

Using UPDATE to change a parameter (to the the time arithmetic) is a little tricky, as the transaction must be read-write for that. That's why I've set read-only after it. Here is the output:

as of timestamp

Performance

This uses a feature that is built into the way YugabyteDB stores table rows and index entries: documents hashed and sorted on the table (or index) key, then sorted by timestamp. Usually, queries access to the most recent entry but going to the next one is not a hard work. That's completely different from Oracle (which has to clone the blocks, go through the undo log, apply the rollback segments) or PostgreSQL (which has to follow the pointers though tuples at different location)

Here is a quick test using my ybwr on this demo table which has 3 tablets across 3 nodes.

-- load YBWR to gather tablet server statistics

\! curl -s https://raw.githubusercontent.com/FranckPachot/ybdemo/main/docker/yb-lab/client/ybwr.sql | grep -v '\watch' > ybwr.sql
\i ybwr.sql

-- execute 1000 updates

DO $$
 begin
  for i in 1..1000 loop
   update demo set ts=now();
   commit;
  end loop;
 end;
$$;

-- query the current value and get statistics

execute snap_reset;
select *, now()-ts gap from demo; 
execute snap_tablet;

-- query the 30 seconds before state and get statistics

execute snap_reset;
set yb_read_from_followers=on;
set default_transaction_read_only=on;
select *, now()-ts gap from demo; 
set yb_read_from_followers=off;
set default_transaction_read_only=off;
execute snap_tablet;
Enter fullscreen mode Exit fullscreen mode

Here are the RocksDB stats for the current read: 1 seek() to go to the start of each tablet (this is a Seq Scan) and 1 next to find the first row (I have only one row here).

 value | namespace_name |  table_name  |         metric_name          |            tablet_id             | host | Raft | table
-------+----------------+--------------+------------------------------+----------------------------------+------+------+-------
   141 | yugabyte       | demo         | rocksdb_db_iter_bytes_read   | fb420658e47e4667a06312863f64070d | 10   | L    |   141
     2 | yugabyte       | demo         | rocksdb_number_db_next       | fb420658e47e4667a06312863f64070d | 10   | L    |     2
     2 | yugabyte       | demo         | rocksdb_number_db_next_found | fb420658e47e4667a06312863f64070d | 10   | L    |     2
     1 | yugabyte       | demo         | rocksdb_number_db_seek       | 24d167c71b0042f68c96c0a7db9cd297 | 10   | L    |     4
     1 | yugabyte       | demo         | rocksdb_number_db_seek       | da2d1ad1700d4a7697eb7f880dc83398 | 10   | L    |     4
     2 | yugabyte       | demo         | rocksdb_number_db_seek       | fb420658e47e4667a06312863f64070d | 10   | L    |     4
     1 | yugabyte       | demo         | rocksdb_number_db_seek_found | fb420658e47e4667a06312863f64070d | 10   | L    |     1
Enter fullscreen mode Exit fullscreen mode

Here are the RocksDB stats for the read from before those 1000 inserts: 1000 additional next() to get down the previous versions, no additional seek() because they are all stored in the same range.

 value | namespace_name |  table_name  |             metric_name              |            tablet_id             | host | Raft | table
-------+----------------+--------------+--------------------------------------+----------------------------------+------+------+-------
     1 | yugabyte       | demo         | consistent_prefix_read_requests      | 24d167c71b0042f68c96c0a7db9cd297 | 10   | L    |     3
     1 | yugabyte       | demo         | consistent_prefix_read_requests      | da2d1ad1700d4a7697eb7f880dc83398 | 10   |      |     3
     1 | yugabyte       | demo         | consistent_prefix_read_requests      | fb420658e47e4667a06312863f64070d | 10   |      |     3
     1 | yugabyte       | demo         | pgsql_consistent_prefix_read_rows    | fb420658e47e4667a06312863f64070d | 10   |      |     1
 47141 | yugabyte       | demo         | rocksdb_db_iter_bytes_read           | fb420658e47e4667a06312863f64070d | 10   |      | 47141
  1002 | yugabyte       | demo         | rocksdb_number_db_next               | fb420658e47e4667a06312863f64070d | 10   |      |  1002
  1002 | yugabyte       | demo         | rocksdb_number_db_next_found         | fb420658e47e4667a06312863f64070d | 10   |      |  1002
     1 | yugabyte       | demo         | rocksdb_number_db_seek               | 24d167c71b0042f68c96c0a7db9cd297 | 10   | L    |     4
     1 | yugabyte       | demo         | rocksdb_number_db_seek               | da2d1ad1700d4a7697eb7f880dc83398 | 10   |      |     4
     2 | yugabyte       | demo         | rocksdb_number_db_seek               | fb420658e47e4667a06312863f64070d | 10   |      |     4
     1 | yugabyte       | demo         | rocksdb_number_db_seek_found         | fb420658e47e4667a06312863f64070d | 10   |      |     1
     1 | yugabyte       | demo         | rocksdb_number_superversion_acquires | da2d1ad1700d4a7697eb7f880dc83398 | 10   |      |     2
     1 | yugabyte       | demo         | rocksdb_number_superversion_acquires | fb420658e47e4667a06312863f64070d | 10   |      |     2
Enter fullscreen mode Exit fullscreen mode

One other big difference is that the current state reads were all from the Raft Leader (L) tablet peer, but the flashback query has read from the followers so that it is all local in this 3 nodes cluster.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .