What's the equivalent of pageinspect in YugabyteDB?

Franck Pachot - Mar 19 - - Dev Community

In a previous post, Postgres dead tuple space reused without vacuum, I used pageinspect to examine the internals of MVCC in PostgreSQL. YugabyteDB is PostgreSQL-compatible at the query layer but has a different storage layer, so pageinspect doesn't apply. The storage is adapted from RocksDB, and we can use the YugabyteDB version of sst_dump.

This demo is similar to @denismagda's PostgreSQL MVCC Backstage, but on YugabyteDB, and we will talk about it during the next Open Hours.

I start a docker container from the YugabyteDB image:

docker run -it yugabytedb/yugabyte:2.20.2.0-b145 bash

I start a single-node cluster:

export PATH="/home/yugabyte/bin:$PATH"
cd
yugabyted destroy
yugabyted start --advertise_address 0.0.0.0 --tserver_flags=ysql_enable_packed_row=false

I disable the Packed Rows optimization to make the observation easier. We will talk about it later.

I connect to it:

yugabyted connect ysql

I create the account table for the demo with a single row:

create table account(id int primary key, balance money, comment text);
insert into account values (1, 500, 'Deposit #1');
select * from account;
\q

I have added a text column that is easier to find in the file (when encryption is disabled, of course).

Find the row on the disk

The yugabyted status output mentions the Data Directory location, which is /root/var/data in this container.

Let's search for my "Deposit" text under this directory:

sh-4.4# grep -r Deposit /root/var/data
Binary file /root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#

The only presence on disk is in the WAL - Write Ahead Logging.

Write Ahead Logging

I can check what it looks like:

sh-4.4# strings /root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 | grep -C3 -iE "^|Deposit.*"
yugalogf
* e2cada1498c64837a7ca7cba189093db0
balance
comment
public@
a@a
a@aH
SDeposit #1
[e}D
sh-4.4#

I can see "Deposit #1" prefixed with an "S" because it is stored as a String. Now you know why you should always enable encryption in a production environment. It is free with YugabyteDB, which is fully Open Source.

For the moment, my row is visible only in the WAL, which is written ahead, and not in any data file:

sh-4.4# grep -r Deposit .
Binary file ./wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#

The reason is that the first level of the LSM-Tree, called MemTable or MemStore, is in memory. It is protected by the WAL but stays in memory only until it is flushed.
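To make this write path concrete, here is a toy Python model of it. It is only a sketch of the general LSM-Tree idea, not YugabyteDB code: the class name ToyLsm and its methods are made up for illustration, and the "files" are plain Python lists.

```python
# Toy model of an LSM-Tree write path (illustrative only, not YugabyteDB code).
class ToyLsm:
    def __init__(self):
        self.wal = []        # append-only log, stands in for the WAL file
        self.memtable = {}   # first level of the LSM-Tree, in memory only
        self.sst_files = []  # immutable sorted runs, stand in for SST files

    def put(self, key, value):
        self.wal.append((key, value))  # logged first, to survive a crash
        self.memtable[key] = value     # then applied in memory

    def flush(self):
        # Write the MemTable content as one sorted, immutable run.
        if self.memtable:
            self.sst_files.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sst_files):  # newest run first
            for k, v in sst:
                if k == key:
                    return v
        return None


db = ToyLsm()
db.put("id=1/comment", "Deposit #1")
print(len(db.wal), len(db.sst_files))  # 1 0: in the WAL, no SST file yet
db.flush()
print(len(db.wal), len(db.sst_files))  # 1 1: the flush created one SST file
```

Before the flush, the value can be found only in the log and in memory, which is the situation the grep above shows.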

Flush

I can force a flush and see the SST file:

sh-4.4# yb-ts-cli flush_all_tablets
Successfully flushed all tablets

sh-4.4# grep -r Deposit .

Binary file ./data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/000010.sst.sblock.0 matches
Binary file ./wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#

SST file

The SST file is contained in a directory that stores the RocksDB datastore for the tablet:

sh-4.4# cd $(dirname $(grep -rl Deposit /root/var/data/yb-data/tserver/data/rocksdb/))
sh-4.4# ls -l

total 160
-rw-r--r--. 1 root root     0 Mar 19 16:39 000003.log
-rw-r--r--. 1 root root 66381 Mar 19 16:49 000010.sst
-rw-r--r--. 1 root root    78 Mar 19 16:49 000010.sst.sblock.0
-rw-r--r--. 1 root root    16 Mar 19 16:49 CURRENT
-rw-r--r--. 1 root root    37 Mar 19 16:39 IDENTITY
-rw-r--r--. 1 root root     0 Mar 19 16:39 LOCK
-rw-r--r--. 1 root root   472 Mar 19 16:49 MANIFEST-000011
-rw-r--r--. 1 root root  4300 Mar 19 16:39 OPTIONS-000007
-rw-r--r--. 1 root root  4301 Mar 19 16:39 OPTIONS-000009

I can check the character strings in the SST file:

sh-4.4# strings 000010.sst.sblock.0 | grep -C3 -iE "^|Deposit.*"

SDeposit #1

sh-4.4#

However, we have a tool to decode it and show the RocksDB key-value documents stored by YugabyteDB:

sh-4.4# sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 16:56:42.758445  2920 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 16:56:42.758615  2920 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000010.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"

sh-4.4#

I'll explain some options later. Each SQL row is a document with a key (DocKey), a sub-document for the row itself (SystemColumnId(0)) and one for each non-key column (ColumnId(1),ColumnId(2)).

You can recognize the values for these non-key columns: 50000 for the balance (the money type stores $500.00 as an integer number of cents) and Deposit #1 for the comment.

The key is the id, with the value [1] prefixed by a hash code 0x1210.
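If you want to post-process this output, the parts described so far (hash code, key value, column id, value) can be pulled out of each line with a regular expression. The following is a minimal sketch: the pattern is written only against the decoded_regulardb lines shown in this post, the function name parse is mine, and I am assuming that the w: field is a write index within the batch. Other versions of sst_dump, or packed rows, may print different shapes.

```python
import re

# Pattern written against the decoded_regulardb lines shown in this post.
LINE = re.compile(
    r"SubDocKey\(DocKey\((0x[0-9a-f]+), \[(.*?)\], \[\]\), "
    r"\[(?:(SystemColumnId|ColumnId)\((\d+)\); )?"
    r"HT\{ physical: (\d+)(?: w: (\d+))? \}\]\) -> (.*)"
)


def parse(line):
    """Return the fields of one decoded SubDocKey line, or None if no match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    hash_code, key, col_kind, col_id, physical, write_id, value = m.groups()
    return {
        "hash": int(hash_code, 16),
        "key": key,
        "column": (col_kind, int(col_id)) if col_kind else None,
        "physical_us": int(physical),
        "write_id": int(write_id or 0),  # assumption: w: is a write index
        "value": value,
    }


sample = ('SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); '
          'HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"')
row = parse(sample)
print(row["hash"], row["key"], row["column"], row["value"])
```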

The Multi-Version Concurrency Control uses a Hybrid Time (HT) with a physical component (1710866355184535 is the epoch in microseconds for Tue Mar 19 16:39:15.184535 UTC 2024) and a logical component, maintained like a Lamport clock, that keeps events ordered despite clock skew.
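The physical component is easy to convert yourself. Here is a small Python sketch (the helper name physical_to_utc is mine) that turns the microseconds since the epoch into a UTC timestamp:

```python
from datetime import datetime, timedelta, timezone


def physical_to_utc(physical_us):
    # Integer microseconds added to the epoch, so no float rounding occurs.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=physical_us)


ts = physical_to_utc(1710866355184535)
print(ts.isoformat())  # 2024-03-19T16:39:15.184535+00:00
```

This covers only the physical part of the Hybrid Time. None of the lines in this demo print a logical component.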

I used --formatter_tablet_metadata to decode packed rows, an optimization that stores all column values in one key. The tablet metadata is stored in protobuf format and can be decoded with yb-pbc-dump:

sh-4.4# yb-pbc-dump /root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
yb.tablet.RaftGroupReplicaSuperBlockPB 0
-------
primary_table_id: "000033bd000030008000000000004000"
raft_group_id: "e2cada1498c64837a7ca7cba189093db"
tablet_data_state: TABLET_DATA_READY
partition {
  partition_key_start: ""
  partition_key_end: ""
}
wal_dir: "/root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db"
kv_store {
  kv_store_id: "e2cada1498c64837a7ca7cba189093db"
  rocksdb_dir: "/root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db"
  tables {
    table_id: "000033bd000030008000000000004000"
    table_name: "account"
...

Here, I disabled Packed Rows to show the values in the WAL, but with this metadata, they would be displayed in the same way even if, physically, the Sub Documents are stored as one Document.

I decoded the SST file with --output_format=decoded_regulardb because I'm looking at changes committed to RegularDB, clean from any transaction intents.
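The metadata path passed to --formatter_tablet_metadata is built with the shell expansion ${PWD#*/tablet-}, which strips everything up to and including "/tablet-" from the current directory and leaves the tablet ID. As a sketch of the same derivation in Python (the function name and the default tserver_dir are assumptions that match this container's layout only):

```python
def tablet_meta_path(rocksdb_tablet_dir,
                     tserver_dir="/root/var/data/yb-data/tserver"):
    # Same effect as ${PWD#*/tablet-} on these paths:
    # keep what follows the first "/tablet-", which is the tablet ID.
    tablet_id = rocksdb_tablet_dir.split("/tablet-", 1)[1]
    return f"{tserver_dir}/tablet-meta/{tablet_id}"


rocksdb_dir = ("/root/var/data/yb-data/tserver/data/rocksdb/"
               "table-000033bd000030008000000000004000/"
               "tablet-e2cada1498c64837a7ca7cba189093db")
print(tablet_meta_path(rocksdb_dir))
```

Note that "/table-" in the parent directory name does not match "/tablet-", so only the last path component is split.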

To understand the hash value 0x1210 for the value 1 you can use select yb_hash_code(1):

sh-4.4# yugabyted connect ysql
ysqlsh (11.2-YB-2.20.2.0-b0)

yugabyte=# select '0x'||to_hex(yb_hash_code(id)), * from account;
 ?column? | id | balance |  comment
----------+----+---------+------------
 0x1210   |  1 | $500.00 | Deposit #1
(1 row)

yugabyte=#

More rows inserted

Let's insert two more rows and flush the MemTable again:

yugabyte=# insert into account values (2, 600, 'Another'), (3, 700, 'One More');

INSERT 0 2

yugabyte=# \!  yb-ts-cli flush_all_tablets

Successfully flushed all tablets

yugabyte=# \!  sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:12:33.443876  3027 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:12:33.444073  3027 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000010.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"

Process ./000012.sst

Sst file format: block-based
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#


The two new rows follow the same format, but in another SST file.

Let's compare with PostgreSQL. YugabyteDB has no CTID, the physical location in PostgreSQL heap tables, because table rows are stored by their key, so the equivalent is the key itself (visible as DocKey). YugabyteDB has no XMIN because it uses the Hybrid Time, which will never wrap around. YugabyteDB has no XMAX because new versions are stored under the same DocKey, so the end of visibility of a version is marked by the Hybrid Time of the next version.

All versions are stored together which avoids all the random reads you will find in PostgreSQL or Oracle to find the right version. Of course, after many flushes, the key may have versions in multiple SST files and this is why those are compacted in the background.
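The visibility rule that replaces XMIN and XMAX can be sketched in a few lines of Python. This is a simplified model, not the DocDB implementation: the function name read_at is mine, the hybrid times are small made-up integers, and the model looks at one column of one DocKey only.

```python
def read_at(versions, read_time):
    """versions: (hybrid_time, value) pairs for one column of one DocKey.

    A version is visible from its own hybrid time until the hybrid time
    of the next version, so the reader takes the newest one that is not
    after its read time.
    """
    visible = [(ht, v) for ht, v in versions if ht <= read_time]
    if not visible:
        return None  # the column did not exist yet at read_time
    return max(visible, key=lambda hv: hv[0])[1]


# Hypothetical hybrid times, chosen only for readability.
balance = [(100, 50000), (200, 10000)]
print(read_at(balance, 50), read_at(balance, 150), read_at(balance, 250))
```

A snapshot at time 150 still sees 50000 even though a newer version exists, without any separate end-of-visibility marker on the old version.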

Updating a column

When a column is updated, YugabyteDB doesn't copy the whole row like PostgreSQL but simply adds the new version of the column as a new SubDocKey:

yugabyte=# update account set balance=100 where id = 1;

UPDATE 1

yugabyte=# \!  yb-ts-cli flush_all_tablets

Successfully flushed all tablets

yugabyte=# \!  sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:21:15.816469  3076 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:21:15.816608  3076 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000010.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"

Process ./000012.sst

Sst file format: block-based
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"

Process ./000013.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
yugabyte=#


The newly flushed file contains the new value, 10000, at the new timestamp, 1710868871792282. A read now has to seek to the key in all SST files and read the four subdocuments:

yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from account where id = 1;
                                     QUERY PLAN
-------------------------------------------------------------------
 Index Scan using account_pkey on account (actual time=1.089..1.092 rows=1 loops=1)
   Index Cond: (id = 1)
   Storage Table Read Requests: 1
   Metric rocksdb_number_db_seek: 1.000
   Metric rocksdb_number_db_next: 4.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000

Compaction

To avoid reading from too many files, they are compacted in the background. For this small demo, just as I flushed manually, I compact manually:

yugabyte=# \!  yb-ts-cli compact_all_tablets
Successfully compacted all tablets

yugabyte=# \!  sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:26:01.916337  3118 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:26:01.916476  3118 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000014.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"

yugabyte=#

Now, all versions are clustered in the same file and fast to read. The versions must be kept for MVCC to allow long queries to read a consistent snapshot but, at least, they are not scattered throughout the database.
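The ordering seen in the compacted file (same DocKey together, then the column, then the newest version first) is what you get from merging sorted runs. Here is a minimal Python sketch of that merge, under a few assumptions of mine: entries are plain tuples, SystemColumnId(0) is treated as column 0, and the sort key is a simplification of the real key encoding.

```python
import heapq


def sort_key(entry):
    doc_key, column_id, hybrid_time, _value = entry
    return (doc_key, column_id, -hybrid_time)  # newest version first


def compact(*sst_files):
    # Each input run is already sorted by sort_key; the merge keeps that order.
    return list(heapq.merge(*sst_files, key=sort_key))


row1 = (0x1210, 1)  # (hash code, id), as shown in the DocKey
sst_10 = [(row1, 0, 1710866355184535, None),
          (row1, 1, 1710866355184535, 50000),
          (row1, 2, 1710866355184535, "Deposit #1")]
sst_13 = [(row1, 1, 1710868871792282, 10000)]

merged = compact(sst_10, sst_13)
for entry in merged:
    print(entry)
```

The new balance lands right before the old one, between the row marker and the comment, which is the layout of the first four lines of the dump above.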

MVCC retention

By default, the intermediate versions are kept for 15 minutes, defined by timestamp_history_retention_interval_sec=900. For this demo, I reduce it to 60 seconds, just for the time it takes to run a compaction:

yugabyte=# \! yb-ts-cli set_flag --force timestamp_history_retention_interval_sec 60

yugabyte=# \! yb-ts-cli compact_all_tablets
Successfully compacted all tablets

yugabyte=# \! yb-ts-cli set_flag --force timestamp_history_retention_interval_sec 900

A new SST file now contains only the final versions, replacing the other SST files:

yugabyte=# \!  sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:30:28.030417  3182 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:30:28.030556  3182 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000015.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"

yugabyte=#

The result is visible as one fewer subdocument to read:

yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from account where id = 1;
                                     QUERY PLAN
-------------------------------------------------------------------
 Index Scan using account_pkey on account (actual time=1.089..1.092 rows=1 loops=1)
   Index Cond: (id = 1)
   Storage Table Read Requests: 1
   Metric rocksdb_number_db_seek: 1.000
   Metric rocksdb_number_db_next: 3.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000

Note that with Packed Rows, the document would be:

SubDocKey(DocKey(0x1210, [1], []), [HT{ physical: 1710878213683630 }]) -> { 1: 50000 2: "Deposit #1" }
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710878309381790 }]) -> 10000
SubDocKey(DocKey(0xc0c4, [2], []), [HT{ physical: 1710878305322497 }]) -> { 1: 60000 2: "Another" }
SubDocKey(DocKey(0xfca0, [3], []), [HT{ physical: 1710878305322497 w: 1 }]) -> { 1: 70000 2: "One More" }

with all columns in one SubDocument and fewer next operations:

   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 2.000

Transaction intents

Many databases like PostgreSQL or Oracle write the transaction intents, with lock information, in the final blocks and have to clean them up later. YugabyteDB doesn't pollute the RegularDB with those but stores them as provisional records in an IntentsDB until they are committed. The RocksDB iterators are efficient at merging from multiple files (the M in LSM-Tree stands for Merge) and YugabyteDB leverages this by using two LSM-Trees per tablet.

I open a transaction to update the row and look at the SST Files before I commit:

yugabyte=# begin transaction;

BEGIN

yugabyte=# update account set balance=100 where id = 1;

UPDATE 1

yugabyte=# \! yb-ts-cli flush_all_tablets

Successfully flushed all tablets

yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:39:03.068542  3268 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:39:03.068686  3268 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000015.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"

yugabyte=#

I took care of flushing the MemTable, but there's nothing new in RegularDB. The ongoing changes are stored in the IntentsDB in a sibling directory:

yugabyte=# \! pwd

/root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db

yugabyte=# \! ls ..

tablet-e2cada1498c64837a7ca7cba189093db  tablet-e2cada1498c64837a7ca7cba189093db.intents  tablet-e2cada1498c64837a7ca7cba189093db.snapshots

yugabyte=# \! ls $PWD.intents

000003.log  000012.sst  000012.sst.sblock.0  CURRENT  IDENTITY  LOCK  MANIFEST-000011  OPTIONS-000007  OPTIONS-000009

yugabyte=#

This is where I use --output_format=decoded_intentsdb because IntentsDB has more information about the transactions involved:

yugabyte=# \! sst_dump --command=scan --output_format=decoded_intentsdb --file=$PWD.intents --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:40:49.912428  3293 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:40:49.912580  3293 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process /root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db.intents/000012.sst

Sst file format: block-based
SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 2 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) none
SubDocKey(DocKey(0x1210, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 1 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) none
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1710869935661132 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) WriteId(0) 10000
TXN META 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 -> { transaction_id: 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 isolation: SNAPSHOT_ISOLATION status_tablet: 5c0155f301a44123a82f95a9fb3d0f87 priority: 10096702022479923001 start_time: { physical: 1710869935658346 } locality: GLOBAL old_status_tablet: }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 } -> SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1710869935661132 }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 w: 1 } -> SubDocKey(DocKey(0x1210, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 1 }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 w: 2 } -> SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 2 }

yugabyte=#

My ongoing changes still use the DocKey and SubDocKey but hold additional information about locks. For example, the update has acquired strong locks on the column (kStrongRead, kStrongWrite) to let other transactions know that they cannot write to the same column. It also acquires weaker locks (kWeakRead, kWeakWrite) at higher levels, to detect conflicts faster. The IntentsDB also adds references to the transaction ID with two-way indexing: one to know whether the transaction is committed when seeing a change, the other to find the intents to clean up after the transaction commits or rolls back.
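To illustrate the shape of these records, here is a toy Python model of what one column update produces: a strong intent on the column, weak intents on the enclosing row and table, and a reverse index keyed by transaction. Everything here is an assumption made for illustration (function names, dictionary layout, the "strong"/"weak" labels); it mirrors the dump above, not the on-disk encoding.

```python
def intents_for_write(txn_id, doc_key, column_id, value):
    """Return (intents, reverse_index) for one column write (toy shapes)."""
    targets = [
        ((doc_key, column_id), "strong", value),  # the column being written
        ((doc_key, None), "weak", None),          # the enclosing row
        ((None, None), "weak", None),             # the whole table
    ]
    intents = {}  # intent key -> (transaction, provisional value)
    reverse = {}  # (transaction, write index) -> intent key
    for write_id, (path, strength, val) in enumerate(targets):
        intents[(path, strength)] = (txn_id, val)
        reverse[(txn_id, write_id)] = (path, strength)
    return intents, reverse


def cleanup(intents, reverse, txn_id):
    # The reverse index finds every intent of a transaction without a scan.
    for key in [k for k in reverse if k[0] == txn_id]:
        intents.pop(reverse.pop(key), None)


intents, reverse = intents_for_write("txn-1", (0x1210, 1), 1, 10000)
print(len(intents), len(reverse))
cleanup(intents, reverse, "txn-1")
print(len(intents), len(reverse))
```

The forward direction answers "which transaction wrote this provisional value?" and the reverse direction answers "which intents belong to this transaction?", which is the two-way indexing described above.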

Commit

On commit, the new state is written to RegularDB and the Intents are deleted.

yugabyte=# commit;
COMMIT

yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets

yugabyte=# \! ls $PWD.intents
000003.log  CURRENT  IDENTITY  LOCK  MANIFEST-000011  OPTIONS-000007  OPTIONS-000009

yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:56:12.125682  3389 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:56:12.125821  3389 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []

Process ./000015.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"

Process ./000016.sst

Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710870886389512 }]) -> 10000
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets

The session that commits doesn't wait for the intents to be written to RegularDB. As soon as the transaction status is updated, the provisional records are visible to other transactions.
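A reader can therefore be thought of as checking the transaction status whenever it meets a provisional record. The sketch below is a deliberately simplified model of that idea: the function name, the status strings "PENDING" and "COMMITTED", and the dictionary of statuses are hypothetical, and it ignores the comparison between commit time and read time that a real snapshot read needs.

```python
def read_column(regular_value, intent, txn_status):
    """intent: (txn_id, provisional_value) or None.
    txn_status: dict mapping txn_id to a state string (hypothetical)."""
    if intent is not None:
        txn_id, provisional = intent
        if txn_status.get(txn_id) == "COMMITTED":
            # Committed but not yet moved from IntentsDB to RegularDB.
            return provisional
    return regular_value


intent = ("txn-1", 10000)
print(read_column(50000, intent, {"txn-1": "PENDING"}))
print(read_column(50000, intent, {"txn-1": "COMMITTED"}))
```

In this model the commit is a single status change, and moving the intents to RegularDB is housekeeping that can happen afterwards.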

Delete

A delete is like an update with a "tombstone" marker:

yugabyte=# delete from account;
DELETE 3

yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets

yugabyte=# \! yb-ts-cli compact_all_tablets
Successfully compacted all tablets

yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:59:46.082562  3448 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:59:46.082723  3448 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000018.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [HT{ physical: 1710871148344769 }]) -> DEL
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710870886389512 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [HT{ physical: 1710871148344769 w: 1 }]) -> DEL
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [HT{ physical: 1710871148344769 w: 2 }]) -> DEL
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#

The end of life of the rows is marked by -> DEL. All versions stay for the 15-minute retention period but are stored together. The next compaction after the retention period has passed will remove them all and create a new, much smaller SST file.
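The retention rule applied by compaction can be sketched as follows. This is a simplified Python model under my own assumptions: it handles the versions of a single key, uses small made-up hybrid times, and represents the tombstone with a DEL sentinel. In the dump above the tombstone sits at the DocKey level and covers all columns of the row, which this toy does not model.

```python
DEL = object()  # stands in for the tombstone printed as DEL


def gc_versions(versions, history_cutoff):
    """versions: (hybrid_time, value) pairs for one key, in any order.

    Keep every version newer than the cutoff, plus the newest version at
    or before the cutoff (a read at the cutoff still needs it), unless
    that version is a tombstone.
    """
    ordered = sorted(versions, key=lambda hv: hv[0], reverse=True)
    kept = [hv for hv in ordered if hv[0] > history_cutoff]
    older = [hv for hv in ordered if hv[0] <= history_cutoff]
    if older and older[0][1] is not DEL:
        kept.append(older[0])
    return kept


# Hypothetical hybrid times, chosen only for readability.
balance = [(100, 50000), (200, 10000)]
print(gc_versions(balance, 150))  # both kept: a read at 150 needs 50000
print(gc_versions(balance, 250))  # only the final version remains
print(gc_versions(balance + [(300, DEL)], 350))  # nothing remains
```

The second call corresponds to the compaction run with the reduced retention earlier, and the third to the compaction that follows a delete once the retention has passed.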
