LSM-tree and block storage

Frits Hoogland - Feb 3 '23 - - Dev Community

YugabyteDB uses database storage that implements LSM-tree type storage. LSM-tree storage fundamentally differs from block based storage. Block based storage is used as the main storage implementation for many well known database such as PostgreSQL, Oracle, SQL Server, DB2 and many others.

Block based storage 101

In common block based storage implementation, data is stored in a datafile inside a block. Changes to the data are put in a write ahead or redo log before the change to the data inside the block in the datafile, and then applied to the block, potentially overwriting a previous version. A key observation and check is the existence of the datafile, which is the fundamental and singular place where the data is stored.

LSM-tree storage 101

In common LSM-tree based databases, data is stored as a sequential list of key-value pairs, and from a data perspective no data is overwritten ever, any change is an addition to the stream of key-value pairs.

LSM-tree based databases use a write ahead log too, which contains the records that are added because of inserts, but also updates and even deletes. A delete is an addition indicating a previously inserted record now is deleted.

Upon write ahead log addition, the changes are inserted to the first level of storage in a memtable. A memtable is memory only storage, and thus will be gone when the database stops.

This has some consequences:

  • Creating a table and inserting data will not (immediately) create a datafile, which are called 'SST files'.
  • Creating a table and inserting data will result in a memtable, holding the data.
  • To overcome stopping/starting the database, or anything else that removes the memtable, such as a crash, the write ahead log can be used to reconstruct the memtable.
  • This means that right after table creation and (limited) data insertion, there will be no datafiles (SST files).
  • This looks really weird when you come from a block based database and look into this for the very first time.

Of course a memtable cannot be grown unlimited, eventually any computer system will run out of memory, and the needed write ahead log storage to reconstruct a memtable will be huge. Therefore, a memtable has a configurable, limited size. Once a memtable reaches the set maximum size, the following happens:

  • The full memtable is made 'immutable'.
  • A new memtable is allocated, and will be used by the database.
  • The full memtable is written to disk as SST file.

The entire database now has its data split into two chunks: the memtable and the SST file. When the memtable reaches its maximum size again, the same procedure will happen, and the database is split into a memtable and two SST files, and so on.

The YugabyteDB implementation

This is all generic. If you want to know and see how this is implemented on YugabyteDB, read on:

I created a small and simple RF1 one node YugabyteDB universe, which makes it the most easy to see and understand. The same principles apply to RF3 clusters, which means a tablet will have 3 replicas instead of 1 with RF1.

  1. Create a table with a single tablet:

    create table test ( id int, f1 text ) split into 1 tablets;
    
  2. See what happened in YugabyteDB:

    ➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
    Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
    
    Time between snapshots:    1.306 seconds
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  is_raft_leader                                                                       1 y/n                 +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker                                                                       4096 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB                                                             2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB_MemTable                                                    2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB                                                             2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB_MemTable                                                    2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  raft_term                                                                            1 terms               +0
    

    This shows that the table creation created a tablet with UUID 93bf6aa88e4841c1bbea28eb3bf0bcca, and TWO memtables created: one for IntentsDB and one for RegularDB. This is a YugabyteDB specific implementation detail: we use an IntentDB rocksdb (LSM-tree) database that holds locks and intermediate data, and a regularDB rocksdb (LSM-tree) database that will eventually hold the actual table data. Both LSM-tree databases have allocated a memtable, which must be empty because we have not inserted any data.

  3. Insert some data

    insert into test select id, repeat('x',1000) from generate_series(1,100) id;
    INSERT 0 100
    
  4. Look at the statistics

    ➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
    Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
    
    Time between snapshots:    5.974 seconds
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  is_raft_leader                                                                       1 y/n                 +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker                                                                     331776 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB                                                           198656 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB_MemTable                                                  198656 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB                                                           133120 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB_MemTable                                                  133120 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  raft_term                                                                            1 terms               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  replicated_retryable_request_ranges                                                  1 reqs                +0
    

    Because we inserted more data, the memtables simply grown. The most obvious question now is how far this will grow. This is dependent on the parameter (gflag) memstore_size_mb, which is 128 by default.

  5. Let's insert more data so we fill up the memtable

    insert into test select id, repeat('x',1000) from generate_series(101,100000) id;
    INSERT 0 99900
    
  6. And look at the statistics again:

    ➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
    Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
    
    Time between snapshots:    1.117 seconds
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  is_raft_leader                                                                       1 y/n                 +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  log_wal_size                                                                  75077698 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker                                                                   11800576 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_BlockBasedTable_IntentsDB                                         50239421 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_BlockBasedTable_RegularDB                                            65618 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB                                                         11798528 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_IntentsDB_MemTable                                                11798528 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB                                                             2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  mem_tracker_RegularDB_MemTable                                                    2048 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  raft_term                                                                            1 terms               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  replicated_retryable_request_ranges                                                  1 reqs                +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  rocksdb_current_version_num_sst_files                                                3 files               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  rocksdb_current_version_sst_files_size                                        11854842 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  rocksdb_current_version_sst_files_uncompressed_size                          119355043 bytes               +0
    192.168.66.80:9000   tablet   93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test                  rocksdb_total_sst_files_size                                                  11854842 bytes               +0
    

    The addition of more data filled up the memtable, and made it flush to disk. The IntentDB memtable holds approx. 11MB, the RegularDB memtable holds 2k.

The first thing to look at is the statistic log_wal_size: this is the write ahead log data to reconstruct the memtable. If you look back at the previous statistics you will notice it was not present there: this is a an issue: Change statistic log_wal_size to include both current and previous files. #14480.

The second thing to look at are the statistics that report the SST file information: rocksdb_current_version_num_sst_files: 3; we got 3 SST files, and rocksdb_current_version_sst_files_size and rocksdb_current_version_sst_files_uncompressed_size: an approximation of the size on disk, and an approximation of the size of the files if these are uncompressed.

The last thing that is interesting to see are the statistics mem_tracker_BlockBasedTable_IntentsDB and mem_table_BlockBasedTable_RegularDB statistics. These are the memory areas that are used as buffer cache: in principle, once a memtable is flushed to disk in an SST file, it sits on disk. The BlockBasedTable is a cache to hold SST file data in memory.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .