B-Tree vs. LSM-Tree: measuring the write amplification on Oracle Database and YugabyteDB

Franck Pachot - Mar 21 - - Dev Community

Databases maintain indexes to find a value or a range of values without scanning the whole table. The index is physically organized and ordered by the values you may look for so that each value has a logical place to find it.

With a tree structure over the index entries, the database can find this place with a few random reads for a minimal read amplification.

There are two major index structures used in databases:

  • B-Tree: The tree's small height, which increases logarithmically with the number of index entries, minimizes the read amplification required to find the index entry. However, maintaining this structure involves variable write amplification to keep it balanced.
  • LSM-Tree: The write amplification is limited to a log structure that appends the new index entry in memory. Those are flushed and compacted in the background, typically resulting in a few sorted files in one or multiple levels. The read amplification results from merging those sorted runs.

SQL databases perform frequent reading operations. Even a simple insert operation requires an index read to detect duplicate keys.

In the past, random reads were slow because they had to wait for the mechanical disk rotation (HDD). To achieve predictable performance on reads, databases designed at that time used B-trees. The higher levels of the B-tree could remain in cache, in shared memory buffers, and random reads were limited by the B-tree depth.

Today, with flash memory (SSD), random reads have low and predictable latency. It is now acceptable to do more reads to limit the write amplification.

Lot of modern databases use LSM trees, where writes are fast and appended in the first level, which is a memory table protected by WAL (written sequentially). To limit the space in memory, they are flushed to SST files and must be compacted in the background to limit the read amplification. This is done by reading the LSM tree (random reads) and writing to new SST files (sequential writes). LSM trees became popular, especially with RocksDB and derived databases, because they fit the performance of SSD storage where random reads are fast, but random writes should be avoided.

B-Tree write amplification with Oracle

Let's take an example to illustrate how writes to B-trees can be slow and unpredictable. I create the following table in Oracle 19c Autonomous Database:

create table demo ( 
 id raw(16) default sys_guid() primary key
 , ts timestamp default current_timestamp
 , reads int
 , writes int
);
Enter fullscreen mode Exit fullscreen mode

I will insert one million rows into this table and count the number of reads (db block gets for the current version and consistent gets for the MVCC version) and writes (db block changes). Rather than inserting random rows and observing the statistics, I'm inserting the statistics themselves to analyze later.

Here is what I've run in SQL Developer with a repeat command to insert one million rows that hold my session statistics about block read and writes:

insert into demo (reads, writes)
select bg+cg , bc from ( 
 select name,value from v$mystat 
 join v$statname using(statistic#) 
) pivot ( 
 sum(value) for name in ('db block changes' bc,'db block gets' bg
 , 'consistent gets' cg
 ,'redo size' rs) 
);
repeat 999999 0.001
commit;
Enter fullscreen mode Exit fullscreen mode

Remember to commit before losing your transaction after the autonomous idle timeout, as SQL Developer disobeys auto-commit with repeat.

I use this simple query to calculate the delta from the cumulative statistics, group on the number of reads and writes, and show the sum:

select count(*), reads+writes, reads, writes from (
select id
 , reads-lag(reads)over(order by ts) reads
 , writes-lag(writes)over(order by ts) writes
 from demo
) group by reads, writes order by reads+writes asc
;
   COUNT(*)    READS+WRITES    READS    WRITES
___________ _______________ ________ _________
    162,349              10        6         4
    776,173              11        7         4
          9              12        8         4
      4,711              12        7         5
          9              13        9         4
        588              13        7         6
     25,590              13        8         5
          1              14       10         4
      2,935              14        8         6
     17,414              14        9         5
        471              15        9         6
         11              15        8         7
      4,934              15       10         5
         98              16        9         7
        520              16       10         6
         65              17       10         7
         64              17       11         6
          4              18       11         7
...
         12             159       98        61
          1             160       98        62
          1             161      100        61
          3             169      120        49
          1             172      122        50
          1             173      126        47
          4             181      129        52
          1             182      132        50
          2             185      134        51
          1             186      134        52
          3             197      143        54
          1             199      138        61
          1             221      174        47
          1             709      705         4
          1             736      732         4
          1              

100 rows selected.
Enter fullscreen mode Exit fullscreen mode

There are two outcomes from it:

  • the number of logical reads an writes are unpredictable (but yon don't see that in general as they happen on shared buffers in memory).
  • The majority of inserts had to read and write eleven blocks. Some will be in the local memory cache on a monolithic system but will involve network latency on a distributed system.
  • a few outliers have to read hundreds of blocks (this is not only the index blocks but also the related rollback segments).

The reason is that inserting a new entry in a B-Tree

  • has to find the leaf block where the new value belongs, going from the root, through branches, to the leaf
  • when there is not enough space in the leaf, it has to split it into two blocks and update the branch above. There may be no space in the branch, and the same has to be done for it.
  • those changes generate undo information to allow consistent read

The general case can also be verified with a single-row insert:

DEMO@adb_tp> set autotrace on
Autotrace Enabled
Shows the execution plan as well as statistics of the statement.

DEMO@adb_tp> insert into demo (reads, writes) 
             /*+ gather_plan_statistics */
             values(null,null)
;
1 row inserted.

PLAN_TABLE_OUTPUT
_____________________________________________________________________________________
SQL_ID  0djxp9h81cby7, child number 0
-------------------------------------
insert into demo (reads, writes)               /*+
gather_plan_statistics */              values(null,null)


----------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |      1 |      0 |00:00:00.01 |       6 |
|   1 |  LOAD TABLE CONVENTIONAL | DEMO |      1 |      0 |00:00:00.01 |       6 |
----------------------------------------------------------------------------------
...
Statistics
-----------------------------------------------------------
               5  db block changes
               6  db block gets
               6  db block gets from cache
               5  db block gets from cache (fastpath)
           49152  logical read bytes from cache
               3  redo entries
             664  redo size
               6  session logical reads
             160  undo change vector size
Enter fullscreen mode Exit fullscreen mode

48KB have been read to insert a 30 bytes row in a table with no secondary indexes.

In addition to read and write amplification, B-Tree index blocks are often between 100% full and 50% empty because they split when they are full, leading to an average of at least 25% free space wasted in disk and memory.

LSM-Tree with YugabyteDB

I used the same approach on YugabyteDB, which uses PostgreSQL for the SQL processing and RocksDB for the storage:

create extension if not exists pgcrypto;
create table demo (
 id uuid default gen_random_uuid() primary key
 , ts timestamptz default clock_timestamp()
 , reads int
 , writes int
)
-- YugabyteDB specific: one tablet to make it simpler 
split into 1 tablets
;
Enter fullscreen mode Exit fullscreen mode

The cloud-native database exposes the storage metrics on a JSON endpoint on port 9000. I use a quick and dirty BASH and AWK script to get the metrics in a format that can be imported by COPY:

copy demo(reads,writes) from program $PROGRAM$
bash <<'BASH'
exec 5<>/dev/tcp/10.0.0.61/9000
awk '
/"table_name"/{n=0;t=$NF;gsub(/[",]/,"",t)}
t="demo" && /"name":/{n=$2;gsub(/[",]/,"",n)}
t="demo" && /"value": [^0]/{s[n]=s[n]+$NF}
END{printf "%s\t%s\n",s["rocksdb_number_db_seek"],s["rows_inserted"]
}
' <&5 & 
printf "GET /metrics?metrics=rows_inserted,rocksdb_number_db HTTP/1.0\r\n\r\n" >&5 
exit 0
BASH
$PROGRAM$;
\watch c=999999 0.001
Enter fullscreen mode Exit fullscreen mode

YugabyteDB doesn't work on blocks but reads and writes operations for table rows and index entries, which are replicated as Raft log and stored as RocksDB key values. Those operations are logical like the block reads and writes I measured in Oracle. They may operate in memory, be batched when sent through the network, and be fast, but they are the best representation of database work on data structures. The reads are the RocksDB seek to find the key in the LSM-Tree. The writes are the RocksDB keys inserted. I do not count the RocksDB next, which is equivalent to reading another row in a block.

I used the same query to list the number of reads and writes required for each insert, and aggregate the count.

select count(*), reads+writes, reads, writes from (
select id
 , reads-lag(reads)over(order by ts) reads
 , writes-lag(writes)over(order by ts) writes
 from demo
) as delta_values
group by reads, writes 
order by reads+writes asc;

 count  | ?column? | reads | writes
--------+----------+-------+--------
 994447 |        3 |     2 |      1
   3728 |       47 |     2 |     45
   1823 |       91 |     2 |     89
      1 |          |       |
(4 rows)
Enter fullscreen mode Exit fullscreen mode

There is no write amplification. The majority of inserts did only two reads and one write. The table row value, which is the same as the primary key index entry, is appended to the LSM-Tree. Less than one percent have written more than one RocksDB key-value, and it may not even be my session as I've read the global statistics.

As I did with Oracle, I can run a single insert and look at the statistics from explain analyze and my ybwr:

yugabyte=# explain (analyze, dist, costs off, summary off)
insert into demo (reads, writes)
             values(null,null)
;
                        QUERY PLAN
----------------------------------------------------------
 Insert on demo (actual time=0.076..0.076 rows=0 loops=1)
   ->  Result (actual time=0.017..0.017 rows=1 loops=1)
         Storage Table Write Requests: 1
(3 rows)

yugabyte=# execute snap_table;

 I-seek | I-next | I-prev | R-seek | R-next | R-prev | insert | dbname relname tablet (L)eader node | (I)ntents (R)regular
--------+--------+--------+--------+--------+--------+--------+------------------------------------------------------------
      4 |      4 |        |        |        |        |      1 | yugabyte demo hash_split: [0x0000, 0xFFFF]   10.0.0.62
      4 |      4 |        |        |        |        |      1 | yugabyte demo hash_split: [0x0000, 0xFFFF]   10.0.0.63
      6 |      4 |        |      2 |        |        |      1 | yugabyte demo hash_split: [0x0000, 0xFFFF] L 10.0.0.61

Enter fullscreen mode Exit fullscreen mode

This matches exactly what I've seen above with 2 reads and 1 write but let's explain all these. I'm running a Replication Factor 3 database and that's why you see 1 write (RocksDB insert) on each node. When gathering the statistics, I gathered only from one node where the Raft leader was (10.0.0.61:9000). That's also why I've created a single tablet table on a 3 nodes cluster, to make it easier. Those are for the RegularDB, the final destination for rows.

I didn't add the IntentsDB, the transaction intents, when counting the reads because for small transactions they are all together in memory, and they include the background cleanup of provisional records after commit, which I didn't count for the B-Tree.

I didn't count the writes on the replicas. YugabyteDB waits for one of them to acknowledge. With Oracle, you would have two standby databases applying the same db block change when applying the redo stream, and wait for the commit to be acknowledged.

Space amplification in Oracle Heap Table + B-Tree

I check the size of the table and index:

DEMO@adb_tp> info demo
...
Indexes
INDEX_NAME           UNIQUENESS    STATUS    FUNCIDX_STATUS    COLUMNS
____________________ _____________ _________ _________________ __________
DEMO.SYS_C0036334    UNIQUE        VALID                       ID

DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='DEMO';

DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
50M

DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='SYS_C0036334';

DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
29M

Enter fullscreen mode Exit fullscreen mode

The heap table allocated 50MB and the B-Tree Index 29M

Space amplification in YugabyteDB LSM-Tree

The whole table is stored in the primary key in YugabyteDB:

yugabyte=# select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 178 MB
(1 row)
Enter fullscreen mode Exit fullscreen mode

This size includes the WAL but I didn't count the redo logs for Oracle. To compare, I check the size of the SST files from the console:

Total: 177.90M
Consensus Metadata: 1.5K
WAL Files: 128.01M
SST Files: 49.89M
SST Files Uncompressed: 88.91M
Enter fullscreen mode Exit fullscreen mode

The inserted data takes 50MB on disk

B-Tree read amplification with Oracle

Let's query a single non existing key:

DEMO@adb_tp> set autotrace on

DEMO@adb_tp> select * from demo
             where id=hextoraw('00000000000000000000000000000000')
;

no rows selected

PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________________
SQL_ID  a139wcm6m7vu4, child number 0
-------------------------------------
select * from demo where id=hextoraw('00000000000000000000000000000000')

Plan hash value: 2095372258

------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |      1 |        |      0 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEMO         |      1 |      1 |      0 |00:00:00.01 |       3 |
|*  2 |   INDEX UNIQUE SCAN         | SYS_C0036334 |      1 |      1 |      0 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID"=HEXTORAW('00000000000000000000000000000000'))


Statistics
-----------------------------------------------------------
               1  DB time
            7733  RM usage
               3  Requests to/from client
               3  consistent gets
               3  consistent gets examination
               3  consistent gets examination (fastpath)
               3  consistent gets from cache
              12  non-idle wait count
               2  opened cursors cumulative
               1  opened cursors current
               1  pinned cursors current
              28  process last non-idle time
               3  session logical reads
               3  user calls
Enter fullscreen mode Exit fullscreen mode

This has read three blocks: 1 B-Tree root + 1 branch + 1 leaf.

LSM-Tree read amplification with YugabyteDB

Let's do the same with YugabyteDB:

yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from demo where id='00000000-0000-0000-0000-000000000000';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo (actual time=0.539..0.539 rows=0 loops=1)
   Index Cond: (id = '00000000-0000-0000-0000-000000000000'::uuid)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.447 ms
   Metric rocksdb_block_cache_hit: 1.000
   Metric rocksdb_block_cache_filter_hit: 1.000
   Metric rocksdb_block_cache_bytes_read: 65477.000
   Metric rocksdb_bloom_filter_useful: 1.000
   Metric rocksdb_number_db_seek: 1.000
   Metric rocksdb_block_cache_multi_touch_hit: 1.000
   Metric rocksdb_block_cache_multi_touch_bytes_read: 65477.000
   Metric ql_read_latency: sum: 46.000, count: 1.000
(12 rows)
Enter fullscreen mode Exit fullscreen mode

This was only 1 seek into the LSM tree (rocksdb_number_db_seek: 1.000).

To summarize

This measures the read and writes on the index when inserting a row in a SQL table. Different database architecture have different units:

  • To find where to read or write, B-Tree reads a few blocks, from root, though branches, to leaf. In an LSM-Tree, the equivalent is a seek operation to find the key in the SST Files and MemTables
  • To read from that point, B-Tree blocks are pinned and variable size rows are read. In an LSM-Tree, the equivalent is a next operation. With MVCC databases, this also involves reading some past versions. Those are in other blocks in Oracle (rollback segments) but are adjacent in YugabyteDB (within the next key).
  • To write, B-Tree inserts the new entry in the block, and may have to make more space by splitting it and updating the branches. In an LSM-Tree, writes are simply appended to the MemTable

Except with intensive workloads, you will not find huge differences on the response time because each database has been optimized for its architecture. Databases using B-Tree, like Oracle, keep most of the blocks involved in the local shared memory. Databases using LSM-Tree, like YugabyteDB, have enhanced RocksDB to avoid reading from all SST files (see Enhancing RocksDB for Speed & Scale).

For Distributed SQL, LSM-Tree goes well with Raft replication because synchronizing a log is simpler than synchronizing shared buffers.

LSM-Tree have other advantages as they write immutable files, flushed from MemTable, or merged by compaction. This reduces the risk of block corruption as they are never updated, and allows easy snapshot without an additional flashback logs.

If you are not familiar with LSM-Tree, @denismagda explains it when ordering pizzas:

How Java Litters Beyond the Heap: Distributed Databases | FoojayHow Java Litters Beyond the Heap: Distributed Databases | Foojay

Let’s create a Java application with YugabyteDB distributed database to see if garbage is generated in response to application requests.

favicon foojay.io
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .