Clustering Factor for YugabyteDB Index Scan: correlation between secondary indexes and the primary key

Franck Pachot - Aug 6 - - Dev Community

In the previous blog post, I looked at the block cache metrics for an Index-Only Scan. I was reading fifty thousand rows from a covering index. In this article, I'll use a non-covering index so that, for each index entry, the row from the table must be read.

I created the following table where the primary key, generated by a sequence, and the value, from the clock timestamp, are constantly increasing.

create table demo (
   primary key (id asc)
 , id    bigserial 
 , value timestamptz default clock_timestamp()
 , rand  float default random()
);
create index demo_value_desc on demo ( value desc );
create index demo_value_asc  on demo ( value asc  );
insert into demo 
 select from generate_series(1,10000)
\watch c=100 i=0.01
Enter fullscreen mode Exit fullscreen mode

I'm interested in the rocksdb_block_cache_data_hit metrics, which are the number of 32KB blocks read from the index and table and the corresponding volume rocksdb_block_cache_bytes_read. I'll compare this to the volume read by the iterator, rocksdb_iter_bytes_read.

Index Only Scan

I'll query the first 50000 rows when ordered by the ascending timestamp value, which is also the ascending primary key:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select value from demo
 where value < now()
 order by value asc limit 50000;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit (actual time=1.465..35.837 rows=50000 loops=1)
   ->  Index Only Scan using demo_value_asc on demo (actual time=1.461..25.456 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 9.828 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_hit: 679.000
         Metric rocksdb_block_cache_index_hit: 196.000
         Metric rocksdb_block_cache_data_hit: 287.000
         Metric rocksdb_block_cache_bytes_read: 11606891.000
         Metric rocksdb_number_db_seek: 49.000
         Metric rocksdb_number_db_next: 50225.000
         Metric rocksdb_number_db_seek_found: 49.000
         Metric rocksdb_number_db_next_found: 50225.000
         Metric rocksdb_iter_bytes_read: 3203108.000
         Metric rocksdb_block_cache_multi_touch_hit: 679.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 11606891.000
         Metric docdb_keys_found: 50225.000
         Metric ql_read_latency: sum: 22961.000, count: 49.000
(20 rows)
Enter fullscreen mode Exit fullscreen mode

This had read 3GB (rocksdb_iter_bytes_read: 3203108) from the iterator, and 287 blocks were read from the cache (rocksdb_block_cache_data_hit: 287, rocksdb_block_cache_bytes_read: 11606891), which is 11GB.

I used an Index-Only Scan to get an idea of the size of the reads from the index. I'll select more columns for an Index Scan, and the additional reads will be in the table.

Primary key ASC, order by value ASC

I'll query the first 50000 rows when ordered by the ascending timestamp value, which is also the ascending primary key:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select * from demo
 where value < now()
 order by value asc limit 50000;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit (actual time=6.399..218.977 rows=50000 loops=1)
   ->  Index Scan using demo_value_asc on demo (actual time=6.396..209.168 rows=50000 loops=1)
         Index Cond: (value < now())
         Storage Table Read Requests: 49
         Storage Table Read Execution Time: 172.483 ms
         Storage Table Rows Scanned: 50176
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 1.300 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_hit: 1046.000
         Metric rocksdb_block_cache_index_hit: 294.000
         Metric rocksdb_block_cache_data_hit: 458.000
         Metric rocksdb_block_cache_bytes_read: 17245297.000
         Metric rocksdb_number_db_seek: 50225.000
         Metric rocksdb_number_db_next: 100401.000
         Metric rocksdb_number_db_seek_found: 50225.000
         Metric rocksdb_number_db_next_found: 100401.000
         Metric rocksdb_iter_bytes_read: 9998178.000
         Metric rocksdb_block_cache_single_touch_hit: 171.000
         Metric rocksdb_block_cache_single_touch_bytes_read: 5593326.000
         Metric rocksdb_block_cache_multi_touch_hit: 875.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 11651971.000
         Metric docdb_keys_found: 100401.000
         Metric ql_read_latency: sum: 149377.000, count: 98.000
(24 rows)
Enter fullscreen mode Exit fullscreen mode

This had read 9.5GB (rocksdb_iter_bytes_read: 9998178) from the iterator, and 458 blocks were read from the cache (rocksdb_block_cache_data_hit: 458, rocksdb_block_cache_bytes_read: 17245297) which is 16GB.

Doubling the number of blocks read from the cache when reading from the table in addition to the index looks correct. There was one seek per index row (rocksdb_number_db_seek_found: 50225), but not all were data block reads.

Primary key ASC, order by value DESC

I do the same but in descending order. It is still an Index Scan (not a backward one) because I have a descending index. The difference is that I read in the opposite order of the primary key:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select * from demo
 where value < now()
 order by value desc limit 50000;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit (actual time=7.334..303.677 rows=50000 loops=1)
   ->  Index Scan using demo_value_desc on demo (actual time=7.331..293.793 rows=50000 loops=1)
         Index Cond: (value < now())
         Storage Table Read Requests: 49
         Storage Table Read Execution Time: 258.097 ms
         Storage Table Rows Scanned: 50176
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 1.304 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_hit: 101136.000
         Metric rocksdb_block_cache_index_hit: 294.000
         Metric rocksdb_block_cache_data_hit: 100548.000
         Metric rocksdb_block_cache_bytes_read: 885045938.000
         Metric rocksdb_number_db_seek: 50225.000
         Metric rocksdb_number_db_next: 100401.000
         Metric rocksdb_number_db_seek_found: 50225.000
         Metric rocksdb_number_db_next_found: 100400.000
         Metric rocksdb_iter_bytes_read: 9948924.000
         Metric rocksdb_block_cache_multi_touch_hit: 101136.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 885045938.000
         Metric docdb_keys_found: 100401.000
         Metric ql_read_latency: sum: 233981.000, count: 98.000
(22 rows)
Enter fullscreen mode Exit fullscreen mode

The number of seek is still the same (rocksdb_number_db_seek: 50225), but the read has increased.

This had read 9.5GB (rocksdb_iter_bytes_read: 9948924) from the iterator and 100548 blocks from the cache (rocksdb_block_cache_data_hit: 100548, rocksdb_block_cache_bytes_read: 885045938), which is 844GB.

This demonstrates that reading many rows with an Index Range Scan is more efficient when the primary key is in the same order as the read index entries. The order of the primary key is beneficial for accessing by the primary key and also by the secondary indexes. In YugabyteDB, if you are reading a significant number of rows from an Index Scan, it is advisable to include all columns in the index to enable an index-only scan.


Solution 1: Covering Index

With a covering index, and an Index Only Scan, the index entries come in order and there's no need to read the table rows:

yugabyte=# drop index demo_value_desc;
DROP INDEX
yugabyte=# drop index demo_value_asc;
DROP INDEX

yugabyte=# create index demo_value_id_rand on demo (value desc, id)
           include (rand);
CREATE INDEX

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select id from demo
 where value < now()
 order by value desc limit 50000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Limit (actual time=4.395..41.860 rows=50000 loops=1)
   ->  Index Only Scan using demo_value_id_rand on demo (actual time=4.384..32.231 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 11.976 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_number_db_seek: 49.000
         Metric rocksdb_number_db_next: 50225.000
         Metric rocksdb_number_db_seek_found: 49.000
         Metric rocksdb_number_db_next_found: 50225.000
         Metric rocksdb_iter_bytes_read: 3607244.000
         Metric docdb_keys_found: 50225.000
         Metric ql_read_latency: sum: 21958.000, count: 49.000
(14 rows)

Enter fullscreen mode Exit fullscreen mode

Solution 2: WITH clause to read a range of primary key

We have some information about our data that the query planner doesn't know. Since both the "value" timestamp and the "id" primary key are increasing, a small range of the "value" timestamp matches a small range of the "id" primary key. So, I can use a Common Table Expression to retrieve the "id" from an index that covers "id", and when joining to the table, I can narrow down the scan to a range of "id".

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
/*+ leading((ids demo)) */
with ids as (
 select id from demo
 where value < now()
 order by value desc limit 50000
), demo as (
 select * from demo
 where demo.id between (select min(id) from ids)
                   and (select max(id) from ids)
 order by value desc limit 50000
)
select * from ids join demo using (id)
order by value desc limit 50000
;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Limit (actual time=236.164..252.514 rows=50000 loops=1)
   CTE ids
     ->  Limit (actual time=1.268..31.875 rows=50000 loops=1)
           ->  Index Only Scan using demo_value_id_rand on demo demo_1 (actual time=1.267..20.841 rows=50000 loops=1)
                 Index Cond: (value < now())
                 Heap Fetches: 0
                 Storage Index Read Requests: 49
                 Storage Index Read Execution Time: 1.091 ms
                 Storage Index Rows Scanned: 50176
                 Metric rocksdb_number_db_seek: 49.000
                 Metric rocksdb_number_db_next: 50225.000
                 Metric rocksdb_number_db_seek_found: 49.000
                 Metric rocksdb_number_db_next_found: 50225.000
                 Metric rocksdb_iter_bytes_read: 3607244.000
                 Metric docdb_keys_found: 50225.000
                 Metric ql_read_latency: sum: 19107.000, count: 49.000
   CTE demo
     ->  Limit (actual time=128.020..143.166 rows=50000 loops=1)
           InitPlan 2 (returns $1)
             ->  Aggregate (actual time=61.111..61.111 rows=1 loops=1)
                   ->  CTE Scan on ids ids_1 (actual time=0.001..50.824 rows=50000 loops=1)
           InitPlan 3 (returns $2)
             ->  Aggregate (actual time=14.839..14.839 rows=1 loops=1)
                   ->  CTE Scan on ids ids_2 (actual time=0.001..5.571 rows=50000 loops=1)
           ->  Sort (actual time=128.017..132.636 rows=50000 loops=1)
                 Sort Key: demo_2.value DESC
                 Sort Method: external merge  Disk: 1672kB
                 ->  Index Scan using demo_pkey on demo demo_2 (actual time=77.065..104.354 rows=50000 loops=1)
                       Index Cond: ((id >= $1) AND (id <= $2))
                       Storage Table Read Requests: 49
                       Storage Table Read Execution Time: 12.507 ms
                       Storage Table Rows Scanned: 50000
                       Metric rocksdb_number_db_seek: 49.000
                       Metric rocksdb_number_db_next: 50048.000
                       Metric rocksdb_number_db_seek_found: 49.000
                       Metric rocksdb_number_db_next_found: 50047.000
                       Metric rocksdb_iter_bytes_read: 3592391.000
                       Metric docdb_keys_found: 50048.000
                       Metric ql_read_latency: sum: 21235.000, count: 49.000
   ->  Sort (actual time=236.160..241.652 rows=50000 loops=1)
         Sort Key: demo.value DESC
         Sort Method: external merge  Disk: 1672kB
         ->  Hash Join (actual time=181.161..214.175 rows=50000 loops=1)
               Hash Cond: (ids.id = demo.id)
               ->  CTE Scan on ids (actual time=1.272..9.795 rows=50000 loops=1)
               ->  Hash (actual time=179.869..179.869 rows=50000 loops=1)
                     Buckets: 65536 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 3247kB
                     ->  CTE Scan on demo (actual time=128.026..165.218 rows=50000 loops=1)
(48 rows)
Enter fullscreen mode Exit fullscreen mode

When using this method, a filter is applied to the inner table to retrieve only the rows corresponding to the join, with a few false positives that the Merge or Hash Join will later remove. It's important to note that unlike a Nested Loop join, this method does not maintain the outer table's order so that an additional sort will be required.

Solution 3: choose the best order for the primary key

My example is quite common: the primary key is an increasing sequence, and the application often queries the top-latest rows inserted from a secondary index on a timestamp. For such a use case, defining the primary key as DESC rather than ASC makes sense.

drop table if exists demo;
create table demo (
   primary key (id desc)
 , id    bigserial 
 , value timestamptz default clock_timestamp()
 , rand  float default random()
);
create index demo_value_desc on demo ( value desc );
insert into demo 
 select from generate_series(1,10000)
\watch c=100 i=0.01
Enter fullscreen mode Exit fullscreen mode

Here, the primary key order matches the order by value desc index entries:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select id from demo
 where value < now()
 order by value desc limit 50000;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit (actual time=5.294..212.153 rows=50000 loops=1)
   ->  Index Scan using demo_value_desc on demo (actual time=5.291..201.974 rows=50000 loops=1)
         Index Cond: (value < now())
         Storage Table Read Requests: 49
         Storage Table Read Execution Time: 165.126 ms
         Storage Table Rows Scanned: 50176
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 1.292 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_hit: 882.000
         Metric rocksdb_block_cache_index_hit: 294.000
         Metric rocksdb_block_cache_data_hit: 294.000
         Metric rocksdb_block_cache_bytes_read: 11809833.000
         Metric rocksdb_number_db_seek: 50225.000
         Metric rocksdb_number_db_next: 100401.000
         Metric rocksdb_number_db_seek_found: 50225.000
         Metric rocksdb_number_db_next_found: 100401.000
         Metric rocksdb_iter_bytes_read: 9747900.000
         Metric rocksdb_block_cache_single_touch_hit: 196.000
         Metric rocksdb_block_cache_single_touch_bytes_read: 6417677.000
         Metric rocksdb_block_cache_multi_touch_hit: 686.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 5392156.000
         Metric docdb_keys_found: 100401.000
         Metric ql_read_latency: sum: 141141.000, count: 98.000
(24 rows)

Enter fullscreen mode Exit fullscreen mode

This post explains that the index correlation or clustering factor is essential in YugabyteDB for Index Scan on a secondary index. It's important to note that in the example given, the difference in response time is insignificant, and reading fifty thousand rows was extreme. Therefore, premature optimization may not be necessary if it doesn't impact your performance criteria.

If you encounter such a problem, there are a few solutions:

  1. Add columns to your indexes so large range scans become Index Only Scans.
  2. The query planner doesn't know everything about your data and access patterns, but with SQL and maybe hints, you can shape the execution plan for something more efficient.
  3. If you use range sharding on a sequence, choose ASC or DESC based on the most critical large-range scans that perform an Index Scan. However, when the order of the generated primary key is not essential, you will generally use HASH to distribute the rows.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .