How to Optimize Indexing for Distributed One-to-Many Join With Pagination

Franck Pachot - Dec 5 '23 - - Dev Community

I have two tables named "one" and "many" and often need to join them. You may try to convince me that I should collocate (like in Aurora Limitless, or Citus, or interleaving like in Spanner) them because you believe that "joins don't scale". However, as a fan of SQL and relational databases with logical-physical independence, I prefer to distribute them independently for better agility and scalability. Developers also prefer to focus on business logic rather than thinking about shard locality and trying to control it.

Before any premature optimization, let's see if we can have an efficient and scalable execution plan with the right indexes.

I create the test case on YugabyteDB:

drop table if exists one, many;
create extension if not exists pgcrypto;

create table one (
 primary key (one_id)
 , one_id uuid default gen_random_uuid()
 , category text
 , created_at timestamptz default clock_timestamp()
);

insert into  one (category)
  with categories(category) as (
   values ('🍎'), ('🍊'), ('🍌'), ('πŸ’'), ('🍐') 
  ) select category from generate_series(1,100000), categories
  -- this loads 2.5 million rows in "many" ^^^^^^ and 500k rows in "one"
;

create table many( 
 primary key (many_id)
 , many_id uuid default gen_random_uuid()
 , one_id uuid not null references one(one_id)
 , value float
);

insert into  many(one_id , value) 
 select one_id , random()
 from one  cross join generate_series(1,5)
;
Enter fullscreen mode Exit fullscreen mode

SQL Query

I join the tables with a left outer join (I want all rows from "one" and the details from two) but I'm interested only by one category, and the Top-42 most recent ones:

select one.category, one.created_at, many.value
 from one
 left outer join many using(one_id)
 where one.category='🍌'
 order by one.created_at desc
 limit 42
;
Enter fullscreen mode Exit fullscreen mode

To be scalable, the response time should not depend on the size of the tables but only on the size of the result, which is limited to 42 rows here. I am aiming for a time complexity of O(1), which is achievable with the normalized model I have created by implementing the appropriate indexes.

Indexing

I decompose the access patterns for each table by filtering, ordering, and columns to retrieve.

From table "one":

  1. filter on category with an equality: my index will start with category HASH
  2. read the index entries in the created_at descending order: my index with add category DESC
  3. return the one_id that will be used to join to "many": I add one_id at the end of the key (it could also be in INCLUDE but as I don't expect to modify the primary key, and this index is not unique, it can fit in the index key)

From table "many":

  1. access to one_id with an equality, the join predicate: my index will start with one_id HASH
  2. return the value: to avoid a hop to the table I include it in my index entry: INCLUDING (value)

Here are my indexes that cover those access patterns, probably useful for many other queries as well:


-- Access to "one" by category, ordered by "created_at"
create index one_category_hash_created_desc_id
 on one(category, created_at desc, one_id)
;

-- Access to "many" by its foreign key to "one"
create index many_one_asc
 on many ( one_id ) include ( value )
;

Enter fullscreen mode Exit fullscreen mode

Execution Plan

I check the scalability from the execution plan:

psql (16.0, server 11.2-YB-2.19.3.0-b0)

yugabyte=# explain (costs off)
select one.category, one.created_at, many.value  from one
 left outer join many using(one_id)
 where one.category='🍌'
 order by one.created_at desc
 limit 42;

                                 QUERY PLAN
----------------------------------------------------------------------------
 Limit
   ->  Nested Loop Left Join
         ->  Index Only Scan using one_category_hash_created_desc_id on one
               Index Cond: (category = '🍌'::text)
         ->  Index Only Scan using many_one_asc on many
               Index Cond: (one_id = one.one_id)


Enter fullscreen mode Exit fullscreen mode

I know the time complexity for it. With no Seq Scan and no Hash or Sort, I'll read only what I need: one range access to table "one" that returns no more than 42 rows (my LIMIT), and 42 point access to table "many".

Execution Metrics

I can verify that on my sample dataset with 500000 rows in "one" and 2500000 rows in "many". I want to verify that I don't read all those rows, but at maximum 42 ones:

psql (16.0, server 11.2-YB-2.19.3.0-b0)

yugabyte=# explain (costs off, analyze, dist, debug, summary off)
select one.category, one.created_at, many.value  from one
 left outer join many using(one_id)
 where one.category='🍌'
 order by one.created_at desc
 limit 42;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Limit (actual time=1.950..5.357 rows=42 loops=1)
   ->  Nested Loop Left Join 
        (actual time=1.948..5.342 rows=42 loops=1)
         ->  Index Only Scan using one_category_hash_created_desc_id on one 
              (actual time=0.981..0.997 rows=9 loops=1)
               Index Cond: (category = '🍌'::text)
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.845 ms
               Metric rocksdb_number_db_seek: 1.000
               Metric rocksdb_number_db_next: 43.000
               Metric rocksdb_number_db_seek_found: 1.000
               Metric rocksdb_number_db_next_found: 43.000
               Metric rocksdb_iter_bytes_read: 3996.000
               Metric docdb_keys_found: 43.000
         ->  Index Only Scan using many_one_asc on many 
              (actual time=0.464..0.466 rows=5 loops=9)
               Index Cond: (one_id = one.one_id)
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.414 ms
               Metric rocksdb_number_db_seek: 1.000
               Metric rocksdb_number_db_next: 5.000
               Metric rocksdb_number_db_seek_found: 1.000
               Metric rocksdb_number_db_next_found: 5.000
               Metric rocksdb_iter_bytes_read: 522.667
               Metric docdb_keys_found: 5.000

 Planning Time: 0.174 ms
 Execution Time: 5.178 ms
 Storage Read Requests: 10
 Storage Read Execution Time: 4.346 ms
 Peak Memory Usage: 77 kB

Enter fullscreen mode Exit fullscreen mode

On "one" we have seek once (Metric rocksdb_number_db_seek: 1.000) to the start of category = '🍌'::text and have read 42 rows from there just by going to the next entry (Metric rocksdb_number_db_next: 43.000). This was only one Storage Index Read Requests as we fetch by batch.

Only 9 of those rows (rows=9) were necessary to reach the limit of 42 (because I have 5 matching rows in "many" for each one. This means that we do 9 reads to "many" (loops=9). All other statistics there are per-loop.

Each of those read (Storage Index Read Requests: 1) had to go to the index entry (Metric rocksdb_number_db_seek: 1.000) and read the 5 matching rows from there (Metric rocksdb_number_db_next: 5.000)

That's all. In total: 10 Storage Read Requests. Even if they are on different data centers, that's 10 milliseconds.

Time complexity: O(1)

Of course you can test with more data, but databases are not magic and all the time complexity is explained. If your one-to-many cardinality is N rows (was 5 in my example) and you want to read R rows (was 42 in my example), with a latency of L (1 millisecond between Availability Zones for example in AWS regions), then the response time is: L x ceil( R / N ). If you fetch more than one thousand rows, you may see more reads as there's a maximum fetch size, but this use case, with pagination, is for a small number of rows that fits in the screen of an OLTP application. You don't see the table's number of rows in the formula: this is O(1) time complexity.

You can easily test it by adding more rows:

with one as (
 insert into  one (category)
 select '🍌' from generate_series(1,1000)
 returning one_id
) insert into  many(one_id , value)
 select one_id , random()
 from one cross join generate_series(1,100)
\; 
-- stop when reaching 2GB
select random()/0 where 
 (pg_table_size('many'::regclass)+pg_table_size('one'::regclass))
 /1024/1024/1024 >=2
-- run in a loop in psql
\watch 0.01
-- count the rows
select count(*) from one;
select count(*) from many;

explain (costs off, analyze)
select one.category, one.created_at, many.value  from one
 left outer join many using(one_id)
 where one.category='πŸ’'
 order by one.created_at desc 
 limit 42;
Enter fullscreen mode Exit fullscreen mode

Execution Time: 5.577 ms
Even with 10x more rows in the table I join to, the response time remains unchanged. With the appropriate indexes, joins scale.

To summarize

With the right indexes, a join stays in milliseconds, whatever the size of the tables that are joined.

Finally, indexing is not black magic:

  • you have a partition key (HASH) that can be used with equality predicate from WHERE or JOIN clause
  • you have a sort key that can be used for inequalities in WHERE clause as well as for the ORDER BY clause
  • and you have additional columns that your SELECT requires.

Check the execution plan. Your goal, for a performance critical use-case, is to read from an Index Only Scan for each table, start the join with the smallest one, and eliminate all unnecessary rows there before any Sort, Hash, or Group operation. Here, even without ANALYZE and Cost Based Optimizer, the query planner picked the right plan.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .