I have two tables named "one"
and "many"
and often need to join them. You may try to convince me that I should collocate (like in Aurora Limitless, or Citus, or interleaving like in Spanner) them because you believe that "joins don't scale". However, as a fan of SQL and relational databases with logical-physical independence, I prefer to distribute them independently for better agility and scalability. Developers also prefer to focus on business logic rather than thinking about shard locality and trying to control it.
Before any premature optimization, let's see if we can have an efficient and scalable execution plan with the right indexes.
I create the test case on YugabyteDB:
drop table if exists one, many;
create extension if not exists pgcrypto;
create table one (
primary key (one_id)
, one_id uuid default gen_random_uuid()
, category text
, created_at timestamptz default clock_timestamp()
);
insert into one (category)
with categories(category) as (
values ('π'), ('π'), ('π'), ('π'), ('π')
) select category from generate_series(1,100000), categories
-- this loads 2.5 million rows in "many" ^^^^^^ and 500k rows in "one"
;
create table many(
primary key (many_id)
, many_id uuid default gen_random_uuid()
, one_id uuid not null references one(one_id)
, value float
);
insert into many(one_id , value)
select one_id , random()
from one cross join generate_series(1,5)
;
SQL Query
I join the tables with a left outer join (I want all rows from "one"
and the details from two
) but I'm interested only by one category
, and the Top-42 most recent ones:
select one.category, one.created_at, many.value
from one
left outer join many using(one_id)
where one.category='π'
order by one.created_at desc
limit 42
;
To be scalable, the response time should not depend on the size of the tables but only on the size of the result, which is limited to 42 rows here. I am aiming for a time complexity of O(1), which is achievable with the normalized model I have created by implementing the appropriate indexes.
Indexing
I decompose the access patterns for each table by filtering, ordering, and columns to retrieve.
From table "one"
:
-
filter on
category
with an equality: my index will start withcategory HASH
- read the index entries in the
created_at
descending order: my index with addcategory DESC
-
return the
one_id
that will be used to join to"many"
: I addone_id
at the end of the key (it could also be in INCLUDE but as I don't expect to modify the primary key, and this index is not unique, it can fit in the index key)
From table "many"
:
- access to
one_id
with an equality, the join predicate: my index will start withone_id HASH
-
return the
value
: to avoid a hop to the table I include it in my index entry:INCLUDING (value)
Here are my indexes that cover those access patterns, probably useful for many other queries as well:
-- Access to "one" by category, ordered by "created_at"
create index one_category_hash_created_desc_id
on one(category, created_at desc, one_id)
;
-- Access to "many" by its foreign key to "one"
create index many_one_asc
on many ( one_id ) include ( value )
;
Execution Plan
I check the scalability from the execution plan:
psql (16.0, server 11.2-YB-2.19.3.0-b0)
yugabyte=# explain (costs off)
select one.category, one.created_at, many.value from one
left outer join many using(one_id)
where one.category='π'
order by one.created_at desc
limit 42;
QUERY PLAN
----------------------------------------------------------------------------
Limit
-> Nested Loop Left Join
-> Index Only Scan using one_category_hash_created_desc_id on one
Index Cond: (category = 'π'::text)
-> Index Only Scan using many_one_asc on many
Index Cond: (one_id = one.one_id)
I know the time complexity for it. With no Seq Scan and no Hash or Sort, I'll read only what I need: one range access to table "one"
that returns no more than 42 rows (my LIMIT), and 42 point access to table "many"
.
Execution Metrics
I can verify that on my sample dataset with 500000 rows in "one"
and 2500000 rows in "many"
. I want to verify that I don't read all those rows, but at maximum 42 ones:
psql (16.0, server 11.2-YB-2.19.3.0-b0)
yugabyte=# explain (costs off, analyze, dist, debug, summary off)
select one.category, one.created_at, many.value from one
left outer join many using(one_id)
where one.category='π'
order by one.created_at desc
limit 42;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (actual time=1.950..5.357 rows=42 loops=1)
-> Nested Loop Left Join
(actual time=1.948..5.342 rows=42 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one
(actual time=0.981..0.997 rows=9 loops=1)
Index Cond: (category = 'π'::text)
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.845 ms
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 43.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 43.000
Metric rocksdb_iter_bytes_read: 3996.000
Metric docdb_keys_found: 43.000
-> Index Only Scan using many_one_asc on many
(actual time=0.464..0.466 rows=5 loops=9)
Index Cond: (one_id = one.one_id)
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.414 ms
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 5.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 5.000
Metric rocksdb_iter_bytes_read: 522.667
Metric docdb_keys_found: 5.000
Planning Time: 0.174 ms
Execution Time: 5.178 ms
Storage Read Requests: 10
Storage Read Execution Time: 4.346 ms
Peak Memory Usage: 77 kB
On "one"
we have seek once (Metric rocksdb_number_db_seek: 1.000
) to the start of category = 'π'::text
and have read 42 rows from there just by going to the next entry (Metric rocksdb_number_db_next: 43.000
). This was only one Storage Index Read Requests
as we fetch by batch.
Only 9 of those rows (rows=9
) were necessary to reach the limit of 42 (because I have 5 matching rows in "many"
for each one. This means that we do 9 reads to "many"
(loops=9
). All other statistics there are per-loop.
Each of those read (Storage Index Read Requests: 1
) had to go to the index entry (Metric rocksdb_number_db_seek: 1.000
) and read the 5 matching rows from there (Metric rocksdb_number_db_next: 5.000
)
That's all. In total: 10 Storage Read Requests
. Even if they are on different data centers, that's 10 milliseconds.
Time complexity: O(1)
Of course you can test with more data, but databases are not magic and all the time complexity is explained. If your one-to-many cardinality is N
rows (was 5 in my example) and you want to read R
rows (was 42 in my example), with a latency of L
(1 millisecond between Availability Zones for example in AWS regions), then the response time is: L x ceil( R / N )
. If you fetch more than one thousand rows, you may see more reads as there's a maximum fetch size, but this use case, with pagination, is for a small number of rows that fits in the screen of an OLTP application. You don't see the table's number of rows in the formula: this is O(1) time complexity.
You can easily test it by adding more rows:
with one as (
insert into one (category)
select 'π' from generate_series(1,1000)
returning one_id
) insert into many(one_id , value)
select one_id , random()
from one cross join generate_series(1,100)
\;
-- stop when reaching 2GB
select random()/0 where
(pg_table_size('many'::regclass)+pg_table_size('one'::regclass))
/1024/1024/1024 >=2
-- run in a loop in psql
\watch 0.01
-- count the rows
select count(*) from one;
select count(*) from many;
explain (costs off, analyze)
select one.category, one.created_at, many.value from one
left outer join many using(one_id)
where one.category='π'
order by one.created_at desc
limit 42;
Even with 10x more rows in the table I join to, the response time remains unchanged. With the appropriate indexes, joins scale.
To summarize
With the right indexes, a join stays in milliseconds, whatever the size of the tables that are joined.
Finally, indexing is not black magic:
- you have a partition key (HASH) that can be used with equality predicate from WHERE or JOIN clause
- you have a sort key that can be used for inequalities in WHERE clause as well as for the ORDER BY clause
- and you have additional columns that your SELECT requires.
Check the execution plan. Your goal, for a performance critical use-case, is to read from an Index Only Scan for each table, start the join with the smallest one, and eliminate all unnecessary rows there before any Sort, Hash, or Group operation. Here, even without ANALYZE and Cost Based Optimizer, the query planner picked the right plan.