TL;DR: You can skip to the last execution plan to see the best solution (Order-Preserving Batched Nested Loop)
In the previous post, we have seen that in a Distributed SQL database, the response time is affected by the pagination LIMIT and the one-to-many cardinality. For larger paginations or when many rows do not match, especially with an inner join, the number of Nested Loops will increase, resulting in longer response times. To solve this issue, with YugabyteDB, you can use Batched Nested Loops. However, you may need to modify your paginating query to apply the limit to the driving table.
Note: I mentioned LIMIT, but in PostgreSQL and YugabyteDB, it is the same as FETCH FIRST ... ROWS ONLY, which is the SQL ANSI was to do it.
I'm building a similar example but in the category '🍓' with one million rows in table "one" there are no rows matching in table "many":
drop table if exists one, many;
create extension if not exists pgcrypto;
create table one (
primary key (one_id)
, one_id uuid default gen_random_uuid()
, category text
, created_at timestamptz default clock_timestamp()
);
create table many(
primary key (many_id)
, many_id uuid default gen_random_uuid()
, one_id uuid not null references one(one_id)
, value float
);
insert into one (category)
select '🍓' from generate_series(1,1000000)
;
-- Access to "one" by category, ordered by "created_at"
create index one_category_hash_created_desc_id
on one(category, created_at desc, one_id)
;
-- Access to "many" by its foreign key to "one"
create index many_one_asc
on many ( one_id ) include ( value )
;
I have not analyzed the tables and will use hints to guarantee the Nested Loop from "one" to "many".
Non-batched Nested Loops
I'm running a query that joins "one" to "many", sort the result on a column from "one", and limit to the first 1000 rows:
explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many)
*/
select one.category, one.created_at, many.value
from one
left outer join many using(one_id)
where one.category='🍓'
order by one.created_at desc
fetch first 10000 rows only;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (actual time=1.951..4494.646 rows=10000 loops=1)
-> Nested Loop Left Join (actual time=1.950..4489.391 rows=10000 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one
(actual time=1.241..35.741 rows=10000 loops=1)
Index Cond: (category = '🍓'::text)
Heap Fetches: 0
Storage Index Read Requests: 10
Storage Index Read Execution Time: 1.103 ms
-> Index Only Scan using many_one_asc on many
(actual time=0.431..0.431 rows=0 loops=10000)
Index Cond: (one_id = one.one_id)
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.394 ms
Execution Time: 4499.898 ms
Storage Read Requests: 10010
Storage Read Execution Time: 3943.079 ms
Storage Execution Time: 3943.079 ms
Peak Memory Usage: 132 kB
(22 rows)
This takes a long time because of the 10000 loops. If you try the same with an inner join instead of an outer join, this will be one million loops and given the latency in a distributed database, is not practicable.
Batched Nested Loops
I can batch those loops with Batched Nested Loop which I set to its maximum:
explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many)
Set( yb_bnl_batch_size 1024 )
*/
select one.category, one.created_at, many.value
from one
left outer join many using(one_id)
where one.category='🍓'
order by one.created_at desc
fetch first 10000 rows only;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Limit (actual time=6736.806..6739.998 rows=10000 loops=1)
-> Sort (actual time=6736.804..6737.940 rows=10000 loops=1)
Sort Key: one.created_at DESC
Sort Method: top-N heapsort Memory: 1247kB
-> YB Batched Nested Loop Left Join
(actual time=7.996..6450.373 rows=1000000 loops=1)
Join Filter: (one.one_id = many.one_id)
-> Index Only Scan using one_category_hash_created_desc_id on one
(actual time=1.282..511.405 rows=1000000 loops=1)
Index Cond: (category = '🍓'::text)
Heap Fetches: 0
Storage Index Read Requests: 977
Storage Index Read Execution Time: 1.905 ms
-> Index Only Scan using many_one_asc on many
(actual time=4.907..4.907 rows=0 loops=977)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 4.043 ms
Execution Time: 6741.948 ms
Storage Read Requests: 1954
Storage Read Execution Time: 3951.758 ms
Storage Execution Time: 3951.758 ms
Peak Memory Usage: 4480 kB
(26 rows)
This reduced the number of loops to loops=977
but it had to read too many rows: rows=1000000
. Reading all rows from the outer table is not scalable. The reason is that the Batched Nested Loop doesn't preserve the ordering (before #19589). Then, all rows had to be sorted before being able to apply the pagination limit.
Batched Nested Loop and LIMIT in subquery
The solution in this case is to write the query differently and apply the pagination to the driving table. For this I replace "one" by a subquery that contains all the filters (WHERE clause), and pagination (ORDER BY ... LIMIT):
explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many)
Set( yb_bnl_batch_size 1024 )
*/
select one.category, one.created_at, many.value
from (
select * from one
where one.category='🍓'
order by one.created_at desc
fetch first 10000 rows only
) one
left outer join many using(one_id)
order by one.created_at desc
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (actual time=71.749..72.805 rows=10000 loops=1)
Sort Key: one.created_at DESC
Sort Method: quicksort Memory: 853kB
-> YB Batched Nested Loop Left Join
(actual time=7.819..67.987 rows=10000 loops=1)
Join Filter: (one.one_id = many.one_id)
-> Limit (actual time=1.391..8.820 rows=10000 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one
(actual time=1.389..6.697 rows=10000 loops=1)
Index Cond: (category = '🍓'::text)
Heap Fetches: 0
Storage Index Read Requests: 10
Storage Index Read Execution Time: 1.232 ms
-> Index Only Scan using many_one_asc on many
(actual time=4.778..4.778 rows=0 loops=10)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 3.823 ms
Execution Time: 74.776 ms
Storage Read Requests: 20
Storage Read Execution Time: 39.462 ms
Storage Execution Time: 39.462 ms
Peak Memory Usage: 2806 kB
(26 rows)
Now we are back to reading only rows=10000
rows from "one" and only loops=10
loops to "many", 10 batches of 1024 arrays. The time complexity is back to O(1) and depends only on the LIMIT, with at maximum one remote call per thousand of rows.
With the same indexing method as in previous post, my join remains scalable by pushing down the pagination to the driving table, and enabling yb_bnl_batch_size
. From a developer point of view, this is still easier than dealing with manual partitioning and physical colocation to reduce latency, or de-normalizing to avoid joins.
Order Preserving Batched Nested Loop in 2.20.2
The issue mentioned above is fixed. Look at the following execution plan where Sort Keys: one.created_at DESC
is mentioned below the join:
yugabyte=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 11.2-YB-2.20.2.0-b0 on aarch64-unknown-linux-gnu, compiled by clang version 16.0.6 (/opt/yb-build/llvm/yb-llvm-v16.0.6-yb-3-1695119965-1e6329f4-centos7-aarch64-build/src/llvm-project/clang 1e6329f40e5c531c09ade7015278078682293ebd), 64-bit
(1 row)
yugabyte=# explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
from one
left outer join many using(one_id)
where one.category='🍓'
order by one.created_at desc
fetch first 10000 rows only;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (actual time=5.520..50.596 rows=10000 loops=1)
-> YB Batched Nested Loop Left Join (actual time=5.519..47.795 rows=10000 loops=1)
Join Filter: (one.one_id = many.one_id)
Sort Keys: one.created_at DESC
-> Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.077..5.495 rows=10240 loops=1)
Index Cond: (category = '🍓'::text)
Heap Fetches: 0
Storage Index Read Requests: 10
Storage Index Read Execution Time: 0.959 ms
-> Index Only Scan using many_one_asc on many (actual time=2.865..2.865 rows=0 loops=10)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 2.037 ms
Planning Time: 0.502 ms
Execution Time: 53.284 ms
Storage Read Requests: 20
Storage Read Execution Time: 21.330 ms
Storage Write Requests: 0
Catalog Read Requests: 0
Catalog Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 21.330 ms
Peak Memory Usage: 2023 kB
(24 rows)