With the perfect query planner (aka optimizer) you can provide as many indexes as possible, all join methods, and it will figure out the best execution plan. This blog post is the opposite, given a query with joins and filters, is there a single set of indexes where the execution plan will be acceptable for the user, even if the optimizer is wrong in its estimations?
What is the best plan? The query planner, or optimizer, has one goal: evaluate the cost of all possible execution plans and choose the one with cheapest estimated cost. This would work if the planning (coming from the estimation of selectivity and ressource access times) was easy. But SQL queries are complex and executed with variables values for which the selectivity can diverge by several orders of magnitude. Executing a join algorithm that is optimal for few rows can be disastrous with higher cardinalities. Because data grows and new use cases are added, this will happen one day and block the application.
The optimizer always tries to find the perfect plan, but that's not what users want. Perfect is the enemy of good. Users are not gamblers expecting a plan that, if all gods are with them, will have a blazing fast execution. They just want to stay forever with the performance they accepted when validating the software.
What the user wants is, in order of importance:
- consistent performance: response time should be in the same ballpark as what has been tested and accepted
- understandable performance: time is proportional to the user perception of the query complexity
- faster response time, but this depends on the two others. Without the predictable performance from the two prior points, it is impossible to set any tuning goal
The failure to provide the above properties was one reason for NoSQL. Without a cache, with one access path, with no joins, no wait for consistency, the performance becomes more predictable. But, if if you want to remove the optimizer choices, you can also do that in a SQL database, with planner parameters and optimizer hints.
Here is a small example to get a better understanding on the join methods (Nested Loop, Hash Join, Merge Join) efficiency when they are not the best plan.
Two tables
I create the following tables, with one million rows, on YugabyteDB (PostgreSQL compatible):
create table a ( id int, filter int );
create table b ( id int, filter int );
insert into a select n, sign(mod(n,1e5))
from generate_series(1,1e6) n;
insert into b select n, sign(mod(n,1e5))
from generate_series(1,1e6) n;
My goal is to filter on each table using the filter
column, and then join on id
. I've built the filter
value to test for different selectivity:
-
where filter=0
returns 10 rows -
where filter=1
returns 999990 rows
This will allow me to test the extreme cases of cardinalities, with each join method.
Two indexes
I don't want to give more choices to the optimizers, and I create the only indexes which are good for all join methods:
create index a_filter_id on a ( filter, id asc);
create index b_filter_id on b ( filter, id asc);
Those indexes can be efficient for:
- filtering (
filter = ?
) because the index starts with the filtering column. An Index Scan can then seek directly to the range of data of interest. This will be perfect for a Hash Join or Merge Join where both sets of rows have to be scanned. - retrieving the rows in the same order for on the join column (
id
) because, once they are filtered on the index range, the output is sorted on this column. This will save for additional sorting for the Merge Join. - covering all columns, especially the join column (
id
) are available without going to the table, to allow an Index Only Scan.
I can check all this by running an explain
for each table and its predicate ordered by the column I'll use for the join:
yugabyte=# explain select id, filter from a where filter=0 order by id;
QUERY PLAN
-----------------------------------------------------------------------------
Index Only Scan using a_filter_id on a (cost=0.00..15.25 rows=100 width=8)
Index Cond: (filter = 0)
(2 rows)
This is the fastest I can do to minimize the rows I have to read. Even if in some case, where the selectivity is low, a Seq Scan could be faster, I'll force the use of the index with hints (IndexOnlyScan(a) IndexOnlyScan(b)
) because my goal is to verify the performance of one execution plan with different selectivity. This is what many applications are doing when they want predictable execution plans: force the indexes with hints, with a rule-based optimizer, or by forcing a very low cost for index access (you can look at the SAP recommended configuration for Oracle, disabling all optimizer features that came after the rule based optimizer).
I'll run the queries several times as I want to compare on a warm cache but show only one here. You can reproduce it if you want.
I didn't ANALYZE the table and didn't enable yb_enable_optimizer_statistics
as my goal is to force the plans and look at the execution time.
Small inner and outer tables
I start with the highest selectivity on both side: filter=0
returns 3 rows.
explain (costs off, analyze, summary off)
/*+ Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=0;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Merge Join (actual time=1.137..1.153 rows=10 loops=1)
Merge Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=0.733..0.738 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Materialize (actual time=0.400..0.406 rows=10 loops=1)
-> Index Only Scan using b_filter_id on b (actual time=0.392..0.397 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
(9 rows)
Running it multiple times, it takes 1 to 2 milliseconds with a Merge Join. This is good, single digits milliseconds to return few rows. Let's see what happens if another execution plan is chosen. I disable Merge Join:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=0;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Join (actual time=1.046..1.053 rows=10 loops=1)
Hash Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=0.558..0.563 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Hash (actual time=0.468..0.468 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using b_filter_id on b (actual time=0.454..0.459 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
(10 rows)
With Hash Join, the time is similar. There are only 3 rows to hash or materialize so, even if those join methods have been invented for larger rowsets, they are still good for small ones, as long as the high selectivity filtering is done when scanning, thanks to the index.
Let's disable Merge Join as well and see what happens with Nested Loop:
explain (costs off, analyze, summary off)
/*+ Set(enable_hashjoin off) Set(enable_mergejoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=0;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (actual time=1.189..4.098 rows=10 loops=1)
-> Index Only Scan using a_filter_id on a (actual time=0.675..0.684 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=0.329..0.330 rows=1 loops=10)
Index Cond: ((filter = 0) AND (id = a.id))
Heap Fetches: 0
(7 rows)
The execution time goes to 4 milliseconds. This is still acceptable. Especially given that the performance is proportional to the number of rows. You can see that the first row was joined in one millisecond.
When joining few rows from the outer table to few rows from the inner table, any join method gives acceptable performance, with a slight preference for Hash or Merge join. My qualification of "the best one" will not be in 1ms vs. 4ms but in how they are still acceptable when the cardinalities change.
Larger inner table
Let's see what happens with different selectivity. I change the filter on the inner table: b.filter=1
which returns 999990 rows.
explain (costs off, analyze, summary off)
/*+ Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Merge Join (actual time=3965.831..3965.831 rows=0 loops=1)
Merge Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=0.538..0.561 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Materialize (actual time=4.141..3857.054 rows=999990 loops=1)
-> Index Only Scan using b_filter_id on b (actual time=4.133..3676.393 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
This takes 4 seconds because nearly 1 million rows are read from this inner table. When knowing this, it could be acceptable. But the result after the join is small (even an empty result there) and the user will not understand this long response time.
Checking another execution plan by disabling Merge Join:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Hash Join (actual time=4059.080..4059.080 rows=0 loops=1)
Hash Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=0.665..0.677 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Hash (actual time=3966.577..3966.577 rows=999990 loops=1)
Buckets: 131072 (originally 1024) Batches: 16 (originally 1) Memory Usage: 3471kB
-> Index Only Scan using b_filter_id on b (actual time=4.050..3725.402 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
(10 rows)
The Hash Join takes the same time. I disable it and get a Nested Loop:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Set(enable_hashjoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=0 and b.filter=1;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (actual time=5.743..5.743 rows=0 loops=1)
-> Index Only Scan using a_filter_id on a (actual time=0.722..0.748 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=0.489..0.489 rows=0 loops=10)
Index Cond: ((filter = 1) AND (id = a.id))
Heap Fetches: 0
(7 rows)
This one is fast: 5 milliseconds. This plan is better and probably closer to what the user expects for a small result. This is the advantage of a Nested Loop: the join condition can be applied when scanning the inner table (Index Cond: ((filter = 1) AND (id = a.id))
) which is not the case with the other methods.
For each case, I know what the best plan is. But accepting the incertitude of the query planner estimation, what would you choose between the following?
- Merge Join takes 2 milliseconds with low cardinality but can go to 4 seconds
- Nested Loop takes 4 milliseconds in all case
An optimizer will generally choose the first one, because of the lowest cost estimated. A database vendor will also prefer the first one and publish benchmarks that avoids the misestimates. An application user will prefer the second one, that is always in single digits milliseconds. Predictable performance is the goal.
Should we force Nested Loop always? Let's see what happens when the incertitude is in the outer table.
Larger outer table
The outer row that will be larger with a.filter=1
but few rows from the inner table with a.filter=0
. The join direction (a
to b
) that I force with the Leading hint Leading((a b))
is not the best here, but remember that my goal is to test what happens with bad estimations, and then with bad execution plans.
explain (costs off, analyze, summary off)
/*+ Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=0;
QUERY PLAN
------------------------------------------------------------------------------------------------
Merge Join (actual time=3994.370..3994.370 rows=0 loops=1)
Merge Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.460..3893.289 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Materialize (actual time=0.572..0.605 rows=10 loops=1)
-> Index Only Scan using b_filter_id on b (actual time=0.561..0.585 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
(9 rows)
This takes 4 seconds, the time to read nearly one million rows for the outer table. The join itself adds only 100 milliseconds.
Hash Join is nearly the same:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=0;
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Join (actual time=3977.997..3977.998 rows=0 loops=1)
Hash Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.234..3865.260 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Hash (actual time=2.518..2.518 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using b_filter_id on b (actual time=2.487..2.497 rows=10 loops=1)
Index Cond: (filter = 0)
Heap Fetches: 0
(10 rows)
If a Nested Loop is choosen, because it was the preferred method in from the previous cases, this means 999990 loops:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Set(enable_hashjoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=0;
QUERY PLAN
------------------------------------------------------------------------------------------------
Nested Loop (actual time=264337.144..264337.144 rows=0 loops=1)
-> Index Only Scan using a_filter_id on a (actual time=4.381..2290.456 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=0.252..0.252 rows=0 loops=999990)
Index Cond: ((filter = 0) AND (id = a.id))
Heap Fetches: 0
(7 rows)
This is where Nested Loop can go very bad, 5 minutes here. This is not a surprise because there are 999990 loops and, in a distributed database, this means remote calls.
This is where cardinality misestimate can be very bad. We need to let the optimizer find the right join method, and hope it will not fail. Some databases use Adaptive Joins so that, at execution time, when this is detected it can switch to a hash join.
YugabyteDB has another, proactive, solution to lower the consequences of a bad Nested Loop: run the outer loop on a batch of rows rather than one by one. This can be defined with yb_bnl_batch_size
higher than 1:
set yb_bnl_batch_size=10;
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Set(enable_hashjoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=0;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (actual time=32608.808..32608.808 rows=0 loops=1)
Join Filter: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.305..659.195 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=0.301..0.301 rows=0 loops=99999)
Index Cond: ((filter = 0) AND (id = ANY (ARRAY[a.id, $1, $2, $3, $4, $5, $6, $7, $8, $9])))
Heap Fetches: 0
(8 rows)
You can understand the implementation details in the Index Cond
: an array of value is passed for filtering. In traditional databases with B-Tree, this usually prevents the usage of the index as we don't have a single range to read from it. In YugabyteDB, with all tables and indexes stored in LSM-Tree, this is optimized by seeking into the sorted structure.
With batches of 10 rows per loop, the 5 minutes have reduced to 32 seconds. With a larger batch size this is even better:
set yb_bnl_batch_size=100;
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Set(enable_hashjoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------
YB Batched Nested Loop Join (actual time=7038.942..7038.942 rows=0 loops=1)
Join Filter: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.186..475.601 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=0.578..0.578 rows=0 loops=10000)
Index Cond: ((filter = 0) AND (id = ANY (ARRAY[a.id, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14,
$15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $
39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63
, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78, $79, $80, $81, $82, $83, $84, $85, $86, $87,
$88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99])))
Heap Fetches: 0
(8 rows)
With batches of 100 rows, I have only 10000 loops from the million rows in the outer table, and this takes 7 seconds.
Now, thanks to Batched Nested Loop, I have one join method that, even if not always the best one, remains acceptable for all cases. Note that I'm testing here an extreme case where the optimizer would have chosen to start its join by a one million rows table when the other has only 10 rows.
Larger outer and inner
Let's try the worst case, with low selectivity on both side, so that the join direction doesn't matter.
explain (costs off, analyze, summary off)
/*+ Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Merge Join (actual time=8.494..4758.582 rows=999990 loops=1)
Merge Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.226..1521.561 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Materialize (actual time=4.263..2715.448 rows=999990 loops=1)
-> Index Only Scan using b_filter_id on b (actual time=4.252..2493.512 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
This takes again 5 seconds with a Merge Join. Note that without my index sorted on the filter and then on the join column, it would have required additional sorting. I'm running this with the best indexes for this. You can see that the time is proportional to the resultset, with the first row joined in 8 milliseconds.
If Merge Join is not possible:
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Hash Join (actual time=3995.393..8273.822 rows=999990 loops=1)
Hash Cond: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.274..3693.205 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Hash (actual time=3990.359..3990.360 rows=999990 loops=1)
Buckets: 131072 (originally 1024) Batches: 16 (originally 1) Memory Usage: 3471kB
-> Index Only Scan using b_filter_id on b (actual time=4.282..3733.138 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
(10 rows)
The Hash Join takes longer because it has to hash a large rowset, it takes 8 seconds. In addition to that, the first row cannot be joined before the whole inner table is hashed.
Now forcing the Nested Loop (with Batching still set to 100):
explain (costs off, analyze, summary off)
/*+ Set(enable_mergejoin off) Set(enable_hashjoin off) Leading((a b)) IndexOnlyScan(a) IndexOnlyScan(b) */
select id, filter from a join b using (id)
where a.filter=1 and b.filter=1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (actual time=8.074..18138.524 rows=999990 loops=1)
Join Filter: (a.id = b.id)
-> Index Only Scan using a_filter_id on a (actual time=4.299..467.477 rows=999990 loops=1)
Index Cond: (filter = 1)
Heap Fetches: 0
-> Index Only Scan using b_filter_id on b (actual time=1.453..1.637 rows=100 loops=10000)
Index Cond: ((filter = 1) AND (id = ANY (ARRAY[a.id, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78, $79, $80, $81, $82, $83, $84, $85, $86, $87,$88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99])))
Heap Fetches: 0
(8 rows)
This takes 18s. Not the best, but not too far from the other join methods either. Remember that this is the worst possible case.
This means that even if the worst join method is chosen, the query still runs.
Summary
I have created indexes that are optimized for my access pattern: filter conditions and then join condition, and with range sharding to preserve the order. Here is the response time I observed:
outer table | inner table | Merge Join | Hash Join | YB Nested Loop |
---|---|---|---|---|
10 rows | 10 rows | 2ms | 2ms | 4ms |
10 rows | 1 million | 4s | 4s | 4ms |
1 million | 10 rows | 4s | 4s | 7s |
1 million | 1 million | 4s | 8s | 18s |
If you want the maximum execution plan stability possible, there's only one join method to force, the Batched Nested Loop is probably the one that fits the user requirement. When it is more expensive than the other join methods, the difference is only 5x and can be easily understood by the user because the result is one million rows.
If we let the query planner decide for the execution plan, we are still guaranteed to have acceptable response time in all cases. With the right indexes defined, and statistics, and simple queries, the query planner should choose the best plan. But a mistake will not be fatal because all plans are acceptable for all cardinalities (with batched nested loops enabled).
This relies on optimal, covering, indexes, and you will not create the perfect index for each use case. But at least the critical ones should follow this pattern for predictable performance.