Filtering on DENSE_RANK() optimized as pushed-down DISTINCT in YugabyteDB

Franck Pachot - Apr 2 '23 - - Dev Community

The SQL analytic functions (aka window functions) are powerful and you may use them to filter the first row in a window. However, the analytic function is processed in a subquery to be filtered later, which may prevent some optimizations in data access. I will show how the same can be done more efficiently by filtering during the Index Scan.

Here is a simple table with observation type, name, text and timestamp:

drop table observations;
create extension if not exists orafce;
create table observations (
 primary key ( observation_type asc,observation_name asc, observation_date desc )
 , observation_type text
 , observation_name text
 , observation_date timestamptz
 , observation_text text
);
insert into observations select 
dbms_random.string('U',1),dbms_random.string('U',2),clock_timestamp(), dbms_random.string('L',100) from generate_series(1,1000);
\watch 0.01
Enter fullscreen mode Exit fullscreen mode

DENSE_RANK() = 1

I want the last value for each observation in a specific type.
Here is the execution plan when using DENSE_RANK() and filtering on the rank number 1:

explain (analyze, dist, costs off)
select o."observation_type", o."observation_name", o."observation_text" 
from (
 select *, dense_rank () over (
  partition by o."observation_type", o."observation_name" 
  order by o."observation_date" desc
 ) as rank_number from observations o
 where o."observation_type" = 'Z'
) as o where "o".rank_number=1
;


                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Subquery Scan on o (actual time=3.176..466.641 rows=143 loops=1)
   Filter: (o.rank_number = 1)
   Rows Removed by Filter: 162857
   ->  WindowAgg (actual time=3.174..456.372 rows=163000 loops=1)
         ->  Index Scan using observations_pkey on observations o_1 (actual time=3.156..342.562 rows=163000 loops=1)
               Index Cond: (observation_type = 'Z'::text)
               Storage Index Read Requests: 160
               Storage Index Execution Time: 228.001 ms
 Planning Time: 0.103 ms
 Execution Time: 466.741 ms
 Storage Read Requests: 160
 Storage Write Requests: 0
 Storage Execution Time: 228.001 ms
 Peak Memory Usage: 491 kB
(14 rows)
Enter fullscreen mode Exit fullscreen mode

This had read rows=163000 rows from the distributed storage with Storage Index Read Requests: 160 network calls and then the postgres backend has discarded Rows Removed by Filter: 162857 from those rows to keep only rows=143 that verify the filter.

DISTINCT and scalar subquery

This could be optimized by reading only the required rows, limiting the network calls, transfer, and further single-process filtering.

This is possible on this index thanks to YugabyteDB hybrid scan but, at least in this version (YB-2.17), the filtering on the window function is not pushed down to the index scan.

To allow the hybrid scan optimization, I'll use DISTINCT in a subquery to replace the PARTITION BY of the analytic window. Then, for each window I'll get the last value with a scalar subquery doing the ORDER BY and LIMIT to get the first row for each, which is equivalent to DENSE_RANK()=1

explain (analyze, dist, costs off)
select  d."observation_type", d."observation_name" , (
 select observation_text 
 from observations 
 where (   observation_type ,    observation_name )
      =(d."observation_type", d."observation_name")
 order by observation_date desc
 limit 1
) from (
 -- get all distinct values with push down optimization
 select distinct o."observation_type", o."observation_name"
 from observations o
 -- use range condition to force an index scan
 where o."observation_type" between 'Z' and 'Z'
) d
;

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Subquery Scan on d (actual time=1.757..37.104 rows=143 loops=1)
   ->  Unique (actual time=1.392..1.695 rows=143 loops=1)
         ->  Index Scan using observations_pkey on observations o_1 (actual time=1.391..1.607 rows=143 loops=1)
               Index Cond: ((observation_type >= 'Z'::text) AND (observation_type <= 'Z'::text))
               Storage Index Read Requests: 1
               Storage Index Execution Time: 0.000 ms
   SubPlan 1
     ->  Limit (actual time=0.246..0.246 rows=1 loops=143)
           ->  Index Scan using observations_pkey on observations (actual time=0.237..0.237 rows=1 loops=143)
                 Index Cond: ((observation_type = d.observation_type) AND (observation_name = d.observation_name))
                 Storage Index Read Requests: 1
                 Storage Index Execution Time: 0.224 ms
 Planning Time: 1.614 ms
 Execution Time: 37.187 ms
 Storage Read Requests: 144
 Storage Write Requests: 0
 Storage Execution Time: 32.000 ms
 Peak Memory Usage: 56 kB
(18 rows)
Enter fullscreen mode Exit fullscreen mode

I have added between 'Z' and 'Z' because in this YugabyteDB version an Index Scan would not be used with = 'Z'. It is always important to check the execution plan when doing such optimization, and add a comment in the query.

DISTINCT and LATERAL join

When you need more than one column, it is better to query from a LATERAL join rather than a scalar subquery:

explain (analyze, dist, costs off)
select  o.*  from (
 -- get all distinct values with push down optimization
 select distinct o."observation_type", o."observation_name"
 from observations o
 -- use range condition to force an index scan
 where o."observation_type" between 'Z' and 'Z'
) d , lateral (
-- get additional information for each distinct value
select * from observations
where (   observation_type ,    observation_name )
      =(d."observation_type", d."observation_name")
order by observation_date desc
limit 1
) o
;
Enter fullscreen mode Exit fullscreen mode

The execution plan is similar. The most important is to verify the absence of Rows Removed by Filter, the Index Scan or Index Only Scan for both subqueries, the small Storage Index Read Requests for the Unique branch, and the total Storage Read Requests matching the rows= for the Unique branch.

This works in YugabyteDB thanks to Hybrid Scan (aka Skip Scan, aka Loose Index Scan). You can't use that in PostgreSQL.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .