Partial Index as an approximate Count(*) Materialized View

Franck Pachot - Dec 7 '22 - - Dev Community

In this new series I'll look at using indexes where you may have thought about a fast refresh on commit materialized view. Because indexes can do a lot more than what you think. This post is about maintaining automatically an approximate aggregate (count, min, max) with fast query and low overhead on DML.

When you want to count the rows of a big tables, it has to scan all rows. Even it the count can be pushed down in YugabyteDB, this is still a lot of work.

I create and load a large table. If you want to try it, you can already create the index that I define later, because it is easier faster to create it before the inserts.

I Load my table with large rows:

drop table if exists mybigtable cascade;

create table mybigtable (
 primary key(id asc)
 , id bigint generated always as identity
 , data text
);

copy mybigtable(data) from program
$bash$
base64 -w 1000 /dev/urandom | head -1000000
$bash$ with ( replace )
\watch

Enter fullscreen mode Exit fullscreen mode

I've canceled it after a while and got a 40GB table (with compression) with 80 tablets (low phase auto-splitting doing its job to distribute the load).

yugabyte=# select pg_size_pretty(pg_table_size('mybigtable'::regclass));

 pg_size_pretty
----------------
 44 GB
(1 row)

yugabyte=# select * from yb_table_properties('mybigtable'::regclass);

 num_tablets | num_hash_key_columns | is_colocated | tablegroup_oid | colocation_id
-------------+----------------------+--------------+----------------+---------------
          83 |                    0 | f            |                |


(1 row)

yugabyte=# \timing on
Timing is on.

yugabyte=#  select count(*) from mybigtable;
  count
----------
 42972000
(1 row)

Time: 85898.780 ms (01:25.899)

Enter fullscreen mode Exit fullscreen mode

A count takes more than one minute. I may prefer a fast approximate count.

As my table is range partitioned, I cannot benefit from the yb_hash_code() push-down to limit the number of tablets to read. But I'll use it as an easy hash function that returns a number from 0 to 65535.

I'm creating a partial index that will be smaller in width, storing only a hash code, and in height, storing an index entry only for 1/65536 original row:

create index mybigtable_approximate_index on mybigtable (
 id asc
) where yb_hash_code(id)=0 ;

create view mybigtable_approximate_agg as 
select 65536*count(*) as count, min(id), max(id)
 from mybigtable where yb_hash_code(id)=0;
Enter fullscreen mode Exit fullscreen mode

The make it easy to use I've created a view on top of it.
This view returns one row with an approximate count, min and max by reading only the small index:

select * from mybigtable_approximate_agg;
Enter fullscreen mode Exit fullscreen mode

This is fast, 2 milliseconds, as it holds only 1/65536th of the rows, 669 here, and only small columns:

yugabyte=# explain (costs off, analyze, verbose)
select * from mybigtable_approximate_agg;

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2.345..2.345 rows=1 loops=1)
   Output: (65536 * count(*)), min(mybigtable.id), max(mybigtable.id)
   ->  Index Only Scan using mybigtable_approximate_index on public.mybigtable (actual time=2.070..2.286 rows=669 loops=1)
         Output: mybigtable.id
         Heap Fetches: 0
 Planning Time: 0.108 ms
 Execution Time: 2.389 ms
 Peak Memory Usage: 18 kB
(8 rows)

yugabyte=#  select * from mybigtable_approximate_agg;
  count   | min  |   max
----------+------+----------
 43843584 | 4443 | 42941580
(1 row)

Time: 14.758 ms
Enter fullscreen mode Exit fullscreen mode

The result is approximate, it has 669 index entries and is multiplied by 65536 to estimate the total number in the table. The error here is 100*(43843584 - 42972000)/42972000=2% and comes from the distribution of yb_hash_code(). This is just an example - you can find other hash function. There are multiple reasons for an approximate count. It can be used to draw a progress bar for an export to file, for example. Or maybe to set the query planner statistics. Even 30% error would not matter when we are interested by the order of magnitude.

As my index is also replicated, I can make it even faster in multi-region deployment when reading from the closest replica with Follower Reads. When I want an approximate result, I can probably accept to miss the rows inserted during the past 30 seconds:

set default_transaction_read_only = on;
set yb_read_from_followers=on;
set yb_follower_read_staleness_ms=3000;
select * from mybigtable_approximate_agg;
Enter fullscreen mode Exit fullscreen mode

This would allow a larger table, and then a smaller margin of error. If you want a precise count,you can maintain it from a trigger but still good to have multiple entries to sum (one per hascode, maybe) to avoid concurrent upgtes of the same counter.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .