Some extensions are very easy to use in YugabyteDB, when they are bundled and accessible with a simple CREATE EXTENSION
and that's the case for hll
which provides HyperLogLog functions that can provide a fast approximate discount count:
yugabyte=# \x
Expanded display is on.
yugabyte=# select * from pg_available_extensions
where comment ilike '%hyperloglog%';
-[ RECORD 1 ]-----+----------------------------------
name | hll
default_version | 2.16
installed_version |
comment | type for storing hyperloglog data
yugabyte=# \x
Expanded display is off.
yugabyte=# create extension if not exists hll;
CREATE EXTENSION
yugabyte=# \dx+ hll
Objects in extension "hll"
Object description
-------------------------------------------------------------------
cast from bigint to hll_hashval
cast from bytea to hll
cast from hll to hll
cast from integer to hll_hashval
function hll(hll,integer,boolean)
function hll_add(hll,hll_hashval)
function hll_add_agg(hll_hashval)
function hll_add_agg(hll_hashval,integer)
function hll_add_agg(hll_hashval,integer,integer)
...
Example
I create a table with 10 million rows, the id
is unique, with 10 million distinct values, val1
has only 42 distinct values, and val2
has one million distinct values:
yugabyte=# create table demo (
id bigint primary key, val1 int, val2 int
);
CREATE TABLE
yugabyte=# insert into demo
select n, mod(n,42) val1 , mod(n,1e6) val2
from generate_series(1,1e7) n;
INSERT 0 10000000
Here is the exact number of distinct values for those 3 columns:
yugabyte=# \timing on
yugabyte=# select
count(distinct id)
,count(distinct val1)
,count(distinct val2)
from demo
;
count | count | count
----------+-------+---------
10000000 | 42 | 1000000
(1 row)
Time: 28592.003 ms (00:28.592)
This is exact but it takes time. Let's use the HyperLogLog functions.
hll_cardinality
The idea is to build a hash from the values (hll_hash_bigint
) and aggregate it (hll_add_agg
) and then count (hll_cardinality
):
yugabyte=# select
hll_cardinality(hll_add_agg(hll_hash_bigint(id)))
,hll_cardinality(hll_add_agg(hll_hash_bigint(val1)))
,hll_cardinality(hll_add_agg(hll_hash_bigint(val2)))
from demo;
hll_cardinality | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
9656451.96827238 | 42 | 995263.714893331
(1 row)
Time: 20837.501 ms (00:20.838)
This gives an approximate result and is a bit faster. The interesting thing is that the approximation is proportional to the result: I've an exact result for 42, a 0.5% approximation for 1 million, and 3.5% for 10 millions.
The reason for the difference is visible in the execution plan, especially with YugabyteDB which shows the Peak Memory Usage:
yugabyte=# explain (costs off, analyze, buffers)
select
count(distinct id)
,count(distinct val1)
,count(distinct val2)
from demo
;
QUERY PLAN
-----------------------------------------------------------------------------
Aggregate (actual time=28207.491..28207.491 rows=1 loops=1)
Buffers: temp read=90975 written=91146
-> Seq Scan on demo (actual time=2.592..19544.667 rows=10000000 loops=1)
Planning Time: 5.891 ms
Execution Time: 28210.625 ms
Peak Memory Usage: 18619 kB
(6 rows)
Time: 28249.986 ms (00:28.250)
In all cases, it takes 19 seconds to scan all rows (actual time=2.592..19544.667 rows=10000000
) but then they go to memory to be aggregated. The Peak Memory Usage
is 18 MB. The work_mem
is not sufficient and it has to spill to temporary files on disk: 700MB have been read and written (read=90975 written=91146
are 8k buffers). This takes additional 9 seconds here.
Let's do the same with the appoximate count:
yugabyte=# explain (costs off, analyze, buffers)
select
hll_cardinality(hll_add_agg(hll_hash_bigint(id)))
,hll_cardinality(hll_add_agg(hll_hash_bigint(val1)))
,hll_cardinality(hll_add_agg(hll_hash_bigint(val2)))
from demo;
QUERY PLAN
-----------------------------------------------------------------------------
Aggregate (actual time=20028.517..20028.517 rows=1 loops=1)
-> Seq Scan on demo (actual time=2.180..18538.887 rows=10000000 loops=1)
Planning Time: 2.562 ms
Execution Time: 20028.832 ms
Peak Memory Usage: 330 kB
(5 rows)
Time: 20049.146 ms (00:20.049)
The time to scan the table is the same but now, thanks to the HyperLogLog aggregation, the memory usage is small (Peak Memory Usage: 330 kB
) and there's no need to read and write to temporary files. This explains how it is faster than the exact count.
Maintained aggregate
If I want to reduce it further, I can pre-build a table with the HyperLogLog aggregation:
yugabyte=# create table demo_hll (
hash_code int primary key
, id_hll hll, val1_hll hll, val2_hll hll
);
CREATE TABLE
Time: 271.705 ms
yugabyte=# insert into demo_hll select
yb_hash_code(id)%16
,hll_add_agg(hll_hash_bigint(id))
,hll_add_agg(hll_hash_bigint(val1))
,hll_add_agg(hll_hash_bigint(val2))
from demo
group by yb_hash_code(id)%16
;
INSERT 0 16
Time: 25106.458 ms (00:25.106)
Because I will update this table for each insert I've created 16 buckets to keep it scalable. The table is still small:
yugabyte=# select pg_size_pretty(pg_table_size('demo_hll'));
pg_size_pretty
----------------
265 MB
(1 row)
The hll
from different buckets can be combined with hll_union_agg
to get the hll_cardinality
:
yugabyte=# select
hll_cardinality(hll_union_agg(id_hll))
,hll_cardinality(hll_union_agg(val1_hll))
,hll_cardinality(hll_union_agg(val2_hll))
from demo_hll;
hll_cardinality | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
9656451.96827238 | 42 | 995263.714893331
(1 row)
Time: 12.453 ms
yugabyte=# explain (costs off, analyze)
select
hll_cardinality(hll_union_agg(id_hll))
,hll_cardinality(hll_union_agg(val1_hll))
,hll_cardinality(hll_union_agg(val2_hll))
from demo_hll;
QUERY PLAN
-----------------------------------------------------------------------
Aggregate (actual time=2.220..2.220 rows=1 loops=1)
-> Seq Scan on demo_hll (actual time=0.413..1.921 rows=16 loops=1)
Planning Time: 1.997 ms
Execution Time: 2.283 ms
Peak Memory Usage: 462 kB
(5 rows)
The approximation is still good even from multiple buckets but this would have no interest if it needed to be rebuilt each time we want a fresh count. The hll
extension provides hll_add
to update the HyperLogLog structures when adding new values, and I can use a trigger do do it automatically for new inserts:
CREATE OR REPLACE FUNCTION demo_hll()
RETURNS TRIGGER AS
$$
BEGIN
UPDATE demo_hll
SET id_hll = hll_add(id_hll , hll_hash_bigint(NEW.id)),
val1_hll = hll_add(val1_hll , hll_hash_bigint(NEW.val1)),
val2_hll = hll_add(val2_hll , hll_hash_bigint(NEW.val2))
WHERE hash_code = yb_hash_code(NEW.id)%16;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER demo_trigger
AFTER INSERT ON demo
FOR EACH ROW
EXECUTE FUNCTION demo_hll();
Now I test with some inserts to see how the approximate distinct count is accurate from the aggreagate:
yugabyte=# insert into demo
select n+1e7, mod(n,99) val1 , n val2
from generate_series(1,1e3) n;
INSERT 0 1000
Time: 2048.925 ms (00:02.049)
yugabyte=# select
hll_cardinality(hll_union_agg(id_hll))
,hll_cardinality(hll_union_agg(val1_hll))
,hll_cardinality(hll_union_agg(val2_hll))
from demo_hll;
hll_cardinality | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
9656451.96827238 | 99 | 995263.714893331
(1 row)
The approximate count has been updated (I inserted 99 different integer instead of 42 before).
Fast exact count distinct with YugabyteDB
YugabyteDB has an optimisation for DISTINCT: it is pushed down to the storage, and doesn't need to scan all rows because it can skip entries in the LSM-Tree. This works with range-sharding indexes:
create index demo_hll_id on demo (id asc);
create index demo_hll_val1 on demo (val1 asc);
create index demo_hll_val2 on demo (val2 asc);
To benefit from the skip scan, I need to force an Index Scan, and I do that by adding a ... >= (select min(...) from ...)
:
yugabyte=# select count(*) from (
select distinct id from demo
where id>=(select min(id) from demo)
) distinct_pushdown ;
count
----------
10001000
(1 row)
Time: 17707.842 ms (00:17.708)
yugabyte=# select count(*) from (
select distinct val2 from demo
where val2>=(select min(val2) from demo)
) distinct_pushdown ;
count
---------
1000000
(1 row)
Time: 6901.541 ms (00:06.902)
The DISTINCT is pushed down to DocDB, the distributed storage, but all distinct values are fetched to be counted in YSQL, the postgres backend. The less distinct values, and the faster it is:
yugabyte=# select count(*) from (
select distinct val1 from demo
where val1>=(select min(val1) from demo)
) distinct_pushdown ;
Time: 17724.416 ms (00:17.724) count
-------
99
(1 row)
Time: 15.813 ms
This one is really fast. When you know that there are not a lot of distinct values and you have an ascending or descending index on it, it is the fastest and returns an exact value.
Analyze
To help the query planner to estimate predicate selectivity (the cost based optimizer is still in beta in Yugabyte 2.19 and requires to ANALYZE the tables and setting yb_enable_optimizer_statistics=on
) the number of distinct values is stored in the catalog, visible in pg_stats
:
yugabyte=# analyze demo;
WARNING: 'analyze' is a beta feature!
LINE 1: analyze demo;
^
HINT: Set 'ysql_beta_features' yb-tserver gflag to true to suppress the warning for all beta features.
ANALYZE
Time: 10054.772 ms (00:10.055)
yugabyte=# select schemaname, tablename, attname, n_distinct from pg_stats;
schemaname | tablename | attname | n_distinct
------------+-----------+---------+------------
public | demo | id | -1
public | demo | val1 | 45
public | demo | val2 | 972706
(3 rows)
Time: 15.473 ms
The values are approximated, gathered from a sample. The positive values of n_distinct
is the estimation of distinct values. The negative ones are divided by the number of rows (this helps the optimizer to get an accurate picture even when the number of rows increase). A value of -1
means that all rows are distinct.
To summarize...
Queries that need to aggregate rows in an OLTP database should be designed to get the best balance between the exact value and the best performance. YugabyteDB has many optimizations in the distributed storage and knowing them helps to get the best performance, like an Index Scan that can skip to read only the distinct values. When reading many rows is necessary, we rarely need an exact result on a database where data is always updated, and HyperLogLog functions help to reduce the memory and cpu usage when deduplicating the result.