Fast SELECT COUNT(*) WHERE in YugabyteDB

Franck Pachot - Oct 22 '23 - - Dev Community

You see a Full Table Scan when doing a SELECT COUNT(*) and wonder if you can do better. Full Table Scan (Seq Scan in PostgreSQL and YugabyteDB execution plans) is not evil in SQL databases. If you want to count all the rows from a table, it makes sense to read all of them, and that's where a Full Table Scan is efficient.
Let's look at an example on YugabyteDB.

In a lab, I create a demo table with an a primary key "id", a "status" and a filler column to get a row width similar to common tables. I filled this table with 170 million rows.

create extension if not exists orafce;
create table demo ( 
 primary key(id),
 id bigserial, status int, value text
);

insert into demo (status, value)
select dbms_random.value(0,2)::int , dbms_random.string('p',100)
 from generate_series(1,100000)
\watch
-- stopped after 1698 iterations
Enter fullscreen mode Exit fullscreen mode

I ANALYZE the table to check the number of rows and the distribution of the "status" column that I filled with random 0, 1, or 2:


yugabyte=# \x
Expanded display is on.
yugabyte=# select pg_size_pretty(pg_table_size('demo'));
-[ RECORD 1 ]--+------
pg_size_pretty | 21 GB

yugabyte=# analyze demo;
ANALYZE

yugabyte=# select relname, reltuples
           from pg_class where oid='demo'::regclass;

-[ RECORD 1 ]--------
relname   | demo
reltuples | 2.185e+08

yugabyte=# select * from pg_stats 
           where schemaname='public' and tablename='demo'
           and attname='status';

-[ RECORD 1 ]----------+---------------------------
schemaname             | public
tablename              | demo
attname                | status
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 3
most_common_vals       | {1,2,0}
most_common_freqs      | {0.502367,0.2499,0.247733}
histogram_bounds       |
correlation            | 0.380514
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

yugabyte=# \x
Expanded display is off.

yugabyte=# select * from yb_table_properties('demo'::regclass);
 num_tablets | num_hash_key_columns | is_colocated | tablegroup_oid | colocation_id
-------------+----------------------+--------------+----------------+---------------
          24 |                    1 | f            |                |
(1 row)

Enter fullscreen mode Exit fullscreen mode

I have a 20 GB table with 169800000 rows distributed into 24 tablets. I have 54620846 rows with status=0 and I'll check the execution plan to count those rows.

I'm running this on YugabyteDB 2.19.2 on a 3 nodes server, with all default options:


yugabyte=# select version();
                                                                                            version                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.19.2.0-b0 on aarch64-unknown-linux-gnu, compiled by clang version 15.0.7 (https://github.com/yugabyte/llvm-project.git 6b9d30d80f5cebc66c9fa46013d18f125ea8ec0e), 64-bit
(1 row)

yugabyte=# select cloud, region, zone, uuid from yb_servers();
 cloud |  region   |    zone    |               uuid
-------+-----------+------------+---------------------------------
 aws   | eu-west-1 | eu-west-1b | fceb7c2d838849698f727e536028e569
 aws   | eu-west-1 | eu-west-1a | b681f2627a6c4b3893121c7678b1b7bb
 aws   | eu-west-1 | eu-west-1c | dd86768cea534fc6920bd0bb42884b55
(3 rows)

yugabyte=# \dconfig yb*push*
      List of configuration parameters
             Parameter              | Value
------------------------------------+-------
 yb_enable_distinct_pushdown        | on
 yb_enable_expression_pushdown      | on
 yb_enable_index_aggregate_pushdown | on
 yb_enable_sequence_pushdown        | on
 yb_pushdown_is_not_null            | on
 yb_pushdown_strict_inequality      | on
(6 rows)

Enter fullscreen mode Exit fullscreen mode

I verified the pushdown options because those are those that help with aggregate function: we want to distribute the aggregation to the multiple tablet server rather than fetching all rows to count them in the single PostgreSQL backend process.

Seq Scan

My goal is to count the number of rows with status=0

Here is the execution plan:

yugabyte=# explain (analyze, dist, costs off, summary off)
           select count(*) from demo where status=0;

                                 QUERY PLAN
--------------------------------------------------------------------------
 Finalize Aggregate (actual time=22610.905..22610.905 rows=1 loops=1)
   ->  Seq Scan on demo (actual time=5352.550..22610.866 rows=24 loops=1)
         Remote Filter: (status = 0)
         Storage Table Read Requests: 4
         Storage Table Read Execution Time: 22610.505 ms
         Partial Aggregate: true
(6 rows)



Enter fullscreen mode Exit fullscreen mode

Seq Scan is a full table scan but with many optimisations in YugabyteDB.

  • The filtering expression has been pushed down as a Remote Filter applied on rows as soon as they are read in each tablet.
  • In addition to it, each tablet has counted the rows and sent only the result (Partial Aggregate) to be summed by the PostgreSQL backend (Finalize Aggregate).
  • An additional optimization is the parallelization of those scans and partial aggregates. I have 24 tablets and that's why there are rows=24 for the partial result. I see Storage Table Read Requests: 4 then each has scanned 6 tablets in parallel, which is the default for a 3 nodes cluster.

Index Only Scan

Here 75% of the rows were read locally by each tablets and discarded by the status=0 condition. With block-based traditional databases using B-Tree, an index scan is not faster because of the index fragmentation. However, with YugabyteDB LSM-Tree an index scan may still help to read a range containing 25% of the rows.

Let's create such index and run the same query:


yugabyte=# create index demo_status on demo(status);
CREATE INDEX

yugabyte=# select * from yb_table_properties('demo_status'::regclass);
 num_tablets | num_hash_key_columns | is_colocated | tablegroup_oid | colocation_id
-------------+----------------------+--------------+----------------+---------------
           6 |                    1 | f            |                |
(1 row)

yugabyte=# explain (analyze, dist, costs off, summary off)
           select count(*) from demo where status=0;

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=16550.126..16550.126 rows=1 loops=1)
   ->  Index Only Scan using demo_status on demo (actual time=16550.112..16550.116 rows=1 loops=1)
         Index Cond: (status = 0)
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Read Execution Time: 16550.016 ms
         Partial Aggregate: true
(7 rows)
Enter fullscreen mode Exit fullscreen mode

YugabyteDB shows many optimizations here.

  • The index scan, pushes down the Index Cond as it is not only used to filter but also to read only the range that contains the condition.
  • To count the rows, no additional columns were necessary to read, the Index Only Scan doesn't have to go to the table.
  • As with the Seq Scan, the count was pushed down with Partial Aggregate calculated close to the data that is read.
  • The index being sharded on on the status, there's only one tablet to read, in one Storage Index Read Request

Sharding considerations for parallel scans

In the previous case, because I filter on status with an equality predicate, and because it is the hash key, only one tablet holds the value and then the scan is not parallelized. Such HASH index on a column with few values may lead to large tablets because they are not split further.

If I have a range index with more columns in the keys, it will be split on the values of all key columns. Many tablets can hold the status=0 index entries. Here is an example where I add id and the index has 16 tablets:

yugabyte=#  create index demo_status on demo(status asc, id);
CREATE INDEX

yugabyte=# select * from yb_table_properties('demo_status'::regclass);
 num_tablets | num_hash_key_columns | is_colocated | tablegroup_oid | colocation_id
-------------+----------------------+--------------+----------------+---------------
          16 |                    0 | f            |                |
(1 row)

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=4626.779..4626.780 rows=1 loops=1)
   ->  Index Only Scan using demo_status on demo (actual time=4625.380..4626.768 rows=4 loops=1)
         Index Cond: (status = 0)
         Heap Fetches: 0
         Storage Index Read Requests: 3
         Storage Index Read Execution Time: 4626.597 ms
         Partial Aggregate: true
(7 rows)
Enter fullscreen mode Exit fullscreen mode

The response time is faster because multiple tablets have been read in parallel.

This was tested YugabyteDB 2.19.2 (#15857 YSQL: support IndexScan aggregate pushdown)

In Summary

Do not fear Seq Scan when you have to read lot of rows, especially with expression pushdown (Remote Filter). An index can help to read less rows, and do not hesitate to add more column so that the index is used by other queries. In addition to Remote Filter or Index Cond you should check that the aggregation is pushed down (Partial Aggregate: true) to avoid sending lot of rows to the PostgreSQL backend you are connected to. The execution plan tells you all about the query time complexity and scalability.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .