By default, the cost based optimizer estimates predicate selectivity by gathering the number of distinct values for each column. When multiple column filters are combined, the query planner assumes no correlation between columns and multiplies the selectivity. This can lead to bad execution plans.
There are two ways to give more information to the PostgreSQL query planner:
- correlated columns within a single table: extended statistics so that the number of distinct values is also gathered for column groups
- correlated columns in joined tables: Optimizer hints so adjust the cardinality estimations
TL;DR: you need both to give all necessary statistics to the query planner optimizer.
An example in PostgreSQL
In a PostgreSQL lab, I load a table with cities and countries
postgres=# create table countries_cities (
id bigint primary key, name text not null,
state_id int not null ,
state_code text not null, state_name text not null,
country_id int not null ,
country_code text not null, country_name text not null,
latitude float not null, longitude float not null,
wikidataid text default null
);
CREATE TABLE
postgres=# \! wget -qc -O /var/tmp/countries_cities.csv https://github.com/dr5hn/countries-states-cities-database/raw/master/csv/cities.csv
postgres=# \copy countries_cities from '/var/tmp/cities.csv' with csv header
COPY 150573
postgres=# vacuum analyze countries_cities;
VACUUM
I plan to query the cities in California and compare the estimated cardinality (cost... rows=
) to the actual one (actual... rows=
) using explain analyze
. This will help us ensure the estimations are accurate for more complex queries. If the estimates are incorrect, the query planner may choose an inefficient execution plan when the result is joined with other tables.
With a single-column predicate (state_code='CA'
), the estimation is correct: 1074 rows estimated, 1233 actual ones:
postgres=# explain (analyze, summary off)
select * from countries_cities
where state_code='CA';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on countries_cities (cost=0.00..4154.16 rows=1074 width=77) (actual time=1.628..14.918 rows=1233 loops=1)
Filter: (state_code = 'CA'::text)
Rows Removed by Filter: 149340
Now, because there may be a 'CA' code in other countries, I query with the country code (state_code='CA' and country_code='US'
) but the optimizer under-estimates the number of rows, 139 instead of 1124 actual ones:
postgres=# explain (analyze, summary off)
select * from countries_cities
where state_code='CA' and country_code='US';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on countries_cities (cost=0.00..4530.59 rows=139 width=77) (actual time=15.577..18.132 rows=1124 loops=1)
Filter: ((state_code = 'CA'::text) AND (country_code = 'US'::text))
Rows Removed by Filter: 149449
The reason is that there are no optimizer statistics that accounts for the correlation between the country code and the state code: for most cities in this table, CA is in US:
postgres=# select count(*), country_code from countries_cities where state_code='CA' group by country_code order by 1;
count | country_code
-------+--------------
1 | CV
1 | BI
2 | SV
2 | NA
2 | MD
2 | GW
9 | NI
17 | LU
29 | UY
44 | ES
1124 | US
You can add redundant predicates, like state_code='CA' and country_code='US' and state_name='California' and country_name='United States'
and you will quickly see the under-estimation showing rows=1
even if there's still 1124 rows. This will generate very bad execution plans for more complex queries.
We need to give more information to the query planner so that the optimizer can know the correlation between those columns.
Single Table: Extended Statistics
When the correlation is in a single table, we can gather extended statistics with CREATE STATISTICS and ANALYZE:
postgres=# create statistics counties_cities_distinct (ndistinct)
on state_code , country_code, state_name, country_name
from countries_cities;
CREATE STATISTICS
postgres=# analyse countries_cities;
ANALYZE
Those are visible in pg_stat_ext
:
postgres=# \x
postgres=# select statistics_name, attnames, exprs, kinds, n_distinct
from pg_stats_ext where tablename='countries_cities' ;
-[ RECORD 1 ]---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
statistics_name | counties_cities_distinct
attnames | {state_code,state_name,country_code,country_name}
exprs |
kinds | {d}
n_distinct | {"4, 5": 2388, "4, 7": 2395, "4, 8": 2395, "5, 7": 2394, "5, 8": 2394, "7, 8": 190, "4, 5, 7": 2395, "4, 5, 8": 2395, "4, 7, 8": 2395, "5, 7, 8": 2394, "4, 5, 7, 8": 2395}
postgres=> select attnum, attname from pg_attribute
where attrelid='countries_cities'::regclass
and attname in ('state_code','country_code','state_name','country_name')
order by 1;
attnum | attname
--------+--------------
4 | state_code
5 | state_name
7 | country_code
8 | country_name
(4 rows)
This tells the optimizer that there are approximately (this comes from a sample) 190 distinct values for the combination of (country_code, country_name)
, which is the number of distinct countries, 2388 distinct values for the combination of (state_code, state_name)
, which is the number of distinct states, and 2394 or 2395 distinct values when combining attributes from states and countries, accounting for the same state names in multiple countries.
Normalization to two tables
Except if it is a datawarehouse dimension, you will rarely find this in a single table. Let's normalize this to two tables
postgres=# create table countries as
select distinct country_id , country_code as country_code, country_name from countries_cities;
SELECT 198
postgres=# create statistics countries_distinct (ndistinct)
on country_code, country_name from countries;
CREATE STATISTICS
postgres=# create table cities as
select id, name, country_id, state_code, state_name, latitude, longitude, wikidataid from countries_cities;
SELECT 150573
postgres=# create statistics cities_state_distinct (ndistinct)
on state_code, state_name from cities;
CREATE STATISTICS
postgres=# alter table countries add primary key (country_id);
ALTER TABLE
postgres=# alter table cities add primary key (id);
ALTER TABLE
postgres=# alter table cities add foreign key (country_id) references countries(country_id);
ALTER TABLE
postgres=# analyze countries, cities;
ANALYZE
I can query those tables with a join. With a single-column predicate, the estimation (rows=1144
) is close to the actual result (rows=1233
):
postgres=# explain (analyze, summary off)
select * from countries join cities using(country_id)
where state_code='CA';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Hash Join (cost=6.46..3555.69 rows=1144 width=62) (actual time=1.675..15.134 rows=1233 loops=1)
Hash Cond: (cities.country_id = countries.country_id)
-> Seq Scan on cities (cost=0.00..3546.16 rows=1144 width=50) (actual time=1.620..14.797 rows=1233 loops=1)
Filter: (state_code = 'CA'::text)
Rows Removed by Filter: 149340
-> Hash (cost=3.98..3.98 rows=198 width=16) (actual time=0.048..0.048 rows=198 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on countries (cost=0.00..3.98 rows=198 width=16) (actual time=0.007..0.020 rows=198 loops=1)
However with a predicate on each table, the query planner under-estimates the cardinality: 6 rows instead of 1124 rows:
postgres=# explain (analyze, summary off)
select * from countries join cities using(country_id)
where state_code='CA' and country_code='US';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..3564.94 rows=6 width=62) (actual time=12.466..14.660 rows=1124 loops=1)
Join Filter: (countries.country_id = cities.country_id)
Rows Removed by Join Filter: 109
-> Seq Scan on countries (cost=0.00..4.47 rows=1 width=16) (actual time=0.018..0.028 rows=1 loops=1)
Filter: (country_code = 'US'::text)
Rows Removed by Filter: 197
-> Seq Scan on cities (cost=0.00..3546.16 rows=1144 width=50) (actual time=1.537..14.464 rows=1233 loops=1)
Filter: (state_code = 'CA'::text)
Rows Removed by Filter: 149340
Can we create extended statistics? The syntax exists but is not implemented (I'm running this in PostgreSQL 16):
postgres=# create statistics x (ndistinct) on state_code , country_code
from countries join cities using(country_id);
ERROR: 0A000: only a single relation is allowed in CREATE STATISTICS
LOCATION: ProcessUtilitySlow, utility.c:1885
Time: 0.781 ms
When the correlation is between two joined tables, we need something else and this is where optimizer hints can solve the problem.
Across joins: Rows() Hint
There's no optimizer hints in PostgreSQL but we can use pg_hint_plan extension. As it is installed by default in YugabyteDB (which is a PostgreSQL fork with horizontal scalability), I'm running the same in my 3 nodes lab after enabling the Cost Based Optimizer, which is not yet set as the default in YugabyteDB 2.19.3
yugabyte=# set yb_enable_optimizer_statistics=on;
SET
yugabyte=# set yb_enable_base_scans_cost_model=on;
SET
Here is the query with the optimizer mis-estimate (result estimated to 6 rows instead of 1124):
yugabyte=# explain (analyze, summary off)
select * from countries join cities using(country_id)
where state_code='CA' and country_code='US';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=11.25..94944.27 rows=6 width=62) (actual time=29.371..30.045 rows=1124 loops=1)
Join Filter: (countries.country_id = cities.country_id)
Rows Removed by Join Filter: 109
-> Seq Scan on countries (cost=3.83..148.58 rows=1 width=16) (actual time=1.190..1.192 rows=1 loops=1)
Remote Filter: (country_code = 'US'::text)
-> Seq Scan on cities (cost=7.42..94781.45 rows=1139 width=50) (actual time=28.173..28.521 rows=1233 loops=1)
Remote Filter: (state_code = 'CA'::text)
I know where the misestimation comes from: the selectivity of state_code='CA'
has been multiplied by the selectivity of country_code='US'
as if each state code had cities in each countries. The knowledge I have about this data is that there's usually only one country. Then can multiply the query planner estimation by the number of countries, which is 198. It is actually a bit less, as some other countries have cities in a 'CA' state for a country that is not 'US'. Anyway, this is approximative. As I've seen in the previous test that the average correlation is 190 I'll use it, hinting with Rows( countries cities *190 )
:
yugabyte=# explain (analyze, summary off)
/*+ Rows( countries cities *190 ) */
select * from countries join cities using(country_id)
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=133.25..79104.23 rows=28608870 width=62) (actual time=4.083..148.959 rows=150573 loops=1)
Hash Cond: (cities.country_id = countries.country_id)
-> Seq Scan on cities (cost=7.42..78574.69 rows=150573 width=50) (actual time=1.753..95.049 rows=150573 loops=1)
-> Hash (cost=123.36..123.36 rows=198 width=16) (actual time=2.310..2.310 rows=198 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on countries (cost=3.83..123.36 rows=198 width=16) (actual time=0.663..2.209 rows=198 loops=1)
When joining tables, you can use the Rows()
hint to adjust the estimated number of rows after joining the tables. Provide aliases for the tables whose rows you want to correct and the correction. You can use #
to set an absolute value, but *
is better to use as a multiplication factor that will still be correct when the tables grow. The number doesn't have to be exact. For instance, if you join another table from this example, a misestimate of 6 rows instead of 1124 could lead to a bad decision for a Nested Loop join, which would execute thousands of loops, but an estimation of 1000 or 2000 will not be problematic. In such cases, optimizer hints are the only way to provide good input to the query planner.