Partitioning is an option to split a table in physical different objects, while the table remains a logical single entity. Partitioning is a PostgreSQL feature, which we piggybacked on with YugabyteDB to implement features like having the ability to dedicate locations to specific partitions.
Recently we came across a situation where we found an index on the partitioned table only to be used when the filters exclusively would use one partition. If more than one partition was accessed, the plan would revert to seq scans of the partitions, which means these table partitions were scanned. In most cases an index scan requires less work and therefore is faster.
Let's setup a synthetic test case. The issue described here applies to both PostgreSQL and YugabyteDB. The creation code indicates anything YugabyteDB specific, and tells how to change it to work on PostgreSQL:
Setup objects
Create tablespaces
-- The WITH clause is YugabyteDB specific
-- This can be skipped when testing in PostgreSQL
create tablespace tablespace1 WITH (replica_placement='{
"num_replicas": 1, "placement_blocks": [{"cloud":"local","region":"local","zone":"local1","min_num_replicas":1}]
}');
create tablespace tablespace2 WITH (replica_placement='{
"num_replicas": 1, "placement_blocks":[{"cloud":"local","region":"local","zone":"local2","min_num_replicas":1}]
}');
create tablespace tablespace3 WITH (replica_placement='{
"num_replicas": 1, "placement_blocks":[{"cloud":"local","region":"local","zone":"local3","min_num_replicas":1}]
}');
Create tablespace allows placement of database files in PostgreSQL, in YugabyteDB it follows that principle. For PostgreSQL this means you can move table storage to specific mount points.
For YugabyteDB, it allows to dedicate tablet servers and setting the replication factor for the storage using num_replicas
, which it allows changing it from the general set replication factor.
The 'cloud', 'region' and 'zone' properties are the properties of the tablet servers (and beyond the scope of this article). My test setup has one tablet server for each of the zone's of 'local1', 'local2' and 'local3'.
If you want to replay this on PostgreSQL, you can skip tablespace creation.
Create table with partitioning specified
-- Change the comment to YugabyteDB for PostgreSQL
create table partition_test(
id int,
locality varchar,
f1 varchar,
f2 varchar,
primary key(id hash, locality asc) -- YugabyteDB
--primary key(id asc, locality asc) -- PostgreSQL
) partition by list(
locality
);
Add the partitions to the table with partitioning
-- This works with PostgreSQL
-- Comment out the tablespace clause if you didn't create tablespaces.
create table partition_test_local1
partition of partition_test(
id,
locality,
f1,
f2
) for values in ('local1')
tablespace tablespace1
;
create table partition_test_local2
partition of partition_test(
id,
locality,
f1,
f2
) for values in ('local2')
tablespace tablespace2
;
create table partition_test_local3
partition of partition_test(
id,
locality,
f1,
f2
) for values in ('local3')
tablespace tablespace3
;
Insert test data
insert into partition_test values
(1,'local1','aaa','bbb'),
(2,'local2','aaa','ccc'),
(3,'local3','bbb','ddd'),
(4,'local1','bbb','eee'),
(5,'local2','ccc','fff'),
(6,'local3','ccc','ggg');
Tests
No filter: all partitions seq scan
Let's see how the explain plan of a scan on a partitioned table looks like:
yugabyte=# explain select * from partition_test;
QUERY PLAN
---------------------------------------------------------------------------------
Append (cost=0.00..299.00 rows=3000 width=100)
-> Seq Scan on partition_test_local1 (cost=0.00..94.00 rows=1000 width=100)
-> Seq Scan on partition_test_local2 (cost=0.00..95.00 rows=1000 width=100)
-> Seq Scan on partition_test_local3 (cost=0.00..95.00 rows=1000 width=100)
The first thing to notice is that the Append
row source/plan node performs the gathering of the different results into a single result, and can obtain results from more than one or two results.
Filter on a subset of the primary key
Now let's apply a filter on the id
column:
yugabyte=# explain select id from partition_test where id = 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Append (cost=0.00..45.65 rows=300 width=4)
-> Index Scan using partition_test_local1_pkey on partition_test_local1 (cost=0.00..14.65 rows=100 width=4)
Index Cond: (id = 1)
-> Index Scan using partition_test_local2_pkey on partition_test_local2 (cost=0.00..14.75 rows=100 width=4)
Index Cond: (id = 1)
-> Index Scan using partition_test_local3_pkey on partition_test_local3 (cost=0.00..14.75 rows=100 width=4)
Index Cond: (id = 1)
It might seem weird that all partitions are scanned for finding value 1 in the id
column. However, please recall that the primary key was set on both the id
column as well as the locality
column, and therefore the index cannot guarantee uniqueness of the id
column alone, and thus has to scan the entire (partitioned) table.
The reason for mentioning this is that a primary key on a partitioned table must include all the partition key columns (see the PostgreSQL documentation chapter 5.10.2.3 limitations).
This means it might be tempting to think filtering on what normally would be the primary key (id
here) would use a primary key lookup.
Filter on a subset of the primary key including the partition key
If we apply the filter to the partition key (the locality field):
yugabyte=# explain select id from partition_test where locality = 'local2';
QUERY PLAN
-------------------------------------------------------------------------------
Append (cost=0.00..102.50 rows=1000 width=4)
-> Seq Scan on partition_test_local2 (cost=0.00..97.50 rows=1000 width=4)
Filter: ((locality)::text = 'local2'::text)
Yes, despite not having selected the entire primary key, filtering on the partition key is enough for partition pruning to take place, and have partitions that can safely be excluded be excluded.
The issue
Now we are getting to the situation we encountered. We have a partitioned table, and we have a number of partitions. Sometimes, it's not possible to use the defined primary key columns to query the data, but we can use the partition keys. A solution then is to create another index for the partitioned table, which does specify the partition key. However, this can be done in multiple ways.
This is the situation we encountered:
create index partition_test_locality_f1_1
on partition_test_local1 (f1, locality)
tablespace tablespace1
where locality = 'local1';
create index partition_test_locality_f1_2
on partition_test_local2 (f1, locality)
tablespace tablespace1
where locality = 'local2';
create index partition_test_locality_f1_3
on partition_test_local3 (f1, locality)
tablespace tablespace1
where locality = 'local3';
Now let's see if we can use the index to filter on both fields with single values:
yugabyte=# explain select * from partition_test where locality = 'local1' and f1 = 'aaa';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..14.50 rows=100 width=100)
-> Index Scan using partition_test_locality_f1_1 on partition_test_local1 (cost=0.00..14.00 rows=100 width=100)
Index Cond: ((f1)::text = 'aaa'::text)
This works, and leads you to thinking this is not a problem.
Now let's try two locality values:
yugabyte=# explain select * from partition_test where locality in ('local1','local2') and f1 = 'aaa';
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Append (cost=0.00..209.00 rows=2000 width=100)
-> Seq Scan on partition_test_local1 (cost=0.00..99.00 rows=1000 width=100)
Filter: (((locality)::text = ANY ('{local1,local2}'::text[])) AND ((f1)::text = 'aaa'::text))
-> Seq Scan on partition_test_local2 (cost=0.00..100.00 rows=1000 width=100)
Filter: (((locality)::text = ANY ('{local1,local2}'::text[])) AND ((f1)::text = 'aaa'::text))
It does perform the partition pruning, because it didn't access the partition for 'local3'. But that happened because we specified the partition key, and thus would allow elimination of the 'local3' partition.
But...the index access changed to a seq scan: all the rows where scanned, instead of the index that has been created being used. Why did that happen?
The reason is the indexes that are created are not partitioned indexes, but are partial indexes. The part that the partial index covers, is the partition.
The actual issue here is that the planner does not know the relationship between the partial indexes, and there has to check for every row of the index if it's applicable to the query (PostgreSQL documentation chapter 11.8 Partial indexes, bottom part). Despite the documentation saying that partial indexes would require a lot of work when these are defined per partition, which to me says it would be possible to use them, I could not find a way to force usage of per partition partial indexes.
The solution
The solution is not to use partial indexes, and define a regular index on the fields on the table being the table partition for each partition.
First drop the partial indexes:
drop index partition_test_locality_f1_1;
drop index partition_test_locality_f1_2;
drop index partition_test_locality_f1_3;
And create the indexes
create index partition_test_locality_f1_1
on partition_test_local1 (f1, locality)
tablespace tablespace1;
create index partition_test_locality_f1_2
on partition_test_local2 (f1, locality)
tablespace tablespace2;
create index partition_test_locality_f1_3
on partition_test_local3 (f1, locality)
tablespace tablespace3;
If we now explain the above query for f1 and locality again:
yugabyte=# explain select * from partition_test where locality in ('local1','local2') and f1 = 'aaa';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..10.60 rows=20 width=100)
-> Index Scan using partition_test_locality_f1_1 on partition_test_local1 (cost=0.00..5.25 rows=10 width=100)
Index Cond: (((f1)::text = 'aaa'::text) AND ((locality)::text = ANY ('{local1,local2}'::text[])))
-> Index Scan using partition_test_locality_f1_2 on partition_test_local2 (cost=0.00..5.25 rows=10 width=100)
Index Cond: (((f1)::text = 'aaa'::text) AND ((locality)::text = ANY ('{local1,local2}'::text[])))
Now the planner can use the indexes after partition pruning and therefore can use the multiple partitions of the index, just like we saw with a filter on the primary key columns.
If you define indexes on the main partitioned table, the database will create an index partition for each table partition. Sadly, currently the tablespace is not inherited from the table partitions and therefore the index partitions end up being created in the default tablespace.
This means that if you care about the table and index partition locality/tablespace, you should not define indexes on the main table for the partitioned table, but add any non-PK index later to the table partition, like was done above.
Caveat
After pruning the partition, the planner decides on the best scan path. This means it's possible to delete one of the indexes for a table partition, and leave the others in place:
drop index partition_test_locality_f1_2;
And then when we perform a SQL that touches the 'local2' partition as well as another one, we can have different access paths for each partition:
yugabyte=# explain select * from partition_test where locality in ('local1','local2','local3') and f1 = 'aaa';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..116.88 rows=1020 width=100)
-> Index Scan using partition_test_locality_f1_1 on partition_test_local1 (cost=0.00..5.26 rows=10 width=100)
Index Cond: (((f1)::text = 'aaa'::text) AND ((locality)::text = ANY ('{local1,local2,local3}'::text[])))
-> Seq Scan on partition_test_local2 (cost=0.00..101.25 rows=1000 width=100)
Filter: (((f1)::text = 'aaa'::text) AND ((locality)::text = ANY ('{local1,local2,local3}'::text[])))
-> Index Scan using partition_test_locality_f1_3 on partition_test_local3 (cost=0.00..5.26 rows=10 width=100)
Index Cond: (((f1)::text = 'aaa'::text) AND ((locality)::text = ANY ('{local1,local2,local3}'::text[])))
Mind the two index scans, and one seq scan because the index was dropped previously. This is also a situation that can happen if a partition was added, but the index creation for the partition was forgotten.