🚀 Think about Primary Key & Indexes before anything else 🐘

Franck Pachot - Nov 7 '21 - - Dev Community

A recent tweet by Nikolay Samokhvalov, draws attention to the importance of understanding database structures:

Andy Pavlo also raised the point that people are suggesting many radical changes, without understanding proper indexing:

In this post, I'll take this example to explain that thinking about indexes should not be too difficult. And, anyway, this work must be done before scaling to a distributed database. The problem was encountered in PostgreSQL. I'll run my demo on YugabyteDB to show that the concepts are the same on a distributed database. The SQL is the same.

James Long's approach was good: ask the community and provide all required information, the execution plan and index definition: https://gist.github.com/jlongster/4b31299dcb622aa7e29b59d889db2b2c#file-gistfile1-txt

With such information, the problem is easy to reproduce:

yugabyte=# \c yugabyte yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)
You are now connected to database "yugabyte" as user "yugabyte".

yugabyte=# create table messages_binary (
            "timestamp" text,
            "group_id" uuid,
            "other_column" int,
            primary key("timestamp","group_id")
           );
CREATE TABLE

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
------------------
 Seq Scan on messages_binary  (cost=0.00..105.00 rows=1000 width=52)
   Filter: (("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text) AND (group_id = '983d5259-97ff-49e3-8829-101a
b8dead92'::uuid))
(2 rows)
Enter fullscreen mode Exit fullscreen mode

This is a full table scan. And this is not what we want because it reads all rows where we need only one "group_id". What we need is a range scan.

Let's insert a few rows (3 timestamps in 3 groups):

yugabyte=# create extension pgcrypto;
CREATE EXTENSION

yugabyte=# insert into messages_binary
           with groups as (
            select gen_random_uuid() group_id from generate_series(1,3)
           )
           select
            to_char(now()+(generate_series(1,3)*interval'1 second')
             ,'yyyy-mm-ddThh24:mi:ss.000Z-')
             ||substr(gen_random_uuid()::text,25) "timestamp"
            ,group_id, 42 as "value"
           from groups;
INSERT 0 9

yugabyte=# select * from messages_binary;

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
(9 rows)
Enter fullscreen mode Exit fullscreen mode

In YugabyteDB rows are sharded and stored into the primary index itself. In PostgreSQL, they are appended into a heap table, with an additional index on the primary key. In both cases, a range scan access depends on the primary key, which is defined here as ("timestamp","group_id"). And we can see that the rows I need for group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e are scattered in this Seq Scan result.

Let's have an idea of the order of the primary key, with a SELECT ... ORDER BY on the same columns:

yugabyte=# select * from messages_binary
            order by "timestamp","group_id";

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
(9 rows)
Enter fullscreen mode Exit fullscreen mode

Now you can understand how inefficient is the query with the WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e' AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e' predicate. We start on the first row because it verifies timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e' and from this we have to scan all rows and filter them afterwards. There is no data structure where the interesting rows can be found in a small range that can be read alone. This explains the Seq Scan.

We need a structure like this one, ordered on "group_id" first:

yugabyte=# select * from messages_binary
            order by "group_id","timestamp";

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
(9 rows)
Enter fullscreen mode Exit fullscreen mode

On this structure (look at the row order, I didn't change the column order), the database engine can:

  • seek to the first group_id='e7e46753-2e99-4ee4-b77f-17136b01790e',
  • additionally seek to the first timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e',
  • read the following rows in sequence,
  • and stop at the last one with group_id='e7e46753-2e99-4ee4-b77f-17136b01790e'.

How to get this structure? Easy to define with another index:

yugabyte=# create index messages_binary_key2 
            on messages_binary ("group_id","timestamp");
CREATE INDEX
Enter fullscreen mode Exit fullscreen mode

Here is the execution plan:

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                      QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
----------------------
 Index Scan using messages_binary_key2 on messages_binary  (cost=0.00..5.25 rows=10 width=52)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26
b84edae7349e'::text))
(2 rows)
Enter fullscreen mode Exit fullscreen mode

This is efficient. It can be even better if we add all the columns selected into the index, with the INCLUDE clause, like:

yugabyte=# create index messages_binary_key2 
           on messages_binary ("group_id","timestamp")
           include ("other_column");
CREATE INDEX

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                      QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
----------------------
 Index Only Scan using messages_binary_key2 on messages_binary  (cost=0.00..5.15 rows=10 width=52)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26
b84edae7349e'::text))
(2 rows)
Enter fullscreen mode Exit fullscreen mode

I've detailed this Index Only Scan technique in:
https://blog.yugabyte.com/how-a-distributed-sql-database-boosts-secondary-index-queries-with-index-only-scan/

With a little more analysis, there's a possibility that the index on ("timestamp","group_id") is not useful at all because there's a low chance that we have a query on the timestamp only without the group_id.
Then it would be better to define the table as:

yugabyte=# create table messages_binary (
            "timestamp" text,
            "group_id" uuid,
            "other_column" int,
            primary key("group_id","timestamp")
            );
CREATE TABLE
Enter fullscreen mode Exit fullscreen mode

I'm inserting more rows and look at the execution plan:

yugabyte=# insert into messages_binary
           with groups as (
            select gen_random_uuid() group_id from generate_series(1,1e3)
           )
           select
            to_char(now()+(generate_series(1,1e4)*interval'1 second')
             ,'yyyy-mm-ddThh24:mi:ss.000Z-')
             ||substr(gen_random_uuid()::text,25) "timestamp"
            ,group_id, 42 as "value"
           from groups;

yugabyte=# analyze messages_binary;
ANALYZE

yugabyte=# EXPLAIN (analyze)
           SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                                                                                                                                                                  QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
----------------------
 Index Scan using messages_binary_pkey on messages_binary  (cost=0.00..1214.95 rows=10530 width=52) (actual time=10.588..100.838
 rows=10000 loops=1)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26
b84edae7349e'::text))
 Planning Time: 0.067 ms
 Execution Time: 101.711 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The 10000 rows are retrieved in 100 milliseconds, in my small 8 vCPU lab, from a 10 million rows table but I can tell you that it does not depend on the size of the table. This is another benefit of understanding access patterns: you know how it scales. This Index Scan is nearly O(1) on the table size. It is actually a 'log n' but for large range scan, the branch-to-leaves in B-Tree - or seek in SST files of LSM tree - is negligible. And O(n) on the result size.

The updates in italics come after discussing with Nikolay Samokhvalov (postgres.ai). We also compared YugabyteDB with PostgreSQL results. My query generating rows inserted them in "group id" order, This doesn't change anything in YugabyteDB because rows are organized on the primary key, but the performance in PostgreSQL heap tables depends on the index/table correlation. Here is an example, in PostgreSQL, with a more real-life situation ordered on timestamp:

postgres=# truncate table messages_binary;
TRUNCATE TABLE
postgres=# insert into messages_binary
           with groups as (
                       select gen_random_uuid() group_id from generate_series(1,1e3)
                      ), timestamps as (
                       select to_char(now()+(generate_series(1,1e4)*interval'1 second')
                        ,'yyyy-mm-ddThh24:mi:ss.000Z-')
                        ||substr(gen_random_uuid()::text,25) "timestamp"
                      )
                      select
                       "timestamp",group_id, 42 as "value"
                      from timestamps,groups;
INSERT 0 10000000

postgres=# select * from messages_binary limit 1;
               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-08T08:34:14.000Z-6bb27dbe2723 | 91ee7381-eb92-48cd-bb82-9ed939dc3a13 |           42
(1 row)

postgres=# EXPLAIN (analyze,buffers)
           SELECT * FROM messages_binary
           WHERE group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on messages_binary  (cost=778.61..30792.78 rows=9956 width=58) (actual time=3.021..43.141 rows=10000 loops=1)
   Recheck Cond: ((group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
   Heap Blocks: exact=10000
   Buffers: shared hit=2006 read=8167 written=2249
   ->  Bitmap Index Scan on messages_binary_pkey  (cost=0.00..776.12 rows=9956 width=0) (actual time=1.897..1.897 rows=10000 loops=1)
         Index Cond: ((group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
         Buffers: shared hit=11 read=162
 Planning Time: 0.086 ms
 Execution Time: 43.832 ms
Enter fullscreen mode Exit fullscreen mode

Storage latency and memory available for buffers will change the numbers. The automatic compaction of LSM tree in YugabyteDB, and a manual pg_repack in PostgreSQL can help. But basically we are in the same ballpark. When the response time is in orders of magnitudes higher than expected, the design of the data access is what we have to look at. Then, each engine may have some additional optimizations.

This is the best you can do for this query, without additional indexes, just defining the right order of columns in the primary key. In PostgreSQL this still has to random read from the table, but at least all is filtered from the range scan on the primary key index. In YugabyteDB, all rows will be retrieved with sequential reads:

  • sharding is done on the primary key to read from one tablet only
  • SST files have bloom filter to skip many of them
  • They index blocks to read only the required ones

An additional remark from Alvaro Hernández questions the number of rows retreived by the query:

And one more thing, James Long's execution plan shows (group_id = '983d5259-97ff-49e3-8829-101ab8dead92'::text) in the Index Conditions. Storing UUID as TXT is not efficient, I've used the uuid datatype here:

yugabyte=# select pg_column_size(
           'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid);

 pg_column_size
----------------
             16
(1 row)

yugabyte=# select pg_column_size(
           'e7e46753-2e99-4ee4-b77f-17136b01790e'::text);

 pg_column_size
----------------
             40
(1 row)
Enter fullscreen mode Exit fullscreen mode

In summary, before declaring that your "database don't scale", there's no shortcut to understanding your access pattern: the size of the expected result and the structure to access it efficiently. And do as James Long: read the execution plan and ask the community 👍

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .