SQL doesn't need the "Single Table Design" but Composite Primary Keys

Franck Pachot - Jan 10 '23 - - Dev Community

DynamoDB is a NoSQL database where there are no joins. You store pre-joined documents, choosing the hierarchical view for your use case. Let's take a simple example. You have documents, with a header, and subdocuments, with a body. One document has many subdocuments but a subdocument belongs to only one document.

  +--------------------+             +--------------------+
  |      Document      |<>-----------|    Subdocument     |
  +--------------------+             +--------------------+
  |+ docid: int <<PK>> |             |+ subid: int <<PK>> |
  |--------------------|             |--------------------|
  |- header: json      |             |- body: json        |
  +--------------------+             |- docid: int <<FK>> | 
                                     +--------------------+
Enter fullscreen mode Exit fullscreen mode

(I used the UML Composition representation as this is what it is about: a composite aggregation association, which is a 'part of' relationship)

This can be stored with the document id as the partition key and the subdocuments in a JSON attribute, with the header. You can read and write whole documents by their document id with efficient GetItem and PutItem.

However, applications share data with multiple use cases. This is the reason for relational databases and joins. But we will see that later.

If you want direct read access to a subdocument, you can store them as items in another DynamoDB table. A Documents table with the document id as the partition key, and a Subdocuments table with the subdocument id as the partition key. The subdocuments will have the document id as an attribute, to link them to their common document.

But a NoSQL database has no joins and no cross-shard transactions. You cannot get a consistent view of it when you want the document with its subdocuments.

DynamoDB Single Table Design

The solution is to store them in the same table with the Single Table Design. The key will be either a document id or a subdocument id and to distinguish you may add a type to it.

But there's one more concern. One access pattern is to get all subdocuments for one document. If the document id is just an attribute of the subdocument, you can create a Global Secondary Index on it, but consistency is lost. The solution, with the Single Table Design is to put the document id as a prefix of the key. In the single table, a document id will be the key for the document, and a document id concatenated to subdocument id will be the key for the subdocument. For this simple case, document id will be the partition key and subdocument idthe sort key. This has the advantage of getting one of them with a GetItem or a whole document with its subdocuments with Query.

What about SQL databases?

Seeing that, you may think that storing a whole object hierarchy in one table is a good thing and you may be tempted to do the same with a SQL database. But that's a misconception. The benefit comes from the composite key, where the parent key is concatenated as a prefix of the child key. Having them in the same table is just a workaround for the lack of consistent transactions. The interleaving itself doesn't really bring more value for data access performance.

Interleaving is not an innovation

Since early versions of Oracle Database the idea of clustering the master-detail tables was there with the CLUSTER object. Here is how you would create two tables that are physically stored into one, clustered on the document id:

create cluster document (docid int);

create table doc_header (
 primary key(docid)
 , docid int generated always as identity
 , header varchar2(4000)
) cluster document (docid);

create table subdocument (
 primary key(docid, subid)
 , docid int references doc_header
 , subid int generated always as identity
 , body varchar2(4000)
) cluster document (docid);

create index document on cluster document;
Enter fullscreen mode Exit fullscreen mode

The execution plan to select one document (document id = 42) with all subdocuments would have been a single index access:

------------------------------------------------------
| Id  | Operation                    | Name          |
------------------------------------------------------
|   0 | SELECT STATEMENT             |               |
|   1 |  NESTED LOOPS                |               |
|   2 |   TABLE ACCESS BY INDEX ROWID| DOC_HEADER    |
|*  3 |    INDEX UNIQUE SCAN         | SYS_C00415367 |
|*  4 |   TABLE ACCESS CLUSTER       | SUBDOCUMENT   |
------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("DOC_HEADER"."DOCID"=42)
   4 - filter("SUBDOCUMENT"."DOCID"=42)

Enter fullscreen mode Exit fullscreen mode

The TABLE ACCESS CLUSTER is within the same blocks as the physical structure is the CLUSTER. In practice, that has never been used widely: very small benefits for lot of constraints in space management.

Interleaving has never been a good idea

The gain is actually minimal and colocating rows with different structures and cardinalities bring many other problems.

More recently the same was observed in some Distributed SQL databases. Google Spanner started by supporting foreign keys only with interleaved tables. CockroachDB started with the same and removed them for the same reasons: limited theoretical pros and a large cons list.

The most important is to group the subdocuments belonging to the same document together. Having them at a different place than the document header doesn't make a big difference: two reads instead of one. In all cases, this doesn't depend on the size of the tables. And anyway, for space management in the storage, you have to read many chunks even if all is logically clustered.

Storing the subdocument together is easy in a database where rows are stored in the Primary Key structure (like SQL Server, Oracle with IOT, MySQL, or YugabyteDB). With databases storing in heap tables (like Oracle default tables or PostgreSQL) it is more difficult. So, with databases storing in the primary index, you just need to have the "document id" in front.

Composite key for master-detail

Like what was re-discovered with DynamoDB Single Table Design, the key to scalability is the common prefix of composite keys. Like in my example above with Oracle CLUSTER: primary key(docid, subid) is the key of the Subdocument table. The primary key is sorted, in B-Tree or LSM-Tree, for fast access. Then, having the parent key in front keeps the child together. In a master-detail composition, the foreign key to the parent is the first column of the primary key.

With this easy design:

  • you don't need an additional index on the foreign key
  • you can partition on the same key (for partition-wise join)
  • you can read all child rows with a single range scan
  • you benefit from read-ahead and cache locality

This was well known in SQL but, for what I think was a mistake, composite keys became unpopular and this design was not possible anymore. I summarized this in a tweet:

I point to the ORMs (Object-Relational Mappers) as the culprits because they discouraged the use of composite keys. Probably because it was simpler for them to have all primary keys as single columns. And then came the confusion between composite keys and natural keys: if you recommend a surrogate key for all tables, then you have only single-column primary keys.

But if you look at my two tables above, I'm using generated keys everywhere. The Subdocument primary key is a concatenation of two surrogate keys.

The Hibernate documentation at that time mentioned in the mapping section: There is an alternative declaration that allows access to legacy data with composite keys. Its use is strongly discouraged for anything else.

Hibernate In Action associated Legacy schemas and composite keys
“Legacy schemas and composite keys”

JPA says the same: Composite primary keys typically arise when mapping from legacy databases when the database key is comprised of several columns.

I think this was a big misunderstanding, missing the following points:

  • composite id doesn't imply natural id
  • there's no reason to add a surrogate key to an association table or the child table of a composition
  • ORMs can map composite key with @EmbeddedId

If a tool doesn't support composite key, you should fix the tool. Not the data model. Or it is not a tool for SQL databases.

I'll show the two solutions, composite key or surrogate key, in a modern Distributed SQL database: YugabyteDB

Single-column key for the child does random reads

Here is the creation of the two tables where the key of subdocument is only subid, and docid is only a foreign key:

create table doc_header (
 primary key(docid)
 , docid  bigint generated always as identity
 , header jsonb
) split into 10 tablets;

create table subdocument (
 primary key(subid)
 , docid bigint references doc_header
 , subid bigint generated always as identity
 , body  jsonb
) split into 10 tablets;

set yb_enable_upsert_mode = on;

insert into doc_header(header) select '{}' from generate_series(1,100000) docid;

insert into subdocument(docid, body) select docid, '{}' from doc_header,generate_series(1,100) subid;


create index subdoc_doc_fk on subdocument(docid);

Enter fullscreen mode Exit fullscreen mode

I have inserted one hundred thousand documents with an hundred of subdocuments each. Because docid is not part of the primary key, I need a secondary index to be able get all subdocuments for a document.

Here is the execution plan when reading one document with all its subdocuments:

yugabyte=# explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Nested Loop (actual time=4.279..4.322 rows=100 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.263..1.264 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
   ->  Index Scan using subdoc_doc_fk on subdocument (actual time=3.009..3.040 rows=100 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
         Storage Table Read Requests: 1
         Storage Table Execution Time: 2.000 ms
 Planning Time: 0.087 ms
 Execution Time: 4.366 ms
 Storage Read Requests: 3
 Storage Write Requests: 0
 Storage Execution Time: 4.000 ms
 Peak Memory Usage: 30 kB
(17 rows)


Enter fullscreen mode Exit fullscreen mode

Contrary to the myth that joins don't scale, this is fast and not dependent on the size of the tables (you may have seen large time complexity formulas like O(log(N)+Nlog(M)+Nlog(M)+Nlog(M)) - they are wrong).

One row was read, by primary key, from the document table. Then the index entry for the subdocuments (Storage Index Read Requests). Then the 100 subdocuments from the table (Storage Table Read Requests). Those 100 rows are scattered within the table, which can add latency if they are disk I/O and this may not scale except with lot of memory. You don't see it here but if you run the same query for different values from many sessions, you will see more disk I/O happening.

Composite key for the child table for efficient scan

Here is the same where the foreign key belongs to the primary key. There's no need for a secondary index here, which is a big advantage:

create table doc_header (
 primary key(docid)
 , docid  bigint generated always as identity
 , header jsonb
) split into 10 tablets;

create table subdocument (
 primary key(docid, subid)
 --primary key(subid)
 , docid bigint references doc_header
 , subid bigint generated always as identity
 , body  jsonb
) split into 10 tablets;

set yb_enable_upsert_mode = on;

insert into doc_header(header) select '{}' from generate_series(1,100000) docid;

insert into subdocument(docid, body) select docid, '{}' from doc_header,generate_series(1,100) subid;
Enter fullscreen mode Exit fullscreen mode

Here is the nested loop between the two primary keys with Storage Index Read Requests and no Storage Table Read Requests:

yugabyte=# explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Nested Loop (actual time=3.030..3.068 rows=100 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.418..1.419 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 2.000 ms
   ->  Index Scan using subdocument_pkey on subdocument (actual time=1.607..1.633 rows=100 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
 Planning Time: 0.085 ms
 Execution Time: 3.108 ms
 Storage Read Requests: 2
 Storage Write Requests: 0
 Storage Execution Time: 3.000 ms
 Peak Memory Usage: 30 kB
(15 rows)
Enter fullscreen mode Exit fullscreen mode

This makes it even more scalable. Being an Index Scan on the primary key, it reads all rows sequentially.

Scalability thanks to YugabyteDB sharding

I'm running the same with a single table (range sharding) to show the scalability, because with range sharding the default is to start with one tablet:

create table doc_header (
 primary key(docid asc)
 , docid  bigint generated always as identity
 , header jsonb
);

create table subdocument (
 primary key(docid asc, subid)
 --primary key(subid)
 , docid bigint references doc_header
 , subid bigint generated always as identity
 , body  jsonb
);

insert into doc_header(header) select '{}' from generate_series(1,100000) docid;

insert into subdocument(docid, body) select docid, '{}' from doc_header,generate_series(1,100) subid;
Enter fullscreen mode Exit fullscreen mode

Compared to the previous with 10 tablets the time is 10 times longer:

yugabyte=# explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=61.705..61.760 rows=100 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.397..1.399 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
   ->  Index Scan using subdocument_pkey on subdocument (actual time=60.300..60.342 rows=100 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 61.000 ms
 Planning Time: 0.087 ms
 Execution Time: 61.807 ms
 Storage Read Requests: 2
 Storage Write Requests: 0
 Storage Execution Time: 62.000 ms
 Peak Memory Usage: 30 kB
(15 rows)
Enter fullscreen mode Exit fullscreen mode

The previous execution plans makes it clear that the response time doesn't depends on the total size of the tables as it reads only the rows that are needed for the result: 1 row from the outer table document, in the read request, and 100 rows from the subdocument, in one read request.

The time depends mostly on the size of the result, 100 rows here, and as seen previously, this scales with hash sharding as in the previous test.

Let's add more subdocuments:

insert into subdocument(docid, body) select 42, '{}' from generate_series(101,1000) subid;

yugabyte=> explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=64.125..64.480 rows=1000 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.374..1.375 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
   ->  Index Scan using subdocument_pkey on subdocument (actual time=62.744..62.994 rows=1000 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 63.000 ms
 Planning Time: 0.082 ms
 Execution Time: 64.558 ms
 Storage Read Requests: 2
 Storage Write Requests: 0
 Storage Execution Time: 64.000 ms
 Peak Memory Usage: 30 kB
(15 rows)

insert into subdocument(docid, body) select 42, '{}' from generate_series(1001,10000) subid;

yugabyte=> explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                               QUERY PLAN                                              
--------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=64.850..101.455 rows=10000 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.371..1.372 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
   ->  Index Scan using subdocument_pkey on subdocument (actual time=63.472..98.891 rows=10000 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 10
         Storage Index Execution Time: 97.001 ms
 Planning Time: 0.087 ms
 Execution Time: 101.889 ms
 Storage Read Requests: 11
 Storage Write Requests: 0
 Storage Execution Time: 98.001 ms
 Peak Memory Usage: 30 kB
(15 rows)
Enter fullscreen mode Exit fullscreen mode

When the number is higher than 1024 (the default for ysql_prefetch_limit) we have more read requests. A document with 10000 subdocuments is fetched in 100 milliseconds which is not a lot when compared to the 60ms for 100 rows.

The table is still considered small for auto-splitting:

yugabyte=> select num_tablets,
          pg_size_pretty(pg_table_size('subdocument'::regclass))
          from yb_table_properties('subdocument'::regclass);

 num_tablets | pg_size_pretty
-------------+----------------
           1 | 354 MB

(1 row)
Enter fullscreen mode Exit fullscreen mode

I increased the size by running those random updates for a long time:

create extension orafce;
update subdocument set body=format('{ "filler":"%s"}',dbms_random.string('A',100000))::jsonb where docid = (random()*100000)::int
\watch 0.1
Enter fullscreen mode Exit fullscreen mode

Automatically the table is split to balance the load:

yugabyte=> select num_tablets,
          pg_size_pretty(pg_table_size('subdocument'::regclass))
          from yb_table_properties('subdocument'::regclass);

 num_tablets | pg_size_pretty
-------------+----------------
          12 | 6759 MB

(1 row)
Enter fullscreen mode Exit fullscreen mode

For this 10 times larger table, the response time is still under 500 milliseconds:

yugabyte=> explain (analyze, dist, costs off)
           select * from doc_header join subdocument using(docid)
           where docid=42;

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=18.812..251.278 rows=10000 loops=1)
   ->  Index Scan using doc_header_pkey on doc_header (actual time=1.262..1.263 rows=1 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 1
         Storage Index Execution Time: 1.000 ms
   ->  Index Scan using subdocument_pkey on subdocument (actual time=17.544..248.527 rows=10000 loops=1)
         Index Cond: (docid = 42)
         Storage Index Read Requests: 10
         Storage Index Execution Time: 244.001 ms
 Planning Time: 0.091 ms
 Execution Time: 251.845 ms
 Storage Read Requests: 11
 Storage Write Requests: 0
 Storage Execution Time: 245.001 ms
 Peak Memory Usage: 2006 kB
(15 rows)
Enter fullscreen mode Exit fullscreen mode

Because I've defined range sharding and I'm querying only for one docid, I'm reading only one tablet. But thanks to auto-splitting, concurrent queries reading other documents will load other servers. This balances the disk I/O, the RAM usage and the CPU, so the total workload scales with range sharding as well.

Huge joins can scale in SQL databases

This was reading one document, joining it with thousands of large subdocuments in 250 milliseconds. Now let's try to join all documents with their subdocuments. We have 10 millions rows from that:

explain (analyze, dist, costs off)
/*+ mergejoin(doc_header subdocument) */
select docid,header,count(*)
 from doc_header join subdocument using(docid)
group by docid
;

                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 GroupAggregate (actual time=4737.375..488002.905 rows=100000 loops=1)
   Group Key: doc_header.docid
   ->  Merge Join (actual time=4735.282..487123.055 rows=10009900 loops=1)
         Merge Cond: (subdocument.docid = doc_header.docid)
         ->  Index Scan using subdocument_pkey on subdocument (actual time=4728.913..485443.225 rows=10009900 loops=1)
               Storage Index Read Requests: 9784
               Storage Index Execution Time: 479679.699 ms
         ->  Index Scan using doc_header_pkey on doc_header (actual time=6.359..56.974 rows=100000 loops=1)
               Storage Index Read Requests: 98
               Storage Index Execution Time: 6.000 ms
 Planning Time: 0.156 ms
 Execution Time: 488017.105 ms
 Storage Read Requests: 9882
 Storage Write Requests: 0
 Storage Execution Time: 479685.699 ms
 Peak Memory Usage: 100431 kB
(16 rows)
Enter fullscreen mode Exit fullscreen mode

On this small cluster (3 nodes 4 vCPU) it takes 8 minutes to join all those rows. What would we gain here by interleaving the rows from the two tables? Nothing significant. The join is fast here because there's no additional operation. Thanks to having the join column in front of their primary key, the Index Scan return the rows in the right order and Merge Join interleave them while iterating on the two sources.

There's no need for a "Single Table Design" in a SQL database. A composite key is sufficient to keep the rows together. Storing them in different tables gives more agility in space management and allows to read one, or the other, or both. With databases that stores the tables in their primary index, like YugabyteDB LSM-tree, all those access patterns are optimal, with no additional reads, and without maintaining additional indexes.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .