🚀 hash+hash partitioning+sharding

Franck Pachot - Jun 22 '22 - - Dev Community

In YugabyteDB there is partitioning at SQL level, the PostgreSQL declarative partitioning, and there is sharding at table/index/partition level, the automatic hash or range sharding. I explained those in a previous blog post. In short, Sharding is used to automatically distribute data across the cluster. Partitioning is used to group data together. For example, you partition by range on a timestamp to colocate on time windows for information lifecycle management. Or you group on a list of countries to map to specific regions for data residency. Or you group on a list of tenants to isolate them from the others.

PostgreSQL has also hash partitioning, but you probably don't use it in YugabyteDB because its main goal is distribution, and this is better achieved with sharding to tablets: more partitions, automatic rebalance, global indexes,...

But, never say never, what if you think you need hash partitioning plus hash sharding? It is important to be sure that their hashing method work well together to distributed data. Basically, I want that within each partition, the sharding to tablets is well balanced. One way is to look at the algorithm. Another is to test only your specific datatypes.

Here, I'll create a table with n partitions, fill it with one million rows, and look at the yb_hash_code() in each, which is the YugabyteDB hash function.

This is the script I've run, doing this from 1 to 50 partitions:

c=1000000
for n in {1..50} ; do
  echo "
  $n Partitions, $c rows:"
  {
    cat <<SQL
    drop table demo;
    create extension if not exists pgcrypto;
    create table demo (id uuid default gen_random_uuid(), val int) partition by hash(id);
SQL
    for i in $(seq 0 $(( $n -1 ))) ; do
    cat <<SQL
    create table demo$i partition of demo for values with (modulus $n , remainder $i);
SQL
    done
    cat <<SQL
    insert into demo ( val) select generate_series(1,$c);
SQL
    for i in $(seq 0 $(( $n -1 ))) ; do
    cat <<SQL
    select format('Partition %s / %s : min= %s max= %s -> %s %% rows',to_char($i,'99'),to_char($n,'99'),min(yb_hash_code(id)),max(yb_hash_code(id)),100*count(*)/$c) from demo$i;
SQL
    done
  } | psql -p 5433
done | grep Partition | tee hash-hash.log
Enter fullscreen mode Exit fullscreen mode

The result:

1 Partitions, 1000000 rows:
 Partition   0 /   1 : min= 0 max= 65535 -> 100 % rows
2 Partitions, 1000000 rows:
 Partition   0 /   2 : min= 0 max= 65535 -> 49 % rows
 Partition   1 /   2 : min= 0 max= 65535 -> 50 % rows
3 Partitions, 1000000 rows:
 Partition   0 /   3 : min= 0 max= 65535 -> 33 % rows
 Partition   1 /   3 : min= 0 max= 65535 -> 33 % rows
 Partition   2 /   3 : min= 0 max= 65535 -> 33 % rows
4 Partitions, 1000000 rows:
 Partition   0 /   4 : min= 0 max= 65535 -> 25 % rows
 Partition   1 /   4 : min= 0 max= 65535 -> 24 % rows
 Partition   2 /   4 : min= 0 max= 65535 -> 24 % rows
 Partition   3 /   4 : min= 0 max= 65535 -> 24 % rows
5 Partitions, 1000000 rows:
 Partition   0 /   5 : min= 0 max= 65535 -> 19 % rows
 Partition   1 /   5 : min= 0 max= 65535 -> 20 % rows
 Partition   2 /   5 : min= 0 max= 65535 -> 20 % rows
 Partition   3 /   5 : min= 0 max= 65535 -> 19 % rows
 Partition   4 /   5 : min= 0 max= 65535 -> 20 % rows
6 Partitions, 1000000 rows:                                                                  [183/1982]
 Partition   0 /   6 : min= 0 max= 65535 -> 16 % rows
 Partition   1 /   6 : min= 0 max= 65535 -> 16 % rows
 Partition   2 /   6 : min= 0 max= 65535 -> 16 % rows
 Partition   3 /   6 : min= 0 max= 65535 -> 16 % rows
 Partition   4 /   6 : min= 0 max= 65535 -> 16 % rows
 Partition   5 /   6 : min= 0 max= 65534 -> 16 % rows
7 Partitions, 1000000 rows:
 Partition   0 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   1 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   2 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   3 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   4 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   5 /   7 : min= 0 max= 65535 -> 14 % rows
 Partition   6 /   7 : min= 0 max= 65534 -> 14 % rows
8 Partitions, 1000000 rows:
 Partition   0 /   8 : min= 2 max= 65535 -> 12 % rows
 Partition   1 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   2 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   3 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   4 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   5 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   6 /   8 : min= 0 max= 65535 -> 12 % rows
 Partition   7 /   8 : min= 0 max= 65535 -> 12 % rows
...
50 Partitions, 1000000 rows:
 Partition   0 /  50 : min= 2 max= 65534 -> 1 % rows
 Partition   1 /  50 : min= 2 max= 65531 -> 2 % rows
 Partition   2 /  50 : min= 5 max= 65527 -> 1 % rows
 Partition   3 /  50 : min= 3 max= 65534 -> 2 % rows
 Partition   4 /  50 : min= 2 max= 65532 -> 2 % rows
 Partition   5 /  50 : min= 0 max= 65530 -> 1 % rows
 Partition   6 /  50 : min= 6 max= 65534 -> 2 % rows
 Partition   7 /  50 : min= 0 max= 65526 -> 2 % rows
 Partition   8 /  50 : min= 3 max= 65535 -> 2 % rows
 Partition   9 /  50 : min= 3 max= 65535 -> 1 % rows
 Partition  10 /  50 : min= 4 max= 65534 -> 1 % rows
 Partition  11 /  50 : min= 2 max= 65535 -> 1 % rows
 Partition  12 /  50 : min= 2 max= 65533 -> 2 % rows
 Partition  13 /  50 : min= 0 max= 65535 -> 1 % rows
 Partition  14 /  50 : min= 14 max= 65529 -> 2 % rows
 Partition  15 /  50 : min= 0 max= 65530 -> 1 % rows
 Partition  16 /  50 : min= 0 max= 65535 -> 2 % rows
 Partition  17 /  50 : min= 4 max= 65528 -> 1 % rows
 Partition  18 /  50 : min= 1 max= 65533 -> 1 % rows
 Partition  19 /  50 : min= 6 max= 65532 -> 1 % rows
 Partition  20 /  50 : min= 0 max= 65533 -> 2 % rows
 Partition  21 /  50 : min= 0 max= 65535 -> 1 % rows                                         
 Partition  22 /  50 : min= 4 max= 65534 -> 2 % rows
 Partition  23 /  50 : min= 1 max= 65529 -> 2 % rows
 Partition  24 /  50 : min= 2 max= 65530 -> 2 % rows
 Partition  25 /  50 : min= 5 max= 65529 -> 2 % rows
 Partition  26 /  50 : min= 3 max= 65518 -> 1 % rows
 Partition  27 /  50 : min= 4 max= 65534 -> 1 % rows
 Partition  28 /  50 : min= 0 max= 65529 -> 1 % rows
 Partition  29 /  50 : min= 6 max= 65530 -> 2 % rows
 Partition  30 /  50 : min= 7 max= 65522 -> 2 % rows
 Partition  31 /  50 : min= 11 max= 65533 -> 2 % rows
 Partition  32 /  50 : min= 2 max= 65534 -> 1 % rows
 Partition  33 /  50 : min= 1 max= 65526 -> 2 % rows
 Partition  34 /  50 : min= 3 max= 65535 -> 2 % rows
 Partition  35 /  50 : min= 4 max= 65534 -> 2 % rows
 Partition  36 /  50 : min= 2 max= 65534 -> 2 % rows
 Partition  37 /  50 : min= 0 max= 65528 -> 1 % rows
 Partition  38 /  50 : min= 2 max= 65532 -> 2 % rows
 Partition  39 /  50 : min= 0 max= 65534 -> 2 % rows
 Partition  40 /  50 : min= 1 max= 65527 -> 2 % rows
 Partition  41 /  50 : min= 1 max= 65535 -> 1 % rows
 Partition  42 /  50 : min= 0 max= 65535 -> 2 % rows
 Partition  43 /  50 : min= 5 max= 65533 -> 2 % rows
 Partition  44 /  50 : min= 7 max= 65533 -> 1 % rows
 Partition  45 /  50 : min= 10 max= 65534 -> 2 % rows
 Partition  46 /  50 : min= 2 max= 65534 -> 2 % rows
 Partition  47 /  50 : min= 4 max= 65534 -> 1 % rows
 Partition  48 /  50 : min= 1 max= 65526 -> 1 % rows
 Partition  49 /  50 : min= 4 max= 65528 -> 2 % rows

Enter fullscreen mode Exit fullscreen mode

In each partition, I have the whole range of hash codes (from 0 to 65535) which are used for sharding (by ranges of hash code), and the distribution of rows to partitions is well balanced.

I take one partition and look at the yb_hash_code() repartition:

yugabyte=# 

           select n,count(h) from (
            select yb_hash_code(id) h,count(*) n
            from demo42 group by yb_hash_code(id)
           ) x  group by n order by n;

 n | count
---+-------
 1 | 14791
 2 |  2216
 3 |   233
 4 |    18
 5 |     3
(5 rows)
Enter fullscreen mode Exit fullscreen mode

For 14791 rows within the 20009 of this partition, a yb_hash_code() maps to one row only. And there's only 3 yb_hash_codes() that map to 5 rows. This shows a good distribution within a partition, even when the partition is already a result of hash partitioning.

Each partition looks like this:
split
Here, this partition is split to 3 tablets, in 3 ranges of yb_hash_code(): hash_split: [0x0000, 0x5555) goes from 0 to 21844, hash_split: [0x5555, 0xAAAA) from 21845 to 43689 and hash_split: [0xAAAA, 0xFFFF] from 43690 to 65535. With more data, they will be split further.

Note that the hashing algorithm is very different: PostgreSQL uses a modulo on hash value. For this, you must know the number of partition from the beginning. YugabyteDB uses ranges, which makes it easier to split further. Here is a good read about the difference: https://ben.kirw.in/2018/12/02/hash-range-partitioning/

In summary, Hash Partitioning + Hash Sharding works to distribute data. You can test if for your partition key and your datatypes. But, most important, I think you should not need it. Hash Sharding alone should be sufficient. With hash sharding, you can have many tablets. With a 65535 hash code value, and the recommended tablet count and size, you can distribute very large tables. But if you think you do, I'm interested to know about the use case.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .