After looking at the implications of the replication factor for a YugabyteDB cluster, this blogpost looks at how a table "materialises" in a YugabyteDB cluster.
The creation of a table in YugabyteDB YSQL first of all adds the table metadata to the PostgreSQL catalog, exactly like it does in PostgreSQL. This data (the metadata) is stored on the master.
Creating a table will also create the physical storage space which we call 'tablet'. Currently, a table will create its own tablet or tablets by default. An exception to a table creating its own tablets is using a colocated database, in which by default all objects will be created in the same tablet ('colocated in a tablet'). This currently has to be explicitly defined, there are discussions whether we should make colocation the default.
A synonym for a tablet is a shard.
The number of tablets
The number of tablets created for a table depends on a number of parameters set in each tablet server, if not defined with the create table statement:
-
ysql_num_tablets
(default -1) -
ysql_num_shards_per_tserver
(default -1) -
enable_automatic_tablet_splitting
(default true)
The first parameter evaluated by the tablet server is ysql_num_tablets
. If not set, it defaults to -1, which means to evaluate ysql_num_shards_per_tserver
. If set, this is the number of tablets to be created for a table if not explicitly defined in the create table or create index statement.
If the first parameter is default, then the actual parameter used for the number of tablets is ysql_num_shards_per_tserver
, which sets the number of tablets per tablet server.
This parameter by default is -1, but the actual used value is set by the tablet server on startup based on further logic:
- If the parameter
enable_automatic_tablet_splitting
is true (default), then the parameter gets set to 1. - Otherwise the parameter gets set to 2 or higher based on the number of CPUs visible in the operating system:
- 2 tablets: 1-2 CPUs.
- 4 tablets: 3-4 CPUs.
- 8 tablets: 5 or more CPUs.
Please note: these are per tablet server, so for the total number of tablets created, it's this number times the number of tablet servers.
The number of replicas
Once we know the number of tablets for a table or index, the number of replicas is reasonably easy to determine. It's the number of tablets times the replication factor set (RF).
There is an exception to this rule: it is possible to create tablespace with a different replication factor than the cluster set replication factor. Any table or index created in this tablespace will apply the tablespace set replication factor, not the cluster set replication factor.
Looking at a real-life case
Create and inspect a table
The first case to look at is the most simple RF3 case possible. This is creating a table with a SINGLE tablet, which will have 3 replicas. This is done in the following way, via ysqlsh
:
create table test (id int primary key, f1 text) split into 1 tablets;
CREATE TABLE
The information about tables, tablets and replicas can be found on:
- master:/tables
- tablet server:/tablets
Another options is to use yb_stats --print-entities
:
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING
Replicas: (yb-1.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER),)
- Object: the table I created is visible as
ysql.yugabyte.test
: aysql
table, in the databaseyugabyte
with the nametest
. - Tablet: the table ysql.yugabyte.test has a single tablet, named
ca0357120649428a8cc37ff3b6339609
. Tablets do not have names, they have UUIDs. - Replicas: the tablet
ca0357120649428a8cc37ff3b6339609
has replicas on the tablet servers with the namesyb-1.local
,yb-2.local
andyb-3.local
. The replica on nodeyb-1.local
is LEADER.
Another more high level way is to look at the tablet servers:
➜ yb_stats --print-tablet-servers
yb-1.local:9000 ALIVE Placement: local.local.local1
HB time: 0.7s, Uptime: 984, Ram 42.16 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 1/1, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 167526400 (1.56%)
yb-2.local:9000 ALIVE Placement: local.local.local2
HB time: 0.7s, Uptime: 984, Ram 34.28 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 167276544 (1.56%)
yb-3.local:9000 ALIVE Placement: local.local.local3
HB time: 0.3s, Uptime: 982, Ram 33.91 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 167587840 (1.56%)
Look at the row for tablets for each tablet server.
Observing replica availability issues: single tablet server follower failure
What would happen with the table if the tablet server on one of the nodes becomes unavailable? Let's test: let's bring down the tablet server on node 3, a node on which a follower tablet is placed.
sudo systemctl stop yb-tserver
And watch the tablet servers:
➜ yb_stats --print-tablet-servers
yb-1.local:9000 ALIVE Placement: local.local.local1
HB time: 0.4s, Uptime: 1906, Ram 38.44 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 13/24
Path: /mnt/d0, total: 10724835328, used: 167526400 (1.56%)
yb-2.local:9000 ALIVE Placement: local.local.local2
HB time: 0.4s, Uptime: 1906, Ram 31.64 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 1/1, system (leader/total): 11/24
Path: /mnt/d0, total: 10724835328, used: 167276544 (1.56%)
yb-3.local:9000 ALIVE Placement: local.local.local3
HB time: 6.4s, Uptime: 1904, Ram 34.74 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 0/24
Path: /mnt/d0, total: 10724835328, used: 167587840 (1.56%)
This looks a bit odd: the tablet server is still reported as 'ALIVE'.
If you look at the table specifics, nothing alarming is seen too:
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
All 3 replicas are still shown!
If you wait a bit, and then query the tablet servers again, the tablet server is reported as DEAD:
➜ yb_stats --print-tablet-servers
yb-1.local:9000 ALIVE Placement: local.local.local1
HB time: 0.5s, Uptime: 2152, Ram 36.54 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 12/24
Path: /mnt/d0, total: 10724835328, used: 179478528 (1.67%)
yb-2.local:9000 ALIVE Placement: local.local.local2
HB time: 0.5s, Uptime: 2152, Ram 29.35 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 1/1, system (leader/total): 12/24
Path: /mnt/d0, total: 10724835328, used: 179261440 (1.67%)
yb-3.local:9000 DEAD Placement: local.local.local3
HB time: 253.1s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 25, user (leader/total): 0/1, system (leader/total): 0/24
Please mind it still holds 1 user tablet, despite being marked 'DEAD'.
If you look at the table (entity) status now (after the tablet server is marked 'DEAD'), it combines the above data:
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING [UNDER REPLICATED]
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER[DEAD]), yb-2.local:9100(VOTER:LEADER),)
It shows the tablet server on node yb-3.local
as 'DEAD', and the tablet is marked as 'UNDER REPLICATED'. This is a more concrete report on what is going on.
This data comes from the master:/api/v1/health-check endpoint.
If we wait some more, something else happens: the replicas on the dead node are removed:
➜ yb_stats --print-tablet-servers
...snipped for brevity
yb-3.local:9000 DEAD Placement: local.local.local3
HB time: 908.7s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 0, user (leader/total): 0/0, system (leader/total): 0/0
And
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING [UNDER REPLICATED]
Replicas: (yb-1.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
In the tablet servers overview you can see all user as well as system tables are removed, and using the print-entities view you can see the tablet is still reported as UNDER REPLICATED, but the replica on the DEAD node is removed.
Now let's start table tablet server again, which will enable the replicas on it:
sudo systemctl start yb-tserver
If you use yb_stats --print-tablet-servers
, you can see that the user replica for our tablet is placed on the tablet server again. There are some system tablets, and these flow onto the enabled tablet server at a certain pace. If the cluster has higher number of user tables, the same gradual movement will be visible.
Let's test another fundamental scenario:
Observing replica availability issues: single tablet leader failure
Okay, failing a follower, so not an active replica is easy, because it's not what is actually used. But how about failing the leader replica?
Let's query the object data via --print-entities
again:
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER),)
Okay, so the leader tablet is on yb-2.local
. Let's stop that tablet server:
sudo systemctl stop yb-tserver
And look at the object:
➜ yb_stats --print-entities --table-name-match test
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.ca0357120649428a8cc37ff3b6339609 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-2.local:9100(VOTER), yb-3.local:9100(VOTER:LEADER),)
If you been really quickly, you might have found yb-2.local
to be still the LEADER, but in a few seconds you should see the LEADER moving to another node. In the above case it's yb-3.local
.
After this, the sequence of events that happens is identical to the previous investigation.
So, what is happening?
What is happening is dependent on configurable parameters in the follow order:
- The first thing that (might, in case of a LEADER) happen, is the detection of a LEADER replica to be considered failed. LEADER failure is dependent on the parameters
raft_heartbeat_interval_ms
(default: 500) andleader_failure_max_missed_heartbeat_periods
(default: 6). This means that a LEADER failure requires at least 3 seconds to be detected, which is why if you are quickly to check after stopping a tablet server, you can see the LEADER before it fails. What leader failure detection does is initiate an election among the surviving RAFT members. - The second thing that happens is that the tablet server that was stopped is marked 'DEAD' by the master leader. In fact, if you closely monitor the HB time (heartbeat time) using
--print-tablet-servers
you will see the HB time going up after stopping the tablet server. The threshold for marking a tablet server is dependent on the parametertserver_unresponsive_timeout_ms
(default: 60000). - The third thing that happens is that the
dead_nodes
andunder_replicated_tablets
lists are set (in master:/api/v1/health-check) as soon as a tablet server is marked dead. This allowsyb_stats
to mark the tablets and the replicas accordingly. However, outside of setting the status of the tablet server, and having the master creating these lists, and the initial LEADER movement, nothing else happens. The replicas which were located on the DEAD node are still placed there. - The fourth thing that happens is dealing with the replica considered being unavailable. This is done when the master determines a follower as failed, for which the time is set with the parameter
follower_unavailable_considered_failed_sec
(default: 900). What happens depends on situation:- If a tablet server is available that does not already host a replica of the tablet, then a new replica is created on the available tablet server (which we call 'remote bootstrapping').
- If no tablet server is available, then the replica is removed from the list of replicas, even if that means the tablet getting under replicated.
In the case of this example with an RF3 cluster with 3 nodes, with one tablet server being unavailable, the last scenario will be happening: the replica on the unavailable host is removed, leaving the tablet running but under replicated.
It is very important to realize this blogpost was very specifically about single node failure, which means the surviving replicas have a majority and therefore have the ability to vote a new leader, and not disrupt usage of the tablet it is serving.