In the previous articles we seen how the replication factor (RF) works, how tablets and their replicas work based on the replication factor, and what happens when a minority fails (the tablet remains serving) and when a majority fails (the tablet goes down).
But what happens when a minority fails and is destroyed 'forever'?
I started up minimalistic YugabyteDB cluster with 3 tablet servers:
➜ yb_stats --print-tablet-servers
yb-1.local:9000 ALIVE Placement: local.local.local1
HB time: 0.9s, Uptime: 487, Ram 33.90 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 166162432 (1.55%)
yb-2.local:9000 ALIVE Placement: local.local.local2
HB time: 0.9s, Uptime: 487, Ram 33.10 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 166121472 (1.55%)
yb-3.local:9000 ALIVE Placement: local.local.local3
HB time: 0.5s, Uptime: 486, Ram 34.31 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
Path: /mnt/d0, total: 10724835328, used: 166395904 (1.55%)
For the sake of the demo, we need to have actual user objects.
Let's create a simple table, and fill it with data:
create table test (id int primary key, f1 text) split into 1 tablets;
insert into test select id, repeat('x',1000) from generate_series(1,1000000) id;
Let's see how that looks like:
➜ yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-1.local:9100(VOTER),)
One object (ysql.yugabyte.test), with one tablet, with 3 replicas. All three replicas are 'VOTER', the replica on yb-2.local is LEADER.
Exterminate!
Now let's simulate permanent tablet server 'destruction'. This is a combination of stopping the tablet server:
sudo systemctl stop yb-tserver
Which is what is shown in
Watching a YugabyteDB table: tablets and replicas.
...and the second step is to delete the tablet server data, which is stored at the location indicated by the parameter fs_data_dirs
:
-
fs_data_dirs
/pg_data (intermediate postgres files for YSQL) -
fs_data_dirs
/yb-data/tserver Obviously this makes the data from the tablet server be gone. In my cluster this done in this way:
sudo rm -rf /mnt/d0/pg_data
sudo rm -rf/mnt/d0/yb-data/tserver
Important!
But there is more, which is important to know: if the tablet server now is started again, it will initialise itself because there is no prior data, and start as a completely new tablet server, despite it having the exact same hostname and port number. This is visible using the tablet server screen in the master (master/tablet-servers) with the tablet server UUID: that has changed, making it a totally different server for the cluster.
The situation after termination
I stopped the tablet server on my node yb-3.local
(my "third node"). After tserver_unresponsive_timeout_ms
time, the tablet server will be considered DEAD, and the thus the tablet under replicated:
➜ yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING [UNDER REPLICATED]
Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER[DEAD]), yb-1.local:9100(VOTER),)
And because the number of unavailable replicas is a minority, it will be removed from the tablet administration after follower_unavailable_considered_failed_sec
.
Adding the tablet server
Because we removed the data directories, the tablet server on the third node will be a completely new tablet server, which for the cluster means it has a different UUID. The cluster does not take the server name or ip address into account for data usage, it uses the UUID exclusively.
Start up the tablet server:
sudo systemctl start yb-tserver
If now the database objects on the YugabyteDB cluster are queried again you will find the tablet having added the 3rd replica:
➜ yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
Replicas: (yb-2.local:9100(VOTER:LEADER), yb-1.local:9100(VOTER), yb-3.local:9100(VOTER),)
This is great, but you should be aware a lot of things happened automatically, which the YugabyteDB cluster handled for you:
The situation before starting the tablet server that we "exterminated" by removing the data from it was (for the user table) that:
- The table is using a single tablet, and there was 1 user tablet in this test scenario.
- The tablet is using replication factor 3, so for it to function we need a majority (2 replicas) at least, which allows a leader to be elected, which was the case.
When the tablet server was started:
- The cluster (load balancer) found a placement option for the third replica.
- The third replica was placed on the tablet server using a process called 'remote bootstrapping'.
- After the replica was bootstrapped, the data was transferred.
- The newly added replica was synchronised and added to the tablet, making it a fully functioning replica again.
Warning
On my test cluster, this mostly happened. My test cluster is sized smaller than the minimal sizing rules found here: hardware requirements.
As a consequence, the remote bootstrapping put the replica in the status 'PRE_VOTER', and tried allocating a buffer for data transfer, which it couldn't because the memory area was too small, and therefore stopped after bootstrapping in the PRE_VOTER state. To solve this issue, I had to add --remote_bootstrap_max_chunk_size=20000000
to the tablet server configuration file, this will be the subject of another blogpost.
Conclusion
This blogpost shows one of the key strengths of a YugabyteDB database cluster. Completely transparently to the database APIs, we simulated a failure of a node, which didn't interrupt availability of the database, and once we introduced a "replacement" node, the database repaired itself.