Watching a YugabyteDB table: majority failure

Frits Hoogland - Jan 19 '23 - - Dev Community

After looking at the replication factor, tables, tablets, replicas and single replica failure, let's look at a scenario where multiple replica's have become unavailable.

The first thing to say about that is that using the replication factor (RF), you can increase failure tolerance by increasing the the replication factor: a replication factor of 5 allows 2 replicas to become unavailable, and still have a majority for the remaining, available, replicas: 3. With a replication factor of 7 this gets up to 3, etcetera.

The replication factor for the cluster/universe that has read/write access must be an odd number because that forces an uneven split of a group, which forces a single group to be a majority at all times.

Observing replica availability issues: majority failure

What happens when a majority of the replicas fail? This scenario can be accomplished quite simply in an RF3 situation with three tablet server by stopping two tablet servers, so the surviving tablet server, and thus any replica on it, is in a minority.

Create a test table:

create table test (id int primary key, f1 text) split into 1 tablets;
Enter fullscreen mode Exit fullscreen mode

Validate test table:

➜ yb_stats --print-entities --table-name-match test
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.e159893575f2476f86751c1448701aef state: RUNNING
    Replicas: (yb-1.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER), yb-3.local:9100(VOTER),)
Enter fullscreen mode Exit fullscreen mode

Table: ysql.yugabyte.test
Tablet: ysql.yugabyte.test.e159893575f2476f86751c1448701aef
Replicas: yb-1.local, yb-2.local, yb-3.local, where the replica on yb-1.local is leader.

Now stop the tablet servers on nodes yb-1.local and yb-2.local.

Please mind, for the the sake of simplicity and to avoid having multiple mechanisms in play, that I explicitly stop the tablet servers, not the entire nodes, which would mean the master servers would be stopped too.

This is how the situation with stopped tablet servers looks like:

The tablet servers appear ALIVE at first, and the only thing from which you could derive they are not actually up anymore is the 'HB time' (heartbeat time) going up, to a value higher than one or a few seconds:

➜ yb_stats --print-tablet-servers
yb-1.local:9000      ALIVE Placement: local.local.local1
                     HB time: 7.3s, Uptime: 25, Ram 38.80 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 209428480 (1.95%)
yb-2.local:9000      ALIVE Placement: local.local.local2
                     HB time: 6.4s, Uptime: 30, Ram 34.60 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 1/1, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 207933440 (1.94%)
yb-3.local:9000      ALIVE Placement: local.local.local3
                     HB time: 0.4s, Uptime: 3473, Ram 45.49 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 191819776 (1.79%)
Enter fullscreen mode Exit fullscreen mode

In the table view, everything appears to be fine, whilst the majority of the tablet servers is unavailable:

➜ yb_stats --print-entities --table-name-match test
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.e159893575f2476f86751c1448701aef state: RUNNING
    Replicas: (yb-1.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER), yb-3.local:9100(VOTER),)
Enter fullscreen mode Exit fullscreen mode

After the 60 seconds threshold for determining a tablet server to be DEAD, the status changes to DEAD:

➜ yb_stats --print-tablet-servers
yb-1.local:9000      DEAD Placement: local.local.local1
                     HB time: 136.7s, Uptime: 0, Ram 0 B
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 1/1, system (leader/total): 8/24
yb-2.local:9000      DEAD Placement: local.local.local2
                     HB time: 136.7s, Uptime: 0, Ram 0 B
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
yb-3.local:9000      ALIVE Placement: local.local.local3
                     HB time: 0.1s, Uptime: 2205, Ram 21.69 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 25, user (leader/total): 0/1, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 175554560 (1.64%)
Enter fullscreen mode Exit fullscreen mode

After this timeout, the master fills the dead_nodes and under_replicated_tablets lists in master/api/v1/health-check, so --print-entities can inject this into the entities view:

➜ yb_stats --print-entities --table-name-match test
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.e159893575f2476f86751c1448701aef state: RUNNING [UNDER REPLICATED]
    Replicas: (yb-1.local:9100(VOTER:LEADER[DEAD]), yb-2.local:9100(VOTER[DEAD]), yb-3.local:9100(VOTER),)
Enter fullscreen mode Exit fullscreen mode

Nothing else will happen. The dead nodes are not removed, the tablet leader remains on the node/tablet server where it was. While the tablet will STOP functioning, which is conflicting with the LEADER state being shown!!

What is happening here?

First of all, when a majority for a RAFT group, such as the tablet that can be seen above gets unavailable, it depends on whether any surviving replica serves as LEADER or not.

  • If a LEADER is among the surviving replicas, it might act on a read request, or try to get consensus for a write request, which it will not get. If its so-called 'LEADER lease' time expires, it will force an election, for which no majority of votes can be obtained, and the status is reverted to FOLLOWER.
  • If no LEADER is among the surviving replicas, the so-called 'LEADER lease' timer will expire on the FOLLOWER too, which will force an election, for which no majority of votes can be obtained.

The default LEADER lease time is 2 seconds, so within 2 seconds the tablet (alias RAFT group) will become unavailable because of minority.

However, after RAFT protecting the state, the current replica states are not reflected in via the HTTP endpoints. Instead, the HTTP endpoints reflect the last known consistent 'good' state of the RAFT group.

The annotation that yb_stats --print-entities provides by combining the entity information with the master/api/v1/health-check information gives more insight to the actual current situation, but it requires interpretation.

One of the most important things to realise is that the RAFT group will loose the LEADER, and thus make the RAFT group/tablet unavailable, whilst the status of the RAFT group still shows a LEADER, which might even be on a node that is not marked 'DEAD'. This is because that status is not the current status, a minority will lead to loss of the RAFT LEADER in 2 seconds, which will terminate the ability to use the tablet.

The next obvious question is: how can I restore the tablet and thus restore access to it?

If possible: startup a number of replicas so the RAFT group can get a majority, essentially independent which replica this is. As soon as a majority is available, it can apply the changes to get all replicas to current state, and elect a LEADER.

Otherwise a backup of the replica data must be restored, so it can be recovered to current.

It is not possible to start copying around one of the remaining replicas, because without a majority of the actual independent replicas, it's impossible to determine the last state of the data that the RAFT group protects.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .