Crash on clock skew: performance vs availability

Franck Pachot - May 11 - Dev Community

In my previous post, I explained how to artificially create clock skew when testing YugabyteDB in a lab using two Docker containers. I started one container with its clock lagging 499 milliseconds behind the wall clock. This works because the maximum clock skew allowed is 500 milliseconds by default, defined by --max_clock_skew_usec=500000.

This configuration prevents consistency violations by detecting when the timestamps of two values are within the 500-millisecond uncertainty range. The read operation is then retried at a newer read time to guarantee the sequential ordering of changes. Performance is affected, but consistency is maintained.
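To make the uncertainty range concrete, here is a minimal shell sketch of the check described above. The function name needs_read_restart and the integer microsecond timestamps are illustrative assumptions, not YugabyteDB internals: a value written after the read time, but within the maximum clock skew, is ambiguous, so the read is retried at a newer read time.

```shell
# Illustrative only: timestamps are integer microseconds.
MAX_CLOCK_SKEW_USEC=500000   # default --max_clock_skew_usec

# Succeeds (exit status 0) when a value's write time falls in the uncertainty
# window (read_time, read_time + max skew], where the clocks alone cannot tell
# whether the write happened before or after the read.
needs_read_restart() {
  local read_time=$1 write_time=$2
  [ "$write_time" -gt "$read_time" ] &&
    [ "$write_time" -le $((read_time + MAX_CLOCK_SKEW_USEC)) ]
}

# Example: a write stamped 200 ms after the read time is ambiguous.
if needs_read_restart 1000000 1200000; then
  echo "retry the read at a newer read time"
fi
```

A write stamped before the read time is simply visible, and one stamped beyond the window is simply ignored: only the window in between costs a retry, which is why a larger maximum clock skew hurts performance.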

Monitoring NTP synchronization

When NTP synchronization works correctly, the clock should not drift by more than 0.5 seconds. If NTP synchronization fails, the clock may drift, but only slowly. The issue must be fixed before the drift reaches 500 milliseconds, which is not a problem if NTP is correctly monitored.

However, the clocks can be incorrect in some scenarios, such as after a virtual machine hibernation. It is crucial to avoid accepting transactions when the clock skew exceeds the defined limit since it can cause inconsistencies. If the YugabyteDB tablet server detects such a situation, it crashes to prevent inconsistencies.

The recommended solution is to monitor NTP synchronization. In case of failure, fix it before the skew exceeds 500 milliseconds or stop the server. An enhancement in future versions will refuse to start a tablet server that is not synchronized: https://github.com/yugabyte/yugabyte-db/issues/22255.
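As an illustration of such monitoring, here is a minimal sketch that assumes the node uses chrony as its NTP client. It parses the "System time" and "Leap status" lines of the chronyc tracking output. The function names and the 100-millisecond alert threshold are my own assumptions, chosen only to leave a margin below the 500-millisecond limit.

```shell
# Alert threshold in microseconds: an assumption, set well below the
# 500 ms max_clock_skew_usec so there is time to react.
ALERT_OFFSET_USEC=${ALERT_OFFSET_USEC:-100000}

# Reads `chronyc tracking` output on stdin and prints the system clock
# offset from NTP time, in microseconds (chronyc reports it unsigned,
# followed by "fast" or "slow").
tracking_offset_usec() {
  awk '/^System time/ { printf "%.0f\n", $4 * 1000000 }'
}

# Succeeds when the tracking output on stdin shows a synchronized clock
# whose offset is below the alert threshold.
clock_is_healthy() {
  local tracking offset
  tracking=$(cat)
  printf '%s\n' "$tracking" | grep -q '^Leap status *: Normal' || return 1
  offset=$(printf '%s\n' "$tracking" | tracking_offset_usec)
  [ -n "$offset" ] && [ "$offset" -lt "$ALERT_OFFSET_USEC" ]
}

# Typical use on a node (requires chrony):
#   chronyc tracking | clock_is_healthy || echo "clock unhealthy: fix NTP or stop the server"
```

Run from a scheduler or a monitoring agent, a check like this raises the alarm long before a slowly drifting clock gets close to the maximum clock skew.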

Crash when max clock skew is exceeded

By default, the YugabyteDB tablet server crashes if it detects an out-of-range clock skew. This maintains consistency but can also affect availability. Whenever two nodes communicate, the Hybrid Logical Clock (HLC) synchronization process calculates a new HLC time using the Lamport algorithm: each node sets its HLC to the maximum of its local HLC, the remote HLC, and its local physical time. This ensures that the HLC never decreases. However, if the calculated HLC is more than 500 milliseconds ahead of the local physical time, the node crashes to enforce the maximum clock skew that external consistency depends on.

This means that if one node's clock is set to more than 0.5 seconds in the future, all nodes that exchange messages with it will crash. The cluster will lose the quorum and become unavailable.
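The rule can be sketched in a few lines of shell. This is a simplified model, not the YugabyteDB implementation: the function names are hypothetical, all times are plain integer microseconds, and the logical component of the HLC is ignored.

```shell
MAX_CLOCK_SKEW_USEC=500000   # default --max_clock_skew_usec

# Prints the new HLC: the maximum of the local HLC, the remote HLC,
# and the local physical time.
hlc_update() {
  local local_hlc=$1 remote_hlc=$2 physical=$3 new
  new=$local_hlc
  if [ "$remote_hlc" -gt "$new" ]; then new=$remote_hlc; fi
  if [ "$physical" -gt "$new" ]; then new=$physical; fi
  echo "$new"
}

# Succeeds when the HLC is more than the maximum clock skew ahead of the
# local physical time, the condition on which the node crashes.
skew_out_of_range() {
  local hlc=$1 physical=$2
  [ $((hlc - physical)) -gt "$MAX_CLOCK_SKEW_USEC" ]
}

# Example: a healthy node at t = 1000 s receives a message from a node
# whose clock is 900 s in the future.
physical=1000000000
hlc=$(hlc_update "$physical" $((physical + 900000000)) "$physical")
if skew_out_of_range "$hlc" "$physical"; then
  echo "skew of $(( (hlc - physical) / 1000000 ))s is out of range: the node would crash"
fi
```

Because the maximum is taken on every message, a single node with a clock in the future is enough to push the HLC of every node it talks to out of range.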

Demo in a lab

Let's test this. I started two nodes, yb1 and yb2:




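# Create a network shared by all nodes
docker network create yb

# Start the first node, exposing the yb-master web console on port 7000
docker run -d --rm --network yb --hostname yb1 -p 7000:7000 yugabytedb/yugabyte yugabyted start --background=false

# Wait until the first node accepts PostgreSQL connections
until docker run -it --rm --network yb yugabytedb/yugabyte postgres/bin/pg_isready -h yb1 ; do sleep 1 ; done | uniq -c

# Start the second node and join it to the first one
docker run -d --rm --network yb --hostname yb2 yugabytedb/yugabyte yugabyted start --background=false --join yb1.yb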




As I did in the previous blog post, I started a third server with an artificial clock skew. Here, I set the clock 900 seconds in the future (15 minutes):



docker run -it --rm --network yb --hostname yb3 yugabytedb/yugabyte bash

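cat > fake_clock_gettime.c <<'C'
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>

/* Interposed clock_gettime: calls the real one, then shifts the result
   900 seconds (15 minutes) into the future, for every clock id. */
int clock_gettime(clockid_t clk_id, struct timespec *tp)
{
  static int (*origin_clock_gettime)(clockid_t, struct timespec *);
  if (!origin_clock_gettime) {
    /* RTLD_NEXT finds the next definition in load order: the libc one */
    origin_clock_gettime =
      (int (*)(clockid_t, struct timespec *)) dlsym(RTLD_NEXT, "clock_gettime");
  }
  int ret = origin_clock_gettime(clk_id, tp);
  if (ret == 0) tp->tv_sec += 900;
  return ret;
}
C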
dnf install -y gcc
gcc -o fake_clock_gettime.so -fPIC -shared fake_clock_gettime.c -ldl

LD_PRELOAD=$PWD/fake_clock_gettime.so yugabyted start --join yb1.yb




This node, yb3, with IP 192.168.48.4, has its physical time 15 minutes ahead:
Image description

The nodes are not yet synchronized in terms of Hybrid Logical Time. The actual clock time is 19:14:40, and only 192.168.48.4 is 15 minutes ahead. After a few minutes, this has changed:

Image description

At 19:19:21, the yb-master had not yet updated the problematic node's status. However, all other nodes have synchronized their Hybrid Logical Clock with it, so their HLC is now 15 minutes ahead of their physical time. When nodes have the same physical component in their HLC, the logical counter is incremented when they communicate.
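The interplay between the physical component and the logical counter can be sketched as follows. This is a simplified model with hypothetical function names. It assumes that the HLC is a single integer with the physical microseconds in the high bits and a 12-bit logical counter in the low bits, which is my understanding of the layout and should be treated as an assumption here.

```shell
LOGICAL_BITS=12   # assumption: width of the logical counter

# Packs physical microseconds ($1) and a logical counter ($2) into one integer.
hlc_pack() { echo $(( ($1 << $LOGICAL_BITS) | $2 )); }

# Extracts the physical and logical components of a packed HLC.
hlc_physical() { echo $(( $1 >> $LOGICAL_BITS )); }
hlc_logical() { echo $(( $1 & ((1 << $LOGICAL_BITS) - 1) )); }

# Advances an HLC given the physical time observed: jump to the physical
# time if it is ahead, otherwise keep the physical component and
# increment the logical counter.
hlc_advance() {
  local hlc=$1 physical=$2
  if [ "$physical" -gt "$(hlc_physical "$hlc")" ]; then
    hlc_pack "$physical" 0
  else
    echo $((hlc + 1))
  fi
}

# Example: the HLC was pushed to 1900 s while the physical clock reads 1000 s,
# so only the logical counter moves.
hlc=$(hlc_pack 1900000000 0)
hlc=$(hlc_advance "$hlc" 1000000000)
echo "physical=$(hlc_physical "$hlc") logical=$(hlc_logical "$hlc")"
```

In this model, as long as the physical clocks are behind the HLC, the physical component stays frozen at the value taken from the node in the future.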

Availability loss

The cluster seems to be available, with the majority of servers running. Let's attempt to create a table:



[root@yb3]# ysqlsh -h yb1 -c 'create table demo (a int)'
WARNING:  AbortTransaction while in ABORT state
ERROR:  Shutdown connection
ERROR:  Shutdown connection
ERROR:  Shutdown connection
PANIC:  ERRORDATA_STACK_SIZE exceeded
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost




It failed, and the creation is pending when we look at the list of tablets:
Image description

The nodes enter a crash loop, visible from the "time since heartbeat" repeatedly increasing and resetting:
Image description

Two parameters control this crash, and you should not change them because running with clock skew would compromise the consistency of transactions. fail_on_out_of_range_clock_skew=true enables crashing when the maximum clock skew is reached. Even if this one were disabled, clock_skew_force_crash_bound_usec set to 60 seconds initiates a crash, which we see in the logs:



E0511 19:15:10.818786   203 hybrid_clock.cc:181] Too big clock skew is detected: 900.000s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000



When yb3 is stopped, it might seem that the cluster is now available again since yb1 and yb2 can form a quorum and their physical clocks are in sync with the actual time. However, these two nodes still need to synchronize their Hybrid Logical Clocks, and they follow the same rule: if the HLC exceeds their physical time by more than the maximum clock skew, they crash.

Because the HLC is set to the maximum of the values exchanged, its physical component stays the same, with only the logical part being incremented, while the physical clocks keep moving forward. This crash loop continues for 15 minutes until the physical time reaches the HLC time.
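The duration of the crash loop follows from simple arithmetic: the physical clock has to catch up until the remaining skew is within the maximum clock skew. A minimal sketch, with a hypothetical function name and integer microseconds:

```shell
MAX_CLOCK_SKEW_USEC=500000   # default --max_clock_skew_usec

# Prints how many seconds the physical clock needs to catch up before the
# HLC is back within the maximum clock skew (0 if it already is).
seconds_until_recovery() {
  local skew_usec=$1 remaining
  remaining=$((skew_usec - MAX_CLOCK_SKEW_USEC))
  if [ "$remaining" -lt 0 ]; then remaining=0; fi
  # Round up to whole seconds.
  echo $(( (remaining + 999999) / 1000000 ))
}

# With the 900-second skew of this lab: 899.5 s, rounded up to 900.
seconds_until_recovery 900000000
```

The wait starts only once the node with the bad clock has stopped pushing the HLC forward, so the time observed in practice is at least this long.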

This can be seen in the log as the clock skew decreases while the clock advances:



[root@yb1 tserver]# grep hybrid_clock  yb-tserver.yb1.root.log.INFO.20240511-191314.99
E0511 19:15:10.818786   203 hybrid_clock.cc:181] Too big clock skew is detected: 900.000s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:11.902853   101 hybrid_clock.cc:181] Too big clock skew is detected: 899.473s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:12.976296   101 hybrid_clock.cc:181] Too big clock skew is detected: 899.897s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:14.172662   101 hybrid_clock.cc:181] Too big clock skew is detected: 899.614s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:15.285753   122 hybrid_clock.cc:181] Too big clock skew is detected: 899.790s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:16.285974   120 hybrid_clock.cc:181] Too big clock skew is detected: 899.608s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:17.286403   119 hybrid_clock.cc:181] Too big clock skew is detected: 899.934s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:18.286665   120 hybrid_clock.cc:181] Too big clock skew is detected: 899.936s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:19.286969   119 hybrid_clock.cc:181] Too big clock skew is detected: 899.989s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:15:20.287256   119 hybrid_clock.cc:181] Too big clock skew is detected: 899.982s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
...
E0511 19:35:15.593176 10353 hybrid_clock.cc:181] Too big clock skew is detected: 6.803s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
F0511 19:35:15.613420 10353 hybrid_clock.cc:177] Too big clock skew is detected: 6.793s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
F0511 19:35:15.634792 10372 hybrid_clock.cc:177] Too big clock skew is detected: 6.793s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
E0511 19:35:21.108741 10475 hybrid_clock.cc:181] Too big clock skew is detected: 1.287s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
F0511 19:35:21.127643 10491 hybrid_clock.cc:177] Too big clock skew is detected: 1.283s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000
F0511 19:35:21.150816 10475 hybrid_clock.cc:177] Too big clock skew is detected: 1.251s, while max allowed is: 0.500s; clock_skew_force_crash_bound_usec=60000000


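To follow the skew as it decreases without reading the full messages, the time and the skew value can be extracted from each line. This is a small sketch with a hypothetical function name. It relies only on the message format shown in the log above.

```shell
# Reads yb-tserver log lines on stdin and prints "HH:MM:SS skew_seconds"
# for every "Too big clock skew" message, whether logged as E or F.
skew_timeline() {
  sed -n 's/^[EF][0-9]* \([0-9:]*\)\.[0-9]* .*Too big clock skew is detected: \([0-9.]*\)s,.*/\1 \2/p'
}

# Typical use on the log file from this lab:
#   skew_timeline < yb-tserver.yb1.root.log.INFO.20240511-191314.99
```

Each output line pairs a wall-clock time with the skew reported at that time, which makes the slow convergence toward the 0.5-second limit easy to plot or eyeball.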

After 15 minutes, the physical clock is back within the maximum clock skew of the HLC, which had been pushed forward by mistake while 'yb3' was up with its clock in the future, and the cluster is available again:
Image description

The tablets have finally been created automatically.
Image description

If a new node is added, three additional tablet peers will be created. It took a long time to recover, but the situation I created in this lab, with a 15-minute clock drift, should not happen in real life.

In summary

In a distributed database, NTP synchronization is essential: it should be carefully monitored, and any failure fixed quickly. To tolerate some time drift, a maximum clock skew is set. It should be low enough for performance, to avoid too many read retries, and high enough for availability, to avoid node evictions caused by network errors. It is a good idea to check NTP synchronization when starting a YugabyteDB node, which will be implemented by issue 22255 (https://github.com/yugabyte/yugabyte-db/issues/22255). 500 milliseconds is a safe value for the maximum clock skew. You can lower it if your clocks are synchronized with atomic clocks.
