To achieve scalability, a distributed database must apply read and write operations without synchronizing with other cluster nodes. For example, in YugabyteDB, two transactions can be processed simultaneously by different nodes. Each node sends its read and write operations to other nodes based on the sharding keys of the table rows, index entries, or transaction status.
Although these operations occur independently, they must preserve the correct order of events. For instance, if a deposit is made to your account before a withdrawal, it's essential that the bank accurately reflects this order. Reading operations in the same order in which they were executed is crucial for maintaining consistency and preventing issues such as stale reads.
In monolithic databases or clusters sharing a low-latency private interconnect network, assigning a monotonically increasing number to sequence the database states is straightforward. Oracle Database, for example, uses the System Change Number (SCN) for this purpose. Some distributed databases, like TiDB, use a TimeStamp Oracle, which provides precise ordering but limits scalability, especially when deploying to multiple regions.
To achieve linear scalability and eliminate any single point of truth, YugabyteDB utilizes the wall clock, accessible in all systems. However, clock drift can occur, necessitating synchronization between the nodes. This synchronization is inherently imperfect, so the database must account for the uncertainty arising from the hardware and software involved in clock synchronization and the network used for this process. YugabyteDB adds a logical clock (Lamport) to physical time for node synchronization during message exchange. However, physical time remains crucial for global ordering.
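To make the idea of combining a Lamport-style logical counter with physical time more concrete, here is a minimal sketch of a generic hybrid logical clock. It is only an illustration under my own assumptions, not YugabyteDB's implementation: the names HybridTime, HybridClock, now and observe are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class HybridTime:
    """Timestamp ordered by physical time first, then by the logical counter."""
    physical_us: int  # wall-clock time in microseconds
    logical: int = 0  # Lamport-style counter that breaks ties


class HybridClock:
    """Hypothetical hybrid clock: never goes backward, even if the wall clock lags."""

    def __init__(self, wall_clock_us=lambda: time.time_ns() // 1000):
        self._wall_clock_us = wall_clock_us  # callable returning microseconds
        self._last = HybridTime(0, 0)

    def now(self) -> HybridTime:
        physical = self._wall_clock_us()
        if physical > self._last.physical_us:
            # The wall clock moved forward: adopt it and reset the counter.
            self._last = HybridTime(physical, 0)
        else:
            # The wall clock did not advance: bump the logical counter instead.
            self._last = HybridTime(self._last.physical_us, self._last.logical + 1)
        return self._last

    def observe(self, remote: HybridTime) -> HybridTime:
        """Merge a timestamp received in a message, then issue a newer one."""
        if remote > self._last:
            self._last = remote
        return self.now()
```

A node whose wall clock lags behind the sender still produces a timestamp greater than the one it received, which preserves the order of causally related events. Events that never exchange messages still depend on physical time for their ordering.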
In this blog post, we will explore the potential levels of time synchronization available in an Amazon EC2 instance so that we can run YugabyteDB with high performance.
Time Synchronization
Here is a description of the time synchronization levels that can be achieved in Amazon EC2 instances.
Network Time Protocol (NTP)
NTP is the default protocol for adjusting the internal electronic clock, which can drift. It synchronizes time with reference NTP servers that provide more accurate time using atomic clocks, either locally or via GPS signals. When the time depends on NTP synchronization alone, there is no guaranteed bound on clock skew because of the unpredictable network latency in the round trip to these servers.
YugabyteDB establishes a maximum allowable clock skew of 0.5 seconds (--max_clock_skew_usec=500000) to ensure safety. It treats the wall clock as reliable for comparing two timestamps only if the time difference exceeds this threshold. If any server detects a clock skew exceeding this limit, it will immediately crash to prevent inconsistencies.
This affects performance: if a read operation detects a modification made within 500 milliseconds of its transaction read-time, it cannot accurately determine whether it occurred earlier or later. As a result, the read operation will restart with a newer timestamp to ensure accuracy. Availability is also affected, as infrastructure failures affecting clock synchronization can render a node unavailable.
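The decision described above can be sketched as a small function. This is a simplified illustration under my assumptions, not YugabyteDB's code: classify_write and its return values are hypothetical names, and I assume the ambiguous window is the maximum clock skew following the read time.

```python
MAX_CLOCK_SKEW_US = 500_000  # matches --max_clock_skew_usec=500000


def classify_write(read_time_us: int, write_time_us: int,
                   max_skew_us: int = MAX_CLOCK_SKEW_US) -> str:
    """Decide how a read treats a write, given the possible clock skew.

    Returns one of:
      "visible" : the write is at or before the read time
      "restart" : the write falls within the uncertainty window, so the
                  read must restart with a newer timestamp
      "ignore"  : the write is certainly after the read
    """
    if write_time_us <= read_time_us:
        return "visible"
    if write_time_us <= read_time_us + max_skew_us:
        return "restart"
    return "ignore"
```

With a 500ms window, a write stamped 200ms after the read time forces a restart. With a 100-microsecond window, the same write is simply ignored, which is why a tighter bound on clock error reduces restarts.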
AWS Time Sync Service
Amazon uses atomic clocks and GPS signals to provide more precise time in each region. This time is available through NTP to any instance in the VPC, accessible through the 169.254.169.123 address. It is configured by default in EC2 instances, and you can verify it with chronyc.
I've run this in a simple cloud shell:
[root@ip-10-134-32-254 sysconfig]# sudo yum install chrony
Last metadata expiration check: 0:08:43 ago on Mon 18 Nov 2024 07:55:53 AM UTC.
Package chrony-4.3-1.amzn2023.0.4.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
[root@ip-10-134-32-254 sysconfig]# chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 169.254.169.123 3 4 377 5 +95us[ +97us] +/- 714us
^- ec2-35-176-149-124.eu-we> 4 6 377 7 +297us[ +299us] +/- 6363us
^- ec2-3-8-121-220.eu-west-> 4 6 377 5 -49us[ -47us] +/- 5324us
^- ec2-13-40-182-125.eu-wes> 4 6 377 6 -394us[ -392us] +/- 5660us
^- ec2-18-133-139-197.eu-we> 4 6 377 6 +69us[ +71us] +/- 6213us
The Amazon Time Sync NTP servers are also available to the public internet through time.aws.com.
Amazon Time Sync provides sub-millisecond clock accuracy. If the service fails to provide a bound, the system reports an error, and YugabyteDB nodes crash when such an error is reported. However, without a stronger guarantee of precision, YugabyteDB still recommends keeping the maximum clock skew at 500ms to tolerate clock failures safely.
AWS Time Sync Service with Precision Hardware Clock (PHC)
At re:Invent 2023, Amazon Web Services announced the addition of a Precision Hardware Clock (PHC) to the hypervisor (Nitro) to provide a more accurate time source with Time Sync Service. On instances that support PHC, NTP transparently uses this clock to obtain sub-millisecond accuracy, typically under 100 microseconds.
To test it, I use a recent instance type (C7i, M7i, R7i, C7a, M7a, R7a, M7g) from a region supporting PHC. I'm using m7i.large in us-east-1 and checking the Time Sync precision through NTP:
[ec2-user@ip-172-31-12-153 ~]$ chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^- ip212-227-240-160.pbiaas> 2 6 17 3 -161us[ -166us] +/- 33ms
^- vps-4e90522b.vps.ovh.us 2 6 17 2 -195us[ -195us] +/- 68ms
^- time2.tritan.host 2 6 17 2 +83us[ +83us] +/- 39ms
^- clock.xmission.com 1 6 17 2 +4734us[+4734us] +/- 36ms
^* 169.254.169.123 1 4 17 2 -4952ns[-9826ns] +/- 98us
[ec2-user@ip-172-31-12-153 ~]$ chronyc tracking | awk '$0~re{$0=$0gensub(re," (0.\\1 m\\2)",1)}{print}' re='.*[^0-9]0[.]000([0-9]+) (s)econds.*'
Reference ID : A9FEA97B (169.254.169.123)
Stratum : 2
Ref time (UTC) : Mon Nov 18 08:46:53 2024
System time : 0.000000004 seconds fast of NTP time (0.000004 ms)
Last offset : -0.000004874 seconds (0.004874 ms)
RMS offset : 0.000004874 seconds (0.004874 ms)
Frequency : 5.349 ppm fast
Residual freq : +0.018 ppm
Skew : 8.524 ppm
Root delay : 0.000157363 seconds (0.157363 ms)
Root dispersion : 0.000105599 seconds (0.105599 ms)
Update interval : 1.4 seconds
Leap status : Normal
This is automatic in an EC2 instance, provided it is recent hardware. Chrony can synchronize the system clock with PHC through NTP.
AWS Time Sync Service with Precision Time Protocol (PTP)
Chrony can synchronize the clock using the Precision Time Protocol (PTP) to achieve greater precision, typically under 40 microseconds. To do so, you need to install the latest Elastic Network Adapter (ENA) driver.
Then chrony must be configured to use PHC. This is set by adding refclock PHC /dev/ptp0 poll 0 delay 0.000010 prefer in /etc/chrony.conf (poll 0 polls the clock every second, as the interval is 2^0 seconds, and delay 0.000010 is an assumed delay of 10 microseconds to account for network or processing latency).
On AlmaLinux OS 8, I use the scripts provided by YugabyteDB to install ENA drivers and get NTP accessing the PTP Hardware Clock:
curl -Ls https://raw.githubusercontent.com/yugabyte/yugabyte-db/refs/heads/master/bin/configure_ptp.sh |
sudo bash -x
The PHC source is now available with a higher precision:
[ec2-user@ip-172-31-12-153 ~]$ chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0 0 0 373 0 +401ns[ +426ns] +/- 5030ns
^- triton.ellipse.net 2 6 37 10 -5061us[-5061us] +/- 31ms
^- zero.txryan.com 2 6 37 10 -2396ns[-2283ns] +/- 1391us
^- server.slakjd.com 3 6 37 9 -223us[ -223us] +/- 18ms
^- 44.190.5.123 2 6 37 9 +188us[ +188us] +/- 37ms
^- 169.254.169.123 1 4 377 9 +147us[ +147us] +/- 227us
[ec2-user@ip-172-31-12-153 ~]$ chronyc tracking | awk '$0~re{$0=$0gensub(re," (0.\\1 m\\2)",1)}{print}' re='.*[^0-9]0[.]000([0-9]+) (s)econds.*'
Reference ID : 50484330 (PHC0)
Stratum : 1
Ref time (UTC) : Mon Nov 18 08:53:47 2024
System time : 0.000000035 seconds fast of NTP time (0.000035 ms)
Last offset : +0.000000027 seconds (0.000027 ms)
RMS offset : 0.000000090 seconds (0.000090 ms)
Frequency : 5.013 ppm fast
Residual freq : +0.001 ppm
Skew : 0.028 ppm
Root delay : 0.000010000 seconds (0.010000 ms)
Root dispersion : 0.000001070 seconds (0.001070 ms)
Update interval : 1.0 seconds
Leap status : Normal
[ec2-user@ip-172-31-12-153 ~]$ sudo cat /sys/devices/pci0000:24/0000:24:00.0/0000:25:00.0/0000:26:00.0/0000:27:00.0/phc_error_bound
14023
AWS Time Sync Service with Precision Time Protocol (PTP) and ClockBound
When the consistency of your transactions depends on it, precise time alone is not enough: we need to guarantee the ordering of events. This requires knowing the maximum possible error, with a 100% guarantee. The calculation of the maximum clock error bound must include the error bound calculated by the hypervisor (Nitro) and exposed by phc_error_bound.
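As a rough illustration of that calculation, here is a sketch that combines the values reported by chronyc tracking with phc_error_bound. I am assuming the formula AWS documents for chrony (system time offset plus half the root delay plus the root dispersion) and that phc_error_bound is expressed in nanoseconds. The function name is hypothetical, and the real ClockBound daemon may account for additional terms, such as drift since the last update, so the interval it reports can be wider.

```python
def clock_error_bound_ns(system_time_offset_ns: int,
                         root_delay_ns: int,
                         root_dispersion_ns: int,
                         phc_error_bound_ns: int) -> int:
    """Estimate the maximum clock error in nanoseconds (simplified sketch).

    Assumed formula: |system time offset| + root delay / 2 + root dispersion,
    plus the error bound of the hardware clock itself.
    """
    chrony_bound_ns = (abs(system_time_offset_ns)
                       + root_delay_ns // 2
                       + root_dispersion_ns)
    return chrony_bound_ns + phc_error_bound_ns


# Values taken from the chronyc tracking and phc_error_bound output above.
bound = clock_error_bound_ns(
    system_time_offset_ns=35,   # System time     : 0.000000035 seconds
    root_delay_ns=10_000,       # Root delay      : 0.000010000 seconds
    root_dispersion_ns=1_070,   # Root dispersion : 0.000001070 seconds
    phc_error_bound_ns=14_023,  # phc_error_bound : 14023
)
print(f"clock error bound: {bound} ns")
```

With these inputs, the hardware clock's own error bound is the largest term, which is why it cannot be left out of the calculation.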
I install and configure the clock bound service using the scripts for YugabyteDB:
curl -Ls https://raw.githubusercontent.com/yugabyte/yugabyte-db/refs/heads/master/bin/configure_clockbound.sh |
sudo bash -x
As I am in a lab, I can test it with the C example from the ClockBound repo:
git clone https://github.com/aws/clock-bound.git
# Build the ClockBound Foreign Function Interface libraries
cd clock-bound/clock-bound-ffi
curl https://sh.rustup.rs -sSf | sh <<<""
. "$HOME/.cargo/env"
cargo build --release
sudo cp include/clockbound.h /usr/include/
sudo cp ../target/release/libclockbound.a /usr/lib
sudo cp ../target/release/libclockbound.so /usr/lib
cd -
# compile and run the example
cd clock-bound/examples/c/src
gcc clockbound_now.c -o clockbound_now -I/usr/include -L/usr/lib -lclockbound
LD_LIBRARY_PATH=/usr/lib ./clockbound_now ; echo
Here is the output, showing the clock bound as well as the performance of the calls:
[ec2-user@ip-172-31-12-153 src]$ LD_LIBRARY_PATH=/usr/lib ./clockbound_now
When clockbound_now was called true time was somewhere within 1731939946.194796388 and 1731939946.194930750 seconds since Jan 1 1970. The clock status is SYNCHRONIZED.
It took 7.067347211 seconds to call clock bound 100000000 times (14149580 tps).
[ec2-user@ip-172-31-12-153 src]$ LD_LIBRARY_PATH=/usr/lib ./clockbound_now ; echo
When clockbound_now was called true time was somewhere within 1731939963.817817253 and 1731939963.817913367 seconds since Jan 1 1970. The clock status is SYNCHRONIZED.
It took 7.102227815 seconds to call clock bound 100000000 times (14080089 tps).
This establishes true time as an interval with a lower and an upper bound. A distributed database can use it to compare read and write times with a minimal uncertainty window, which keeps read retries to a minimum.
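Here is a minimal sketch of how two such intervals can be compared. The TimeInterval class and the compare function are hypothetical names of mine, not part of the ClockBound API. The values are the two intervals printed by clockbound_now above, converted to nanoseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """True time lies somewhere between earliest_ns and latest_ns."""
    earliest_ns: int
    latest_ns: int

    @property
    def width_ns(self) -> int:
        return self.latest_ns - self.earliest_ns


def compare(a: TimeInterval, b: TimeInterval) -> str:
    """Order two events only when their intervals do not overlap."""
    if a.latest_ns < b.earliest_ns:
        return "before"
    if b.latest_ns < a.earliest_ns:
        return "after"
    return "uncertain"  # overlapping intervals: the order cannot be proven


first = TimeInterval(1_731_939_946_194_796_388, 1_731_939_946_194_930_750)
second = TimeInterval(1_731_939_963_817_817_253, 1_731_939_963_817_913_367)

print(compare(first, second))  # the two calls were about 17 seconds apart
print(first.width_ns, second.width_ns)
```

The narrower the intervals, the less often two events end up in the uncertain case, which is the case where a database has to retry or wait.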
Distributed SQL Database
Once ClockBound and PHC are available, YugabyteDB can use them to replace the conservative 500ms default maximum clock skew. I can start YugabyteDB with yugabyted start --enhance_time_sync_via_clockbound, which defines the time source as ClockBound instead of the wall clock.
curl -Ls https://downloads.yugabyte.com/releases/2.23.1.0/yugabyte-2.23.1.0-b220-linux-x86_64.tar.gz | tar xzvf -
sudo dnf install -y python39
yugabyte-2.23.1.0/bin/yugabyted start --enhance_time_sync_via_clockbound --advertise_address=$(hostname)
It detects the correct configuration of ClockBound.
I use the psql shipped with YugabyteDB (ysqlsh) to run PostgreSQL statements:
PGHOST=$(hostname) ./yugabyte-2.23.1.0/bin/ysqlsh
Here is what I've used to test the benefits of using ClockBound as a time source. The following creates a table with one row, and a background connection updates it continuously. I disable the query layer retries (set yb_max_query_layer_retries=0; it is set to 60 by default) and run a query to read the table:
\c
drop table if exists demo;
create table demo ( id bigserial primary key, value int, ts timestamptz default clock_timestamp() );
insert into demo(value) values (1);
\! sleep 1 ; echo -e 'update demo set value=value+1\n\\watch 0.001' | timeout 60 ./yugabyte-2.23.1.0/bin/ysqlsh >/dev/null & sleep 1
set yb_max_query_layer_retries=0;
prepare q as select * from demo;
explain analyze execute q;
\watch 0.0001
If you run this with the wall clock as a time source, YugabyteDB uses a conservative 500ms maximum clock skew. As my query reads a state that was updated less than 500ms ago, and retries are disabled, it gets the following error:
ERROR: Restart read required at: { read: { physical: 1731940780361106 } local_limit: { physical: 1731940780361106 }
global_limit: <min> in_txn_limit: <max> serial_no: 0 } (yb_max_query_layer_retries set to 0 are exhausted)
With the default 60 retries, the probability of such an error is much lower, but the latency increases with retries.
When you run this with ClockBound as a time source, the query does not fail in this test, even with no retries. The reason is that the probability of an update made within the roughly 100 microseconds before the read is very low.
Note that it is not recommended to disable query layer retries. I did that to demonstrate the benefit of running a distributed SQL database with precise time. Just keep the default, and the benefit will be visible in the performance during the high concurrency peaks.
Conclusion
Thanks to Precision Hardware Clocks, the nodes of a distributed database can operate with a synchronized clock without incurring the costs associated with network synchronization. ClockBound ensures that clock error is kept within a guaranteed precision that does not negatively affect performance.
One significant advantage of a public cloud infrastructure is its ability to offer easily provisioned hardware that would be nearly impossible to have in an on-premises data center.
AWS Time Sync Service and Amazon EC2 utilize both hardware and software solutions to facilitate the use of Precision Hardware Clocks (PHC) through Network Time Protocol (NTP) or Precision Time Protocol (PTP). Databases like Aurora Limitless and YugabyteDB can leverage this technology to maintain consistency at high-performance levels.