Simulate network latency in a YugabyteDB cluster, on a Docker lab

Franck Pachot - Mar 12 '23 - - Dev Community

I like to reproduce a YugabyteDB deployment on a simple lab in my laptop because it is easier, and cheaper, to reproduce and analyze. However, when it comes to simulating a multi-region cluster, I need to introduce some relevant latencies in network calls. Here I'm showing how to do this with tc (Traffic Control in the Linux kernel) from the iproute-tc package in RedHat (or iproute2 in Debian).

Here is an example I used when I was looking at a specific deployment with all leaders in one site (DC1) and all followers in another (DC2). Please note that I publish it to show the method, but this is not a recommended deployment. Whatever the technology, synchronous replication between two data centers compromises the high availability because the network between them is a single point of failure. Combining HA (High Availability) and no-data-loss DR (Disaster Recovery) requires at least 3 Data Centers.

Starting a 3 nodes cluster

I start 3 containers, the minimum for a replication factor RF=3 YugabyteDB cluster, in two regions: DC1 and DC2. I install iproute-tc on each of them and run the container with --cap-add NET_ADMIN to get the necessary privileges to interact with the network stack:



docker network create yb
docker rm -f yb{0..2}
zone="dc1" # first node will be in DC1
for i in {0..2}
do
docker run --cap-add NET_ADMIN -d --name yb$i --hostname yb$i --net=yb -p543$i:5433 -p700$i:7000 yugabytedb/yugabyte:latest bash -c '
yugabyted start --listen $(hostname -i) --cloud_location lab.'$zone'.yb'$i' $(echo "--join yb0" | grep -v " $(hostname)$") --daemon=false 
'
docker exec -it yb$i dnf -y install iproute-tc
zone="dc2" # next nodes will be in DC2
until docker exec -it yb$i postgres/bin/pg_isready -h yb$i ; do sleep 1 ; done
done

# set the data placement because `yugabyted` sets it to cloud1.datacenter1.rack1 and without a valid configuration the committed transactions will be lost (https://github.com/yugabyte/yugabyte-db/issues/16402)
docker exec -it yb0 yugabyted configure data_placement --constraint_value=lab.dc1.yb0,lab.dc2.yb1,lab.dc2.yb2



Enter fullscreen mode Exit fullscreen mode

This is a RF=3 cluster over 3 zones that is resilient to one zone failure. In addition to that, to reduce the latency of cross-table transactions, I'll place all tablet leaders in one zone (Lab.DC1.yb0) where the application will connect to. I'll do that later with tablespaces. However, even if tablespaces will control the placement of tables and indexes, you still need a valid configuration at cluster level, so don't forget the configure data_placement above or you will encounter Issue #16402.

Here is my setup:

Docker network IP address Container Hostname Cloud.Region.Zone placement placement preference
172.24.0.2 yb0 Lab.DC1.yb0 leader
172.24.0.3 yb1 Lab.DC2.yb1 follower
172.24.0.4 yb2 Lab.DC2.yb2 follower

Note that the screenshots have case sensitive names for Cloud/Region/Zone but I've actually used lowercase because of Issue #16401.

I can check the configuration from the port 7000 webconsole, the yb-master leader is in yb0:
Image description

In 🔧Utilities > 🕛TServer Clocks I can check the latency from the yb-master leader in yb0 to each yb-tserver:
Image description

With Docker running on a single server, the latency is less than 1 millisecond.

Default tablespace with Leader preference

When testing in a lab, I want to set the exact configuration I want to investigate without leaving room to unpredictable configuration. Here I want to have all tablet leaders in DC1 and the two followers in DC2.

I set that with a tablespace data placement:



docker exec -i yb0 ysqlsh -h yb0 <<'SQL'
create tablespace twodc with ( replica_placement=$placement$
{
 "num_replicas": 3,
 "placement_blocks": [{
   "cloud": "lab", "region": "dc1","zone": "yb0",
   "min_num_replicas": 1,"leader_preference": 1
  },{
   "cloud": "lab", "region": "dc2", "zone": "yb1",
   "min_num_replicas": 1
  },{
   "cloud": "lab", "region": "dc2", "zone": "yb2",
   "min_num_replicas": 1
}
]}$placement$);

alter database yugabyte set default_tablespace = twodc;

SQL


Enter fullscreen mode Exit fullscreen mode

I have set the tablespace as the default for this database.

PgBench tables

I create the pgbench tables (called ysql_bench is in the YugabyteDB fork of PostgreSQL):



docker exec -i yb0 postgres/bin/ysql_bench -i -h yb0



Enter fullscreen mode Exit fullscreen mode

The tables are created with the default cluster placement but the load balancer quickly triggers the leader elections to fulfull the placement preferences. For one table, I can check that the 3 tablets has all leaders in DC1 (yb0) and followers in DC2 (yb1 and yb2):

Image description

I can check the same for all user tablets. PgBench has 4 tables, with the default of 3 tablets for a 3 nodes cluster:
Image description

PgBench TCP-B (sort of)

With low network latency, as all is running on Docker, pg_bench has a latency average of 10 milliseconds per transaction:




docker exec -i yb0 postgres/bin/ysql_bench -T 10 -h yb0 -rn

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 847
maximum number of tries: 1
latency average = 11.815 ms
tps = 84.637063 (including connections establishing)
tps = 84.807832 (excluding connections establishing)
statement latencies in milliseconds:
         0.015  \set aid random(1, 100000 * :scale)
         0.004  \set bid random(1, 1 * :scale)
         0.002  \set tid random(1, 10 * :scale)
         0.002  \set delta random(-5000, 5000)
         0.392  BEGIN;
         2.694  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         1.090  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.837  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.861  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.767  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         2.127  END;



Enter fullscreen mode Exit fullscreen mode

The -r or --report-latencies show the response time for each statement.

Adding latency with tc

I will simulate a 50ms latency when sending data from DC1 to DC2 by queuing them for 50 milliseconds on yb0 side:



docker exec -it yb0 tc qdisc add dev eth0 root netem delay 50ms



Enter fullscreen mode Exit fullscreen mode

qdisc is the "queueing discipline" scheduler where I adda rule on the network interface eth0 with a 50ms delay in the network emulator (netem) egress (root).

I can verify the expected latency with the heartbeats from the yb-masterleader (in yb0) to each yb-tserver:

Image description

The yb-master leader is in yb0 in DC1 with less than 1ms latency to the yb-tserver on the same container. Latency to the others, in DC2, is 50ms as it goes through the Traffic Control. Note that in a multi-region deployment a 50ms latency, due to the distance, is on both way and would observe a 100ms Round Trip Time (RTT). Here I throttled only the outgoing traffic (egress) as it is sufficient for my test, so it is equivalent to a 25ms latenc

PgBench TCP-B (sort of) with 50ms in quorum writes

Now, with the 50ms latency added with tc, all write operations to the leaders in DC1 have to wait for the acknowledgment of one follower in DC2:




docker exec -i yb0 postgres/bin/ysql_bench -T 10 -h yb0 -rn

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 33
maximum number of tries: 1
latency average = 304.913 ms
tps = 3.279622 (including connections establishing)
tps = 3.286555 (excluding connections establishing)
statement latencies in milliseconds:
         0.014  \set aid random(1, 100000 * :scale)
         0.002  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.195  BEGIN;
        69.594  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         1.219  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
        64.238  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
        52.679  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
        53.052  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        63.271  END;



Enter fullscreen mode Exit fullscreen mode

Each statement that writes (UPDATE, INSERT, and END which is a COMMIT) see an increased response time of 50 seconds, which brings the total transaction to more than 250ms. This is the current behavior (I'm testing this with YugabyteDB 2.17.1) where multiple rows can be batched into one write operation, but each write DML statement has to be acknowledged by the quorum.

PostgreSQL synchronous commit

I order to compare, I used the same technique to test PostgreSQL with physical standby in synchronous mode. I've run the same as in Testing Patroni strict synchronous mode but adding RUN apt-get update ; apt-get -y install iproute2 ; chmod a+s /sbin/tc in Dockerfile and cap_add: - NET_ADMIN so that I can easily docker exec -ti demo-patroni1 tc qdisc add dev eth0 root netem delay 50ms. Here is the result of PgBench:



pgbench -T 10 -rn

pgbench (15.2 (Debian 15.2-1.pgdg110+1))
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 189
number of failed transactions: 0 (0.000%)
latency average = 53.001 ms
initial connection time = 2.701 ms
tps = 18.867653 (without initial connection time)
statement latencies in milliseconds and failures:
         0.004           0  \set aid random(1, 100000 * :scale)
         0.001           0  \set bid random(1, 1 * :scale)
         0.001           0  \set tid random(1, 10 * :scale)
         0.001           0  \set delta random(-5000, 5000)
         0.095           0  BEGIN;
         0.311           0  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.151           0  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.154           0  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.138           0  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.118           0  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        52.023           0  END;


Enter fullscreen mode Exit fullscreen mode

With monolithic databases, the Write Ahead Logging (WAL) is a single stream of sequential transaction log (I'm skipping the multiple threads in Oracle RAC here for simplicity). It is sufficient to wait at COMMIT to guarantee that all previous changes are replicated. The log shipping can be asynchronous as long as the one containing the log record is synchronous, piggybacking the previous changes.

Distributed SQL

As uncommitted changes will never have to be recovered, the same synchronous at commit only could work with Distributed SQL. However, to be scalable, each YugabyteDB tablet is its own Raft group, with its own replication. That explains why each write operation is synchronous even for provisional records. In Raft literature, the term Commit refers to each write batch. In SQL databases, Commit is one additional write with the transaction status to make it durable and visible by other sessions.

Note that this (waiting on quorum for each statement) may be improved in future versions of YugabyteDB. In a single region the latency is low enough to be acceptable, and waiting on each batch write operation is still better than block storage replication as you can find in AWS RDS Multi-AZ. In a multi-region deployment, two regions should be close enough, and one can be stretched further. For example with AWS regions in US with California - Oregon - Ohio, in Asia with Tokyo - Osaka - Seoul, or in Europe with Paris Frankfurt - Milan. This can also work with two on-premises data centers and one additional zone in a public cloud.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .