I like to reproduce a YugabyteDB deployment on a simple lab in my laptop because it is easier, and cheaper, to reproduce and analyze. However, when it comes to simulating a multi-region cluster, I need to introduce some relevant latencies in network calls. Here I'm showing how to do this with tc
(Traffic Control in the Linux kernel) from the iproute-tc
package in RedHat (or iproute2
in Debian).
Here is an example I used when I was looking at a specific deployment with all leaders in one site (DC1) and all followers in another (DC2). Please note that I publish it to show the method, but this is not a recommended deployment. Whatever the technology, synchronous replication between two data centers compromises the high availability because the network between them is a single point of failure. Combining HA (High Availability) and no-data-loss DR (Disaster Recovery) requires at least 3 Data Centers.
Starting a 3 nodes cluster
I start 3 containers, the minimum for a replication factor RF=3 YugabyteDB cluster, in two regions: DC1 and DC2. I install iproute-tc
on each of them and run the container with --cap-add NET_ADMIN
to get the necessary privileges to interact with the network stack:
docker network create yb
docker rm -f yb{0..2}
zone="dc1" # first node will be in DC1
for i in {0..2}
do
docker run --cap-add NET_ADMIN -d --name yb$i --hostname yb$i --net=yb -p543$i:5433 -p700$i:7000 yugabytedb/yugabyte:latest bash -c '
yugabyted start --listen $(hostname -i) --cloud_location lab.'$zone'.yb'$i' $(echo "--join yb0" | grep -v " $(hostname)$") --daemon=false
'
docker exec -it yb$i dnf -y install iproute-tc
zone="dc2" # next nodes will be in DC2
until docker exec -it yb$i postgres/bin/pg_isready -h yb$i ; do sleep 1 ; done
done
# set the data placement because `yugabyted` sets it to cloud1.datacenter1.rack1 and without a valid configuration the committed transactions will be lost (https://github.com/yugabyte/yugabyte-db/issues/16402)
docker exec -it yb0 yugabyted configure data_placement --constraint_value=lab.dc1.yb0,lab.dc2.yb1,lab.dc2.yb2
This is a RF=3 cluster over 3 zones that is resilient to one zone failure. In addition to that, to reduce the latency of cross-table transactions, I'll place all tablet leaders in one zone (Lab.DC1.yb0
) where the application will connect to. I'll do that later with tablespaces. However, even if tablespaces will control the placement of tables and indexes, you still need a valid configuration at cluster level, so don't forget the configure data_placement
above or you will encounter Issue #16402.
Here is my setup:
Docker network IP address | Container Hostname | Cloud.Region.Zone placement | placement preference |
---|---|---|---|
172.24.0.2 | yb0 | Lab.DC1.yb0 | leader |
172.24.0.3 | yb1 | Lab.DC2.yb1 | follower |
172.24.0.4 | yb2 | Lab.DC2.yb2 | follower |
Note that the screenshots have case sensitive names for Cloud/Region/Zone but I've actually used lowercase because of Issue #16401.
I can check the configuration from the port 7000 webconsole, the yb-master
leader is in yb0:
In 🔧Utilities > 🕛TServer Clocks I can check the latency from the yb-master
leader in yb0 to each yb-tserver
:
With Docker running on a single server, the latency is less than 1 millisecond.
Default tablespace with Leader preference
When testing in a lab, I want to set the exact configuration I want to investigate without leaving room to unpredictable configuration. Here I want to have all tablet leaders in DC1 and the two followers in DC2.
I set that with a tablespace data placement:
docker exec -i yb0 ysqlsh -h yb0 <<'SQL'
create tablespace twodc with ( replica_placement=$placement$
{
"num_replicas": 3,
"placement_blocks": [{
"cloud": "lab", "region": "dc1","zone": "yb0",
"min_num_replicas": 1,"leader_preference": 1
},{
"cloud": "lab", "region": "dc2", "zone": "yb1",
"min_num_replicas": 1
},{
"cloud": "lab", "region": "dc2", "zone": "yb2",
"min_num_replicas": 1
}
]}$placement$);
alter database yugabyte set default_tablespace = twodc;
SQL
I have set the tablespace as the default for this database.
PgBench tables
I create the pgbench
tables (called ysql_bench
is in the YugabyteDB fork of PostgreSQL):
docker exec -i yb0 postgres/bin/ysql_bench -i -h yb0
The tables are created with the default cluster placement but the load balancer quickly triggers the leader elections to fulfull the placement preferences. For one table, I can check that the 3 tablets has all leaders in DC1 (yb0
) and followers in DC2 (yb1
and yb2
):
I can check the same for all user tablets. PgBench has 4 tables, with the default of 3 tablets for a 3 nodes cluster:
PgBench TCP-B (sort of)
With low network latency, as all is running on Docker, pg_bench
has a latency average of 10 milliseconds per transaction:
docker exec -i yb0 postgres/bin/ysql_bench -T 10 -h yb0 -rn
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 847
maximum number of tries: 1
latency average = 11.815 ms
tps = 84.637063 (including connections establishing)
tps = 84.807832 (excluding connections establishing)
statement latencies in milliseconds:
0.015 \set aid random(1, 100000 * :scale)
0.004 \set bid random(1, 1 * :scale)
0.002 \set tid random(1, 10 * :scale)
0.002 \set delta random(-5000, 5000)
0.392 BEGIN;
2.694 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.090 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
1.837 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.861 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.767 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
2.127 END;
The -r
or --report-latencies
show the response time for each statement.
Adding latency with tc
I will simulate a 50ms latency when sending data from DC1 to DC2 by queuing them for 50 milliseconds on yb0
side:
docker exec -it yb0 tc qdisc add dev eth0 root netem delay 50ms
qdisc
is the "queueing discipline" scheduler where I add
a rule on the network interface eth0
with a 50ms delay
in the network emulator (netem
) egress (root
).
I can verify the expected latency with the heartbeats from the yb-master
leader (in yb0
) to each yb-tserver
:
The yb-master
leader is in yb0 in DC1 with less than 1ms latency to the yb-tserver
on the same container. Latency to the others, in DC2, is 50ms as it goes through the Traffic Control. Note that in a multi-region deployment a 50ms latency, due to the distance, is on both way and would observe a 100ms Round Trip Time (RTT). Here I throttled only the outgoing traffic (egress) as it is sufficient for my test, so it is equivalent to a 25ms latenc
PgBench TCP-B (sort of) with 50ms in quorum writes
Now, with the 50ms latency added with tc
, all write operations to the leaders in DC1 have to wait for the acknowledgment of one follower in DC2:
docker exec -i yb0 postgres/bin/ysql_bench -T 10 -h yb0 -rn
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 33
maximum number of tries: 1
latency average = 304.913 ms
tps = 3.279622 (including connections establishing)
tps = 3.286555 (excluding connections establishing)
statement latencies in milliseconds:
0.014 \set aid random(1, 100000 * :scale)
0.002 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.195 BEGIN;
69.594 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.219 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
64.238 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
52.679 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
53.052 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
63.271 END;
Each statement that writes (UPDATE, INSERT, and END which is a COMMIT) see an increased response time of 50 seconds, which brings the total transaction to more than 250ms. This is the current behavior (I'm testing this with YugabyteDB 2.17.1) where multiple rows can be batched into one write operation, but each write DML statement has to be acknowledged by the quorum.
PostgreSQL synchronous commit
I order to compare, I used the same technique to test PostgreSQL with physical standby in synchronous mode. I've run the same as in Testing Patroni strict synchronous mode but adding RUN apt-get update ; apt-get -y install iproute2 ; chmod a+s /sbin/tc
in Dockerfile and cap_add: - NET_ADMIN
so that I can easily docker exec -ti demo-patroni1 tc qdisc add dev eth0 root netem delay 50ms
. Here is the result of PgBench:
pgbench -T 10 -rn
pgbench (15.2 (Debian 15.2-1.pgdg110+1))
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 189
number of failed transactions: 0 (0.000%)
latency average = 53.001 ms
initial connection time = 2.701 ms
tps = 18.867653 (without initial connection time)
statement latencies in milliseconds and failures:
0.004 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.001 0 \set tid random(1, 10 * :scale)
0.001 0 \set delta random(-5000, 5000)
0.095 0 BEGIN;
0.311 0 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.151 0 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.154 0 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.138 0 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.118 0 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
52.023 0 END;
With monolithic databases, the Write Ahead Logging (WAL) is a single stream of sequential transaction log (I'm skipping the multiple threads in Oracle RAC here for simplicity). It is sufficient to wait at COMMIT to guarantee that all previous changes are replicated. The log shipping can be asynchronous as long as the one containing the log record is synchronous, piggybacking the previous changes.
Distributed SQL
As uncommitted changes will never have to be recovered, the same synchronous at commit only could work with Distributed SQL. However, to be scalable, each YugabyteDB tablet is its own Raft group, with its own replication. That explains why each write operation is synchronous even for provisional records. In Raft literature, the term Commit refers to each write batch. In SQL databases, Commit is one additional write with the transaction status to make it durable and visible by other sessions.
Note that this (waiting on quorum for each statement) may be improved in future versions of YugabyteDB. In a single region the latency is low enough to be acceptable, and waiting on each batch write operation is still better than block storage replication as you can find in AWS RDS Multi-AZ. In a multi-region deployment, two regions should be close enough, and one can be stretched further. For example with AWS regions in US with California - Oregon - Ohio, in Asia with Tokyo - Osaka - Seoul, or in Europe with Paris Frankfurt - Milan. This can also work with two on-premises data centers and one additional zone in a public cloud.