Going to the cloud gives the impression that you don't have to understand the concepts ("how it works") because the cloud provider takes care of all the details, such as the high availability of a cloud service, a database for example. However, the high availability of the database doesn't always guarantee the high availability of your application, or the consistency of data across failures. Where the public cloud makes things more difficult is that you cannot test the failure scenarios. On your premises, you can test the database failover by switching off the whole datacenter during a downtime window (I did it in the past, and you would be surprised to see what can go wrong when the servers and routers do not stop in the sequence you expected). The good thing is that there are ways to reproduce issues in a lab with containers and simulate failures.
In this series about High Availability with traditional databases (the supposedly well-known primary/standby model) I will test PostgreSQL with Patroni during a zone failure. For this, each zone will be a set of containers (PostgreSQL and etcd here for Patroni). The important point is that I'm not testing for a server failure, but for a network partition where all servers are up but communication is lost between zones.
Start the 3-zone configuration
The following is similar to the previous post, where I set up a 3-node Patroni cluster with patroni1 as the primary and two standby databases, with synchronous commit:
git clone https://github.com/zalando/patroni.git
cd patroni
docker build -t patroni .
docker-compose up -d
sleep 30
docker exec -i demo-patroni1 patronictl edit-config --apply - --force <<'JSON'
{
"synchronous_mode": "on",
"synchronous_mode_strict": "on",
"postgresql":
{
"parameters":{
"synchronous_commit": "on",
"synchronous_standby_names": "*"
}
}
}
JSON
docker exec -it demo-patroni2 patronictl show-config
sleep 30
docker exec -it demo-patroni1 patronictl switchover --candidate patroni2 --force
sleep 30
docker exec -it demo-patroni1 patronictl switchover --candidate patroni1 --force
sleep 30
docker exec -ti demo-patroni1 patronictl list
docker exec -ti demo-patroni1 etcdctl member list
docker exec -it demo-patroni1 psql -c "
create table demo as select now() ts;
"
I have added the creation of a demo table, and my cluster has the primary in Zone 1, a read replica in Zone 2 and the standby in sync in Zone 3:
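The replication role of each standby can also be checked from the primary by querying the `pg_stat_replication` view. The following is only a minimal sketch: the function name `show_sync_state` is hypothetical, and it assumes that the default `psql` connection inside the container works as in the commands above.

```shell
# Hypothetical helper: list the standbys connected to a node and their sync state.
# Usage: show_sync_state [zone_number]   (defaults to zone 1, the primary here)
show_sync_state() {
  local primary=${1:-1}
  # pg_stat_replication has one row per connected standby;
  # sync_state tells whether it is synchronous or asynchronous.
  docker exec -i "demo-patroni$primary" psql -Atc "
   select application_name, state, sync_state
   from pg_stat_replication
   order by application_name
  "
}
```

Calling `show_sync_state 1` should return one row per connected standby, with a `sync_state` such as `sync` or `async`.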
Run DML on each node
The following loop connects to each node every second. I'm connecting locally, from the containers, because I will disconnect the network to simulate a zone failure and I still want to show what happens with local connections. Usually in the cloud, you have application servers in each zone.
for i in {1..3}
do
(
while sleep 1
do
docker exec -i -e PGCONNECT_TIMEOUT=1 demo-patroni$i psql -Atc "
update demo set ts=now() returning $i node,*
,' === updated from $i ==='
"
docker exec -i -e PGCONNECT_TIMEOUT=1 demo-patroni$i psql -Atc "
select $i node,*
,' === read from $i '
from demo
"
done 2>/dev/null | grep ===
)&
done
Under normal conditions, the reads show the same value (no gap when in sync, limited gap when async) and the update is successful on the primary, returning the updated value that will be seen by the next reads. This is what I have here:
Isolate Zone 1
To simulate a network partition for Zone 1, where the primary database is, I'll disconnect the containers in this zone. I'm using the following function:
isolate(){
docker network disconnect patroni_demo demo-patroni$1 &
docker network disconnect patroni_demo demo-etcd$1 &
wait
}
With this, I isolate Zone 1:
isolate 1
Immediately the updates cannot succeed because they cannot be sync'd to any standby:
Then the automatic failover is initiated by the Patroni agent, to the standby in Zone 3 which will accept new writes after a while:
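Rather than watching the loop output to see when the failover has completed, one option is to poll `patronictl list` from a zone that still has quorum. This is a sketch under assumptions: the helper name `wait_for_new_leader` is hypothetical (it is not part of Patroni), and it assumes that the table printed by `patronictl list` shows the role `Leader` on the same line as the member name.

```shell
# Hypothetical helper: wait until a member other than the old leader is Leader.
# Usage: wait_for_new_leader <container> <old_leader_name> [max_tries]
wait_for_new_leader() {
  local container=$1 old=$2 tries=${3:-60}
  while [ "$tries" -gt 0 ]; do
    # Assumption: the Leader row contains the member name and the word "Leader".
    # Succeed as soon as a Leader row does not mention the old leader.
    if docker exec -i "$container" patronictl list 2>/dev/null |
         grep "Leader" | grep -qv "$old"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "no new leader seen from $container" >&2
  return 1
}
```

For example, `wait_for_new_leader demo-patroni3 patroni1` can be run right after `isolate 1`. It returns once the cluster view from Zone 3 shows a leader other than patroni1, or fails after 60 attempts.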
The new primary is available for reads and writes, but there are two anomalies here. The last successful update was from Zone 1 at 08:16:34. While the primary was stalled, this is the value seen from the replicas in Zone 1 and Zone 2.
- However, it seems that an update was successfully recorded as committed at 08:16:39 in Zone 1, even though it was never seen by the client and never seen from any of the replicas.
- This value, which was never seen as committed, continues to be the last state of the database when querying from Zone 1, once it has been reinstated as a read replica. This lasts as long as the zone is isolated from the others.
In my opinion, this replica in Zone 1, being out of the quorum, should never have been opened as a read replica. This is a read-only split brain, with stale and dirty reads. I'm not aware of a Patroni setting for this. Please comment if I missed something.
Note, however, that the dirty read above is not as bad as it may seem because it doesn't break Atomicity or integrity (the A and C in ACID). Even if the transaction was never considered as successfully committed, the application had the intention to commit it and all integrity checks have passed, or it would have been rolled back. And remember, this is from a read replica: do not expect ACID there.
Zone 1 is back
Such failures can happen in a cloud provider but are of short duration. I reconnect the network:
docker network connect patroni_demo demo-patroni1
docker network connect patroni_demo demo-etcd1
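For repeated tests, the two commands above can be wrapped in a counterpart to `isolate`. This is just a convenience sketch; the name `rejoin` is my own choice, and it reuses the same network and container names.

```shell
# Counterpart of isolate(): reconnect both containers of a zone to the network.
# Usage: rejoin <zone_number>
rejoin(){
 docker network connect patroni_demo demo-patroni$1 &
 docker network connect patroni_demo demo-etcd$1 &
 # Wait for both background reconnections to finish
 wait
}
```

With this, `rejoin 1` brings Zone 1 back in the same way that `isolate 1` removed it.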
Zone 1 has to catch up with the others and is then finally consistent with them:
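One way to verify this without reading the loop output is to compare the value of the demo table on all three nodes. This is a minimal sketch with a hypothetical function name, `check_consistent`. It assumes the update loop is stopped, or at least that no update lands between the three reads, otherwise the values can differ transiently.

```shell
# Hypothetical check: succeed only if all three nodes answer with the same value.
check_consistent() {
  local i values lines distinct
  values=$(
    for i in 1 2 3; do
      docker exec -i -e PGCONNECT_TIMEOUT=1 "demo-patroni$i" psql -Atc "select ts from demo"
    done
  )
  printf '%s\n' "$values"
  # Count the answers received and the distinct values among them
  lines=$(printf '%s\n' "$values" | grep -c .)
  distinct=$(printf '%s\n' "$values" | sort -u | grep -c .)
  [ "$lines" -eq 3 ] && [ "$distinct" -eq 1 ]
}
```

The function prints the values it read and returns a non-zero status if a node did not answer or if the values differ.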
Why is it important to understand that?
Even if the failover is automated and successful, there are still many manual actions to take afterwards. Make sure that no critical reports were shared publicly with the wrong result from the stale replica. You should also check your backups if you take them from the replica, because they are corrupted with a value that has never been considered as committed. Even if many things have been automated, thanks to PostgreSQL synchronous replication and Patroni orchestration, a primary/standby configuration is still a DR (Disaster Recovery) solution and not a full HA (High Availability) one that is transparently resilient to failures. That's one point that is addressed by cloud-native Distributed SQL: you don't have to wake anyone up if an availability zone is down during the night.