Citus is not ACID but Eventually Consistent

Franck Pachot - Aug 15 '23 - - Dev Community

In the previous article, I explained the distinction between database sharding (as seen in Citus) and Distributed SQL (such as YugabyteDB) in terms of architectural nuances: sharding above the SQL layer leads to a compromise on SQL features.

In this current post, I will delve into the same subject matter, but with a practical approach: I will conduct a straightforward pgbench test on Citus, to demonstrate its eventual consistency rather than strict ACID compliance. Throughout this demonstration, I will interject comments in italic to facilitate a comparison with YugabyteDB, thereby highlighting the disparity between sharded databases and Distributed SQL. A next post will show the same demo but on a YugabyteDB cluster.

Startup a Citus cluster with Patroni

Citus lacks inherent fault tolerance as each of its shards operates as an independent PostgreSQL instance. In order to mitigate potential downtimes resulting from failures, a viable approach involves safeguarding each shard with a corresponding standby database. Given the intricacy of this setup, Patroni can prove to be instrumental in achieving this objective. To facilitate this demonstration, I will employ the Citus+Patroni docker-compose-citus.yml configuration available at https://github.com/zalando/patroni.git:



git clone https://github.com/zalando/patroni.git
cd patroni
docker build -t patroni-citus -f Dockerfile.citus .
docker compose -f docker-compose-citus.yml up -d
docker exec -it demo-haproxy patronictl list



Enter fullscreen mode Exit fullscreen mode

Image description

This setup employs an HA Proxy to evenly distribute connections among the primary shards via a coordinator: demo-haproxy on port 5000.

YugabyteDB offers an inherent feature of automatic distribution and replication, accompanied by fault tolerance as soon as a cluster of 3 nodes is initiated. The capability to seamlessly integrate additional nodes further ensures automatic load rebalancing. In the event of a node failure, transactions proceed without disruption. Notably, the absence of a dedicated coordinator enables direct connection to any available node.

Create pgbench tables

I create the pgbench tables:



docker exec -it \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy pgbench -iIdtGvp citus



Enter fullscreen mode Exit fullscreen mode

Image description

The standard CREATE TABLE run by pgbench -iI results in the tables being generated exclusively on the coordinator. In order to achieve scalability, it becomes imperative to distribute these tables:



docker exec -i \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy psql citus <<'SQL'
select create_distributed_table('pgbench_accounts','aid');
select create_distributed_table('pgbench_branches','bid');
select create_distributed_table('pgbench_history','aid');
select create_distributed_table('pgbench_tellers','tid');
SQL



Enter fullscreen mode Exit fullscreen mode

Image description

My options for selecting a distribution column were limited due to the constraint that it must be included in the primary key. Otherwise, an error would occur with the message:
ERROR: Distributed relations cannot have UNIQUE, EXCLUDE, or PRIMARY KEY constraints that do not include the partition column (with an equality operator if EXCLUDE).

Note: Unlike the mentioned process for Citus, YugabyteDB, being PostgreSQL compatible, eliminates the necessity for this extra step. A simple CREATE TABLE command is adequate for table creation. Tables are automatically distributed based on the primary key, and the flexibility to incorporate additional UNIQUE indexes do not depend on the sharding key. Distributed SQL, distinct from conventional database sharding, provides all SQL features for distributed tables.

This aspect is intriguing, as there exists a prevalent misconception that a PostgreSQL extension, in contrast to a fork, retains the entirety of PostgreSQL's features. However, the reality is quite the opposite: forks possess the capacity to incorporate their distinct code at levels that extensions are unable to access. Indeed, extensions might necessitate the introduction of supplementary functions that can potentially modify the way users interact with the SQL database.

Run pgbench transactions

To evaluate Citus's PostgreSQL compatibility, I will execute the pgbench test for a duration of 10 seconds:



docker exec -it \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy pgbench -T 10 citus


Enter fullscreen mode Exit fullscreen mode

Image description
I'm running this on a lab. The timing is not important. What matters is that there's no error in the execution and that the result is correct, in a consistent state.

Check table consistency

Similar to numerous OLTP applications, the pgbench application is updating multiple tables. This is done atomically in a transaction, leveraging the inherent ACID properties of the database to prevent anomalies stemming from race conditions. I'm employing the default pgbench workload: TPC-B. To further examine the transaction statements, you can refer to the following command pgbench --show-script=tpcb:

Image description

To verify consistency, I initiate a straightforward query that performs a balance comparison across the Accounts, Branches, and Tellers. Given that these entities are concurrently updated within an atomic transaction, their balances invariably remain equal:



docker exec -i \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy psql citus <<'SQL'
select a.sum, b.sum, t.sum , a.sum = b.sum and b.sum = t.sum
from
  ( select sum(abalance) from pgbench_accounts ) as a
 ,( select sum(bbalance) from pgbench_branches ) as b
 ,( select sum(tbalance) from pgbench_tellers  ) as t
;
SQL


Enter fullscreen mode Exit fullscreen mode

Image description

However, it's crucial to highlight a pivotal aspect. Citus operates on an eventual consistency model. Once all transactions conclude, the database attains a consistent state, as demonstrated earlier. While this characteristic is advantageous for data warehouses, it's essential to note that for OLTP scenarios, strong consistency is imperative. The database must maintain consistency even while ongoing transactions actively modify the database.

Run OLTP workload

Simulating OLTP is easy by perpetually executing transactions. I run the same as above, but in the background (&) and for 60 seconds -T 60.




docker exec -it \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy pgbench -T 60 citus &

docker exec -i \
 -e PGPASSWORD=postgres -e PGPORT=5000 -e PGHOST=demo-haproxy \
 demo-haproxy psql citus <<'SQL'
select a.sum, b.sum, t.sum , a.sum = b.sum and b.sum = t.sum
from
  ( select sum(abalance) from pgbench_accounts ) as a
 ,( select sum(bbalance) from pgbench_branches ) as b
 ,( select sum(tbalance) from pgbench_tellers  ) as t
;
\watch 1
SQL


Enter fullscreen mode Exit fullscreen mode

I check the consistency every second and the result on the Citus distributed tables is always inconsistent:

Image description

This behavior stems from the Citus Data distribution approach. It's important to note that distributed transactions within Citus do not adhere to the ACID properties. While writes maintain consistency through Two-Phase Commit (2PC), this introduces latency due to synchronization across all nodes, specifically while updating the local transaction table in each shard. That would not be scalable for writes and the architecture doesn't opt for Two-Phase Locking (2PL) to verify transaction status across all nodes, which would incur substantial latency. Consequently, Citus operates differently, optimizing reads for speed while potentially sacrificing consistency.

Eventual Consistency

Throughout the duration of the pgbench activity, all queries that were executed exhibited a lack of consistency. Contrary to claims made by certain NoSQL databases, the notion of Eventual Consistency does not entail reaching a fully consistent state at any point in time. The reality remains that, as long as your application is actively operational, the data read from the database remains in an inconsistent state. True consistency is only attained when all concurrent write operations are stopped (and with no network failure), like when pgbench is completed:

Image description

Indeed, within the context of Citus Data, read consistency is exclusively achieved when concurrent activities come to an end. Citus proves advantageous for data warehousing scenarios, such as loading data overnight and querying during the day. However, for OLTP applications, it's important to recognize that fundamental SQL features, including ACID properties, consistent reads, referential integrity, and unique constraints, are only applicable within individual shards. The global cluster cannot be seen as one SQL database.

Here is the quote from the SIGMOD ’21 paper by Microsoft explaining this in 3.7.4 Multi-node transaction trade-offs:

A concurrent multi-node query could obtain a local MVCC snapshot
before commit on one node, and after commit on another. Addressing this would require changes to PostgreSQL to make the snapshot
manager extensible.

Contrary to popular belief, this also underscores that being an extension does not inherently ensure enhanced PostgreSQL compatibility. In the case of YugabyteDB, it operates as a fork, making the necessary changes in the PostgreSQL codebase in order to establish equivalent transaction semantics on distributed tables.

It is not a surprise that the Microsoft managed service using Citus, also known as "Azure Database for PostgreSQL - Hyperscale (Citus)", is part of the "Azure Cosmos DB" familiy where read consistency applies to single shard only. While sharding databases do enhance operational efficiency across multiple monolithic databases, it's crucial to acknowledge that they do not fully align with the characteristics of a genuine (Distributed) SQL database.

The good news is that you can run a managed YugabyteDB on Azure: https://www.yugabyte.com/cloud/microsoft-azure/

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .