Different Roles for Read Replicas in PostgreSQL and YugabyteDB

Franck Pachot - Aug 8 - - Dev Community

Enterprises have not waited for distributed SQL to start scaling out their databases. In cloud environments, running all your application use cases on a single instance is not a solution due to the high cost of large instances and their increased vulnerability to failures in the cloud.
To guard against failures with minimal downtime, databases in the cloud usually run with two standby databases in different availability zones. In case of failure, these standby database instances must be provisioned to be the same size as the primary database to handle the same workload. However, running two large and mostly idle instances is not cost-effective.

As a result, many companies attempt to offload some activity to the standby databases. In traditional databases, the standby cannot do much as it does not share the state (shared buffer cache, lock management, and transaction status) with the primary instance. This is due to the monolithic architecture of those databases, and the standby replica can only receive changes with a delay through WAL streaming.
Synchronous commit is not enough to guarantee ACID properties as it doesn't synchronize the current state of ongoing transactions. Read replicas are always behind the current state.

What can be offloaded is the read-only activity that can accept reading a stale state, such as some reporting. Multi-Version Concurrency Control (MVCC) databases can read a consistent snapshot of the database without interacting with the write activity, so the read replica can provide a read-only view of a consistent state even if it is not the current state. It is known as timeline consistency. However, there is still some dependency on the write activity because the writes, shipped through the WAL, must be applied to the shared buffers accessed concurrently by read-only queries. A delay in applying the WAL on a replica can block operations in the primary.

There are two common configurations: streaming replication, where the WAL (Write Ahead Logging, transaction journal, or redo log) is sent to the read replica to be applied to its shared buffers and checkpointed to its local storage, and the other, as in Amazon Aurora, AlloyDB or Neon, when the WAL is sent to distributed storage servers that are shared between the primary and the replicas. In both cases, the WAL must be sent to the read replica instances to maintain their buffer cache. These Multi-AZ configurations are detailed in the following article:

PostgreSQL

PostgreSQL adds complexity due to its implementation of Multi-Version Concurrency Control. When writing data, PostgreSQL does the minimum, which can leave the database in an inefficient state for queries. Additional garbage collection, called VACUUM, must address this and run frequently. Long queries can block VACUUM and impact the database in several ways, such as creating more bloat on the primary side, increasing replication lag and then the staleness of the read-only transactions, or impacting the availability of the read replica, canceling the queries or even re-starting the read replica.
This is well described by AWS in Manage long-running read queries on Amazon Aurora PostgreSQL-Compatible Edition

YugabyteDB

In YugabyteDB, all nodes are distributed and available for consistent reads and writes. This YugabyteDB architecture has two significant differences from PostgreSQL.

Firstly, replication is not a layer built on top of the Write-Ahead Log (WAL) used for recovery. Instead, it is built into the distributed transactional storage. Table rows, index entries, and transaction intents are distributed to "tablets" with the Raft algorithm, with each tablet having its own Raft leader that coordinates the reads and writes for its small set of rows to ensure consistent reads and writes.
Raft followers are updated with synchronization to the quorum, and while an individual follower may lag and need to reach a read quorum to know the latest state, the staleness of the Raft followers is known and bound as they receive heartbeats from the Raft leader.
Typically, a single follower can provide timeline snapshots with bounded staleness, which cannot be lower than a few seconds to account for heartbeats (occurring every 500 milliseconds), network delay (in 100s of seconds when geo-distributed), and maximum clock skew (500ms without precision time protocol using atomic clocks). As the Multi-Version Concurrency Control (MVCC) retention is higher, a Raft follower can typically provide a guaranteed consistent staleness of 15 seconds, with high performance and availability.

In YugabyteDB, no VACUUM process can be blocked by the write activity, as seen in PostgreSQL. Instead, the MVCC garbage collection occurs at a lower level and is combined with SST File compaction. This process does not disrupt ongoing writes and operates independently on each tablet replica. Reading from a Raft follower with a 15-second lag from the current state will not affect the Raft leader, and the replication process is not delayed by read activity. Additionally, long queries are only canceled if they exceed the configurable MVCC retention period, similar to the undo_retention of Oracle database.

With YugabyteDB, there is no trade-off between read workload availability, impact on the primary workload, and staleness. You can ignore the PostgreSQL parameters like max_standby_streaming_delay, hot_standby_feedback or vacuum_defer_cleanup_age and connect to any node to distribute all workloads.

From an application point of view, you don't need to separate the use cases that can accept stale reads. The application uses YugabyteDB as one global SQL database, which is resilient to failure and can scale with elasticity and high availability without provisioning idle resources on a standby instance. The database distributes reads and writes transparently to the tablet leaders, and adding new nodes will automatically rebalance the load.

Suppose you implemented Command Query Responsibility Segregation (CQRS). In that case, running the Query use cases from a dedicated connection pool that sets the transactions read-only is beneficial, but it is not mandatory. Two features of YugabyteDB that can benefit from it are follower reads and read replicas.

Follower reads

The tablet leader, elected by the Raft algorithm on a per-tablet basis, handles all consistent reads and writes. The leader has a lease duration, typically two seconds, to prevent split-brain situations in the event of a network partition. The lease is renewed when the leader communicates with the quorum. If the majority of replicas are isolated, they can elect a new leader in the remaining quorum, ensuring that transactions can continue.
It might seem that the tablet followers are only there for resilience and are not used beyond failure situations. However, this is not an issue regarding resources because they are balanced across the cluster, and all nodes have some leaders and followers. This setup has the advantage that, in the event of failure, a new leader is immediately available without needing recovery. The follower's SST files are already compacted and start with a warm cache on the system.

Still, the Raft followers can be used for reads. When you set the transaction read-only and allow reading from followers (set yb_read_from_followers to on) with a defined staleness (set yb_follower_read_staleness_ms to 30000), the read operations will not always be sent to the Raft leader but to the closest replica, which can be a follower. The distance is evaluated in the cloud topology hierarchy: node, zone, region, and cloud provider.

The previous paragraphs served as an extensive introduction, and the main topic of read replicas has not been addressed yet. This is because they are unnecessary in a distributed SQL database where all nodes in the cluster are active. Despite this, they still serve the purpose of speeding up reporting in remote regions.

Read replicas

Why would we need another feature to serve our read-only workloads when all nodes in the cluster can handle them from leaders and followers? Stretching a cluster geographically introduces latency to maintain consistency. We want to avoid this drawback when we can accept a limited staleness.
For example, with two close regions like Frankfurt and London, and an additional one in Mumbai, in YugabyteDB, I can define a replication factor of 3 where the Raft leaders are always in Europe, along with one follower. The consistent writes in Europe remain fast because they receive quorum acknowledgement with a 10ms latency. Reporting in Asia is also fast as it can read from the follower. However, a YugabyteDB cluster also aims for resilience to failure. If one of the European regions goes down, the database must continue to accept reads and writes. In this case, the quorum acknowledgment for writes will now involve Asia and a 200ms latency.
The solution is to keep the replication factor at 3 in Europe and add additional replicas in India.

In YugabyteDB, a Read Replica extends the cluster, which is then called a "universe," by adding extra tablet peers for all tablets in each read replica location. These peers do not act as leaders and are not involved in the write quorum, but they receive writes asynchronously. They have an observer role, similar to the followers when receiving raft log entries, but they will never participate in the quorum, have no voting rights, and will never become a leader.

Like followers, read replicas can provide timeline-consistent reads for read-only transactions but are not used for high availability or disaster recovery. By renouncing this responsibility, read replicas do not impact writes and are an ideal solution for reporting in remote regions.

A read-replica region contains a copy of each tablet from the primary cluster's tablet servers. However, they share the universe metadata stored in the YB-Master in the primary regions. This means that some operations, like connection pool initialization or the first parsing of queries, may involve cross-region latency. Once the connections are established and the YB-TServer cache is warmed up, the queries will be local as long as the read replicas are within the accepted staleness.

You can add multiple nodes and set a replication factor to ensure elasticity and resilience in the read-replica region. An odd number is unnecessary because there is no need for a quorum, so a replication factor of 2 is possible. This setup allows the read workload to continue locally even in the event of a failure.

xCluster replication

When discussing solutions to reduce the latency of read-only workloads, it's important to mention another feature of YugabyteDB. In a scenario with read replicas, queries are offloaded, but the primary cluster still needs to be available. Another solution is to separate the read-only workload by creating another Universe in a remote region and synchronizing it asynchronously with cross-cluster replication.


With YugabyteDB, you can handle read-only workloads without demoting to eventual consistency. Cluster replication is synchronous to quorum, ensuring horizontal scalability for all workloads, with high performance and resilience. Additionally, you can add asynchronous replication and run regional read replicas to accommodate latency variations and support multi-region deployments. It's worth noting that read replicas operate on dedicated nodes with their compaction. They can have a higher MVCC retention than the primary server, making them suitable for long-running queries.

Geo-Distributed Database Options | Yugabyte

Geo-distributed databases are crucial to build and deploy global applications that meet customer expectations. Here are 6 deployment options.

favicon yugabyte.com
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .