The two layers, SQL processing and distributed storage, are deployed in one binary: yb-tserver. The yb-tservers handle both query processing and data storage, and they scale linearly. Shared components, such as the cluster metadata and the PostgreSQL catalog, are stored separately in the yb-master. The yb-master behaves like a single tablet: it is replicated for high availability but does not need to scale. The yb-tservers cache the catalog data they need to run SQL statements efficiently.
When deploying on Kubernetes, yb-master and yb-tserver are two StatefulSets.
The yb-tserver tablets and the yb-master single tablet are replicated to ensure high availability. The fault tolerance of the database is determined by the placement information (cloud, region, and zone) and the replication factor. For instance, with a replication factor of 3 over three availability zones, with one yb-master per zone and multiple yb-tservers, the database can withstand the failure of one availability zone. Similarly, with a replication factor of 5 across three regions, the database remains accessible if one entire region becomes inaccessible, and it can withstand an additional failure in the remaining regions. This level of resilience suits a cloud environment, where transient failures happen frequently and you may want to stay available even during a rarer regional outage.
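The arithmetic behind these examples is the usual majority rule. The following is a minimal sketch of that rule only, not YugabyteDB code; the function names are made up for illustration.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replication_factor - quorum_size(replication_factor)


for rf in (3, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, "
          f"tolerated replica failures={tolerated_failures(rf)}")
```

With a replication factor of 3 the quorum is 2, so one replica (for example, one zone holding one replica) can be lost. With a replication factor of 5 the quorum is 3, so two replicas can be lost in total; how many of those a region outage consumes depends on how the replicas are placed across regions.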
YugabyteDB employs the Raft algorithm to elect a leader for each tablet, which performs well for SQL applications where read activity is critical. In SQL, even write operations must read first to guarantee ACID properties, so read performance is essential.
Reads and writes are directed to the tablet leader, which coordinates replication, so consistent reads do not need to contact additional nodes. Writes are replicated to the Raft followers and are acknowledged only after a majority of the tablet peers confirm that the write reached their local storage. This approach delivers strong consistency along with high performance. If the leader becomes inaccessible because of a failure, a multi-node quorum is required only to elect a new leader, without any data loss, and reads can then be served from the new leader.
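The acknowledgment rule can be sketched as follows. This is a simplified illustration, assuming the majority is counted over all tablet peers with the leader included, as in standard Raft; `write_acknowledged` is a hypothetical helper, not a YugabyteDB function.

```python
def write_acknowledged(leader_persisted: bool, follower_acks: list[bool]) -> bool:
    """Return True once a majority of the tablet peers hold the write.

    leader_persisted: the leader appended the entry to its own log.
    follower_acks: one flag per follower, True if it confirmed the entry.
    """
    peers = 1 + len(follower_acks)      # leader + followers
    majority = peers // 2 + 1
    confirmed = int(leader_persisted) + sum(follower_acks)
    return confirmed >= majority


# Replication factor 3: the leader plus one follower is enough.
print(write_acknowledged(True, [True, False]))   # True
# Replication factor 3: the leader alone is not a majority.
print(write_acknowledged(True, [False, False]))  # False
```

Because the leader already holds every acknowledged write, it can serve a consistent read without asking the followers.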
Each tablet forms an independent Raft group and generates its own Write-Ahead Log (WAL), which is the Raft log distributed to the tablet peers. As the database grows, tablets can split, allowing for linear scalability, unlike traditional databases that have a single stream of WAL for the entire database. This horizontal scalability follows the CP side of the CAP theorem: in the event of a network partition, only the part with a quorum remains available, to ensure consistency. The database is therefore highly available but not fully available.
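The CP behavior during a network partition can be illustrated with a small sketch. It assumes a single tablet whose peers are identified by hypothetical node names, and it only checks which side of the partition still holds a majority of those peers.

```python
from typing import List, Optional, Set


def side_with_quorum(peers: Set[str], sides: List[Set[str]]) -> Optional[Set[str]]:
    """Return the partition side that can still serve this tablet, if any."""
    majority = len(peers) // 2 + 1
    for side in sides:
        if len(peers & side) >= majority:
            return side
    return None  # no side has a majority: the tablet is unavailable


tablet_peers = {"node-a", "node-b", "node-c"}

# One node is cut off: the two-node side keeps serving the tablet.
print(side_with_quorum(tablet_peers, [{"node-a"}, {"node-b", "node-c"}]))

# Every node is isolated: no side can serve consistent reads or writes.
print(side_with_quorum(tablet_peers, [{"node-a"}, {"node-b"}, {"node-c"}]))  # None
```

Since each tablet is its own Raft group, this check applies per tablet: the side that keeps a majority can elect leaders and continue, while the minority side stops serving consistent operations.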
There have been claims that Raft can lose updates, but this is incorrect. Raft is a consensus protocol that ensures all servers agree on a value. If they cannot reach a consensus, the write is not acknowledged. YugabyteDB employs additional techniques on top of Raft to guarantee this. Leaders hold a lease to prevent split-brain scenarios: a newly elected leader must wait for the two-second lease of the previous leader to expire before accepting reads and writes.
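The lease idea can be sketched with two predicates. This is a deliberately simplified model that assumes a single shared timeline and ignores clock skew; the two-second value comes from the text above, and the function names are hypothetical.

```python
LEASE_SECONDS = 2.0  # lease duration mentioned in the text


def old_leader_may_serve(now: float, last_renewal: float,
                         lease: float = LEASE_SECONDS) -> bool:
    """The old leader serves only while its last renewed lease is valid."""
    return now < last_renewal + lease


def new_leader_may_serve(now: float, old_last_renewal: float,
                         lease: float = LEASE_SECONDS) -> bool:
    """The new leader waits until the old leader's lease has expired."""
    return now >= old_last_renewal + lease


# The old leader last renewed its lease at t=10.0 and was then isolated.
# A new leader is elected at t=10.5 but must wait until t=12.0.
for t in (10.5, 11.9, 12.0):
    print(t, old_leader_may_serve(t, 10.0), new_leader_may_serve(t, 10.0))
```

At no instant do both predicates return true, which is the split-brain protection: an isolated old leader stops serving before the new leader starts.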
Additionally, a Hybrid Logical Clock, which combines a physical time component (whose uncertainty is bounded by the maximum clock skew) with a Lamport-style logical counter, helps YugabyteDB remain linearly scalable while guaranteeing consistency. These techniques ensure consistency at the cost of briefly reducing availability in the event of failure.
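To make the Hybrid Logical Clock idea concrete, here is a minimal sketch of the generic algorithm as it is usually described in the literature. It is not YugabyteDB's implementation, it does not model the maximum clock skew, and the class and method names are assumptions for illustration. The physical clock is injected as a function so the example is deterministic.

```python
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (physical component, logical counter)


class HybridLogicalClock:
    def __init__(self, physical_now: Callable[[], int]) -> None:
        self._now = physical_now
        self.physical = 0
        self.logical = 0

    def tick(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        wall = self._now()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1  # wall clock did not advance: use the counter
        return (self.physical, self.logical)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried by an incoming message."""
        wall = self._now()
        r_phys, r_log = remote
        new_phys = max(self.physical, r_phys, wall)
        if new_phys == self.physical and new_phys == r_phys:
            self.logical = max(self.logical, r_log) + 1
        elif new_phys == self.physical:
            self.logical += 1
        elif new_phys == r_phys:
            self.logical = r_log + 1
        else:
            self.logical = 0
        self.physical = new_phys
        return (self.physical, self.logical)


# Node B's wall clock lags behind node A's by 10 units.
node_a = HybridLogicalClock(lambda: 100)
node_b = HybridLogicalClock(lambda: 90)

sent = node_a.tick()             # (100, 0)
received = node_b.update(sent)   # (100, 1): ordered after the send
later = node_b.tick()            # (100, 2): still ordered after it
print(sent, received, later)
```

Even though node B's wall clock is behind, the timestamps it assigns after receiving the message compare greater than the sender's, so causally related events stay ordered without a central clock.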
A simultaneous failure of all tablet peers is improbable, but you can choose to fsync each WAL write for additional protection, at the cost of extra latency.
YugabyteDB's resilience relies on the Raft consensus. It can withstand failures as long as the majority of replicas can communicate. No node serves consistent reads and writes if it is not part of this quorum, which guarantees no data loss within the defined fault tolerance.