SRE book notes: Managing Critical State, Distributed Consensus for Reliability

Hercules Lemke Merscher - Feb 9 '23 - Dev Community

These are my notes on Chapter 23, Managing Critical State: Distributed Consensus for Reliability, from the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series. The previous post can be seen here:

This post is rather short. The chapter dives into many complex topics, such as the CAP theorem and consensus algorithms like Raft and Paxos, so any attempt to summarize them in a few sentences is doomed to fail. The chapter and its references are a must-read if you intend to manage state in a distributed manner.


In fact, many distributed systems problems turn out to be different versions of distributed consensus, including master election, group membership, all kinds of distributed locking and leasing, reliable distributed queuing and messaging, and maintenance of any kind of critical shared state that must be viewed consistently across a group of processes. All of these problems should be solved only using distributed consensus algorithms that have been proven formally correct, and whose implementations have been tested extensively. Ad hoc means of solving these sorts of problems (such as heartbeats and gossip protocols) will always have reliability problems in practice.
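
To make the quote above a bit more concrete, here is a minimal sketch of the majority rule that consensus-based leader election relies on. It is not Paxos or Raft, which add far more (terms or ballots, replicated logs, recovery), and the names `majority` and `elect_leader` are made up for this illustration. It only shows why a node cut off with a minority of the cluster cannot become leader, whereas a heartbeat-only scheme could let both sides of a partition promote themselves.

```python
from typing import Optional


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def elect_leader(candidate: str, reachable: set[str], cluster: set[str]) -> Optional[str]:
    """Return the candidate as leader only if it can reach a majority.

    `reachable` is the set of nodes (the candidate included) that the
    candidate can currently talk to. Hypothetical helper, not a real API.
    """
    votes = len(reachable & cluster)
    return candidate if votes >= majority(len(cluster)) else None


cluster = {"a", "b", "c", "d", "e"}

# A network partition splits the cluster into {a, b} and {c, d, e}.
minority_side = elect_leader("a", {"a", "b"}, cluster)
majority_side = elect_leader("c", {"c", "d", "e"}, cluster)

print(minority_side)  # None: 2 of 5 is not a majority
print(majority_side)  # c: 3 of 5 is a majority
```

Because two disjoint groups can never both hold a majority, at most one side of a partition can elect a leader. In production, rely on a proven and well-tested implementation rather than a hand-rolled check like this one.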


When making decisions about location of replicas, remember that the most important measure of performance is client perception.
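
As a rough illustration of client perception, the sketch below estimates the write latency a client sees for each possible leader location. It rests on simplifying assumptions of mine, not on anything the chapter states: a stable leader, one round trip from the client to the leader, and one round trip from the leader to enough followers to form a majority. The site names and round-trip times are made up.

```python
# Illustrative round-trip times in milliseconds between sites (made-up numbers).
RTT_MS = {
    frozenset({"us", "eu"}): 90.0,
    frozenset({"us", "asia"}): 150.0,
    frozenset({"eu", "asia"}): 240.0,
}


def rtt(a: str, b: str) -> float:
    """Round-trip time between two sites; zero within the same site."""
    return 0.0 if a == b else RTT_MS[frozenset({a, b})]


def quorum_rtt(leader: str, replicas: list[str]) -> float:
    """Time for the leader to hear back from enough followers for a majority."""
    acks_needed = len(replicas) // 2  # the leader's own vote counts too
    follower_rtts = sorted(rtt(leader, r) for r in replicas if r != leader)
    return follower_rtts[acks_needed - 1] if acks_needed else 0.0


def client_write_latency(client: str, leader: str, replicas: list[str]) -> float:
    """Client-perceived latency: client-to-leader plus leader-to-quorum."""
    return rtt(client, leader) + quorum_rtt(leader, replicas)


replicas = ["us", "eu", "asia"]
for leader in replicas:
    print(leader, client_write_latency("eu", leader, replicas))
# us 180.0
# eu 90.0
# asia 390.0
```

Under this model, a client in "eu" sees very different latency depending on where the leader sits, even though the set of replicas is the same in all three cases.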


Whenever you see leader election, critical shared state, or distributed locking, think about distributed consensus: any lesser approach is a ticking bomb waiting to explode in your systems.




