SRE book notes: Data Integrity, What You Read Is What You Wrote

Hercules Lemke Merscher - Feb 14 '23 - Dev Community

These are the notes from Chapter 26, "Data Integrity: What You Read Is What You Wrote", of the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series. The previous post can be seen here:


We might say data integrity is a measure of the accessibility and accuracy of the datastores needed to provide users with an adequate level of service. But this definition is insufficient.

When considering data integrity, what matters is that services in the cloud remain accessible to users. User access to data is especially important.


No one really wants to make backups; what people really want are restores.

Backup and Restore


From the user’s point of view, data integrity without expected and regular data availability is effectively the same as having no data at all.


In designing a data integrity program, it’s important to recognize that replication and redundancy are not recoverability.

Datastores that automatically sync multiple replicas guarantee that a corrupt database row or errant delete is pushed to all of your copies, likely before you can isolate the problem.


Defense in depth comprises multiple layers, with each successive layer of defense conferring protection from progressively less common data loss scenarios.

The first layer is soft deletion (or "lazy deletion" in the case of developer API offerings), which has proven to be an effective defense against inadvertent data deletion scenarios. The second line of defense is backups and their related recovery methods. The third and final layer is regular data validation.
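
The chapter doesn't tie soft deletion to any particular implementation; the core idea is to mark data as deleted and only purge it after a retention window, so an inadvertent deletion can still be reverted. Here is a rough sketch in Python with SQLite (the `users` table, the `deleted_at` column, and the 30-day window are all made up for the example):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical purge window


def soft_delete(conn: sqlite3.Connection, user_id: int) -> None:
    # Mark the row instead of removing it; it stays recoverable until purged.
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), user_id),
    )


def undelete(conn: sqlite3.Connection, user_id: int) -> None:
    # Recovering from an inadvertent deletion is just clearing the marker.
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def purge_expired(conn: sqlite3.Connection) -> None:
    # Only rows soft-deleted longer ago than the retention window are
    # permanently destroyed.
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    conn.execute(
        "DELETE FROM users WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
```

The point of the first layer is that the destructive step is deferred, which gives humans and tooling time to notice a mistake before it becomes permanent.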


The most common and largely effective technique used to back up massive amounts of data is to establish "trust points" in your data—portions of your stored data that are verified after being rendered immutable, usually by the passage of time.
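
One way to picture trust points is a registry of checksums taken once a chunk of data is considered sealed; backups and restores can later be verified against it. A toy sketch, assuming a file-based layout and a `trust_points.json` registry (both inventions for this example, not anything from the book):

```python
import hashlib
import json
from pathlib import Path

TRUST_FILE = Path("trust_points.json")  # hypothetical checksum registry


def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_trust_point(segment: Path) -> None:
    # Call this once a data segment is sealed (e.g., an old, closed partition).
    points = json.loads(TRUST_FILE.read_text()) if TRUST_FILE.exists() else {}
    points[str(segment)] = checksum(segment)
    TRUST_FILE.write_text(json.dumps(points, indent=2))


def verify_trust_points() -> list[str]:
    # Returns the segments whose current contents no longer match the
    # checksum recorded when they were declared immutable.
    points = json.loads(TRUST_FILE.read_text())
    return [seg for seg, digest in points.items() if checksum(Path(seg)) != digest]
```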


“Bad” data doesn’t sit idly by; it propagates. References to missing or corrupt data are copied, links fan out, and with every update the overall quality of your datastore goes down. Subsequent dependent transactions and potential data format changes make restoring from a given backup more difficult as the clock ticks. The sooner you know about a data loss, the easier and more complete your recovery can be.


Shunting some developers to work on a data validation pipeline can slow engineering velocity in the short term. However, devoting engineering resources to data validation endows other developers with the courage to move faster in the long run, because the engineers know that data corruption bugs are less likely to sneak into production unnoticed.


The central infrastructure team maintains the out-of-band data validation framework, while the product engineering teams maintain the custom business logic at the heart of the validator to keep pace with their evolving products.
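
Roughly, that split could look like a shared framework that handles registration and reporting, with each product team plugging in its own invariant checks. A made-up illustration (the framework class, the snapshot format, and the mailbox invariant are all hypothetical):

```python
from typing import Callable, Iterable

Validator = Callable[[dict], Iterable[str]]  # returns human-readable problems


class OutOfBandValidationFramework:
    """Owned by the central infrastructure team."""

    def __init__(self) -> None:
        self._validators: list[Validator] = []

    def register(self, validator: Validator) -> Validator:
        self._validators.append(validator)
        return validator

    def run(self, snapshot: dict) -> list[str]:
        problems: list[str] = []
        for validator in self._validators:
            problems.extend(validator(snapshot))
        return problems


framework = OutOfBandValidationFramework()


# Product-team-owned business logic: e.g. every message must belong to an
# existing account (an invariant invented for this sketch).
@framework.register
def messages_reference_existing_accounts(snapshot: dict) -> list[str]:
    accounts = {a["id"] for a in snapshot["accounts"]}
    return [
        f"message {m['id']} references missing account {m['account_id']}"
        for m in snapshot["messages"]
        if m["account_id"] not in accounts
    ]
```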


If you take away just one lesson from this chapter, remember that you only know that you can recover your recent state if you actually do so.

If recovery tests are a manual, staged event, testing becomes an unwelcome bit of drudgery that isn’t performed either deeply or frequently enough to deserve your confidence. Therefore, automate these tests whenever possible and then run them continuously.
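
As a sketch of what "automate and run continuously" might mean in the simplest case, here is a loop that restores the latest backup into a scratch location and runs a sanity check. The backup path, the file-copy "restore", and the daily cadence are placeholders for whatever your real tooling provides:

```python
import logging
import shutil
import time
from pathlib import Path

logging.basicConfig(level=logging.INFO)

BACKUP = Path("/backups/latest.db")      # assumed location of the newest backup
SCRATCH = Path("/tmp/restore-test.db")   # throwaway restore target


def restore_latest_backup() -> Path:
    # Stand-in for real restore tooling; here it's just a file copy.
    shutil.copy(BACKUP, SCRATCH)
    return SCRATCH


def looks_sane(restored: Path) -> bool:
    # Minimal sanity check; a real test would compare row counts, checksums,
    # and application-level invariants against expectations.
    return restored.exists() and restored.stat().st_size > 0


if __name__ == "__main__":
    while True:  # "run them continuously"
        ok = looks_sane(restore_latest_backup())
        logging.info("recovery test %s", "passed" if ok else "FAILED")
        time.sleep(24 * 60 * 60)  # e.g. once a day
```

A real pipeline would also alert when the restore itself fails, not just when the checks do, so that a broken recovery path is discovered long before it is needed.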


Failures are inevitable. If you wait to discover them when you’re under the gun, facing a real data loss, you’re playing with fire. If testing forces the failures to happen before actual catastrophe strikes, you can fix problems before any harm comes to fruition.


General Principles of SRE as Applied to Data Integrity

  • Beginner’s Mind: Never think you understand enough of a complex system to say it won’t fail in a certain way.
  • Trust but Verify: Perfect algorithms may not have perfect implementations.
  • Hope Is Not a Strategy: Prove that data recovery works with regular exercise, or data recovery won’t work.
  • Defense in Depth: The best data integrity strategies are multitiered—multiple strategies that fall back to one another and address a broad swath of scenarios together at reasonable cost.

Data availability must be a foremost concern of any data-centric system.

Recognizing that not just anything can go wrong, but that everything will go wrong is a significant step toward preparation for any real emergency.


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.


Photo by Jandira Sonnendeck on Unsplash
