Right now, there’s an engineer at CrowdStrike feeling the weight of the world on their shoulders. On July 19, 2024, at 04:09 UTC, CrowdStrike released a configuration update for its Falcon threat detection product to Windows systems, as part of normal operations. This update inadvertently triggered a logic error, causing system crashes and blue screens of death (BSOD) on millions of impacted systems. The resulting computer outages caused chaos for airlines, banks, emergency operations and other systems we rely on.
And it could have been so much worse.
Further research has shown that the same underlying driver and C++ issues that allowed the bug to take down Windows machines also exist on Linux (and macOS) servers. In other words, it is only through sheer luck that the update was limited to Windows systems. The damage is estimated in the billions, and it would have approached apocalyptic scale had it impacted Linux servers, which account for a far larger share of critical infrastructure.
What happened, exactly?
Based on CrowdStrike’s post-incident report, published July 24, 2024, we now know that this outage was caused by a bug in one of their bespoke test suites, which they refer to as Content Validator. This app is responsible for, unsurprisingly, validating various content updates, such as the ones that triggered the outage, before pushing them out for release. The root cause was threefold:
Bug in Content Validator: A bug in the Content Validator allowed a problematic content update (specifically, a template used to define rapid response data for specific, potentially exploitable system behavior) to pass validation checks, despite the update containing content data that would lead to an out-of-bounds memory read.
Deployment of Problematic Template Instance: The problematic configuration content was deployed on July 19, 2024, as part of a “Rapid Response Content” update. This instance contained content data that, when interpreted by the sensor, triggered an out-of-bounds memory read.
Failure in Error Handling: The unexpected out-of-bounds memory read caused an exception that the CrowdStrike sensor could not handle gracefully, resulting in a Windows system crash (Blue Screen of Death, or BSOD).
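What the report describes is, at bottom, a contract mismatch: the validator enforced one set of assumptions while the sensor relied on a stricter one. Below is a minimal, purely illustrative Python sketch of that failure class; none of the names, checks, or field counts come from CrowdStrike’s code, which is C++ running in a kernel-mode driver.

```python
# Illustrative only: a validator that checks field contents but not field
# count, and a consumer that blindly reads a fixed number of fields.
EXPECTED_FIELDS = 21  # hypothetical: the consumer reads more fields than are shipped

def validate_template(fields: list) -> bool:
    """Checks that every field is a non-empty string, but never checks that
    the field count matches what the consumer will read."""
    return all(isinstance(f, str) and f for f in fields)

def sensor_interpret(fields: list) -> None:
    """Reads EXPECTED_FIELDS entries. In C++ this is an out-of-bounds read;
    in Python it surfaces as an unhandled IndexError."""
    for i in range(EXPECTED_FIELDS):
        _ = fields[i]

template = [f"value_{i}" for i in range(20)]  # the "problematic" content update
assert validate_template(template)            # passes validation anyway
sensor_interpret(template)                    # crashes, analogous to the BSOD
```

The content is “valid” by the validator’s definition, yet the consumer still crashes, which is exactly the kind of gap a validation bug can hide.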
It’s not the ’90s anymore, so how on earth did this happen?
With an outage this widely publicized and impacting so many people, it’s only natural to seek someone to blame, and at a glance, it should be easy enough to do so. Obviously a quality check in Content Validator was missed; a test wasn’t run somewhere. One of those lazy, hapless software engineers forgot to run a test, or decided to skip one, right?
And who would commit such an egregious dereliction of duty? What engineer in their right mind would dare let something like this slip?
I know firsthand that the answer is “almost all of us.”
Because testing still sucks.
A 2023 LambdaTest survey found that 28% of large organizations have test cycles that last longer than an hour. That means for large applications, developers might wait hours or even days to get the feedback they need to do their work, and so they rely on various means of optimization to reduce the number of tests that need to be run. Or they just skip tests outright — especially tests that are known to be non-deterministic, or “flaky.”
Skipping tests has become its own science, complete with its own subgenre of tools. Techniques like Pareto testing, test impact analysis, and predictive test selection have all presented solutions that are truly symptomatic of a deeper problem: that the state of software testing is maddeningly burdensome for engineers.
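To make the idea concrete, here is a minimal sketch of test impact analysis: run only the tests mapped to the files a change actually touches. The module-to-test mapping and file names are hypothetical; real tools derive the mapping from coverage data gathered on earlier runs.

```python
import subprocess

# Hypothetical mapping from source files to the tests that cover them.
TEST_MAP = {
    "validator/parser.py": ["tests/test_parser.py"],
    "validator/rules.py": ["tests/test_rules.py", "tests/test_templates.py"],
}

def changed_files(base: str = "origin/main") -> list:
    """Ask git which files differ from the base branch."""
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def select_tests() -> list:
    """Pick the tests mapped to changed files; if a changed file has no known
    mapping, the blast radius is unknown, so fall back to the full suite."""
    selected = set()
    for path in changed_files():
        if path not in TEST_MAP:
            return ["tests/"]
        selected.update(TEST_MAP[path])
    return sorted(selected) or ["tests/"]

if __name__ == "__main__":
    subprocess.run(["pytest", *select_tests()], check=True)
```

Useful as these techniques are for keeping feedback loops tolerable, they remain workarounds: the test that gets skipped is the one that never gets a chance to catch the next regression.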
Large engineering organizations have trouble enforcing quality standards cross-functionally, limiting the usefulness of accepted code coverage solutions like SonarQube and Codecov, and opening doors for incidents like the CrowdStrike outage. Simply having the scanners and related data is not enough; there must be accountability for setting the right standards and driving adherence to them.
Improve your practices to offset increased developer burden
This incident proves that it’s not always OK to skip the tests we think are “safe” to skip, and that we can’t make a priori judgments about how changes will impact systems. The calculated cutting of corners, which we all do to preserve our productivity, will no longer be acceptable going forward. Judging from history, this outage will be used as an example of the need for wider code coverage and a higher priority placed on tests that cover traditionally low-risk changes. All of that sounds great, unless you’re the engineer who’s already dealing with unbearably large test sets.
So if we’re going to ask more of our developers, again, we need to reduce their cognitive load in other ways. We’ll focus on three areas of process improvement which are adjacent to the delivery of software to production: production readiness assessments, service maturity feedback loops, and continuous monitoring of quality metrics.
Fully automated post-CI production readiness assessments
We know from the 2024 State of Production Readiness report that a staggering 98% of organizations have experienced negative consequences as a result of failing to meet production readiness standards, which is in essence what happened with CrowdStrike.
Software testing provides some, but not all, of the feedback necessary to determine a piece of software’s fitness for production. Content Validator’s code owners and stakeholders would have undergone various readiness assessments each time a new release was ready, including the release that contained the bug which allowed for this outage. Services would be assessed on areas such as test code coverage, the number of open critical issues, the state of certain infrastructure tags, and so on.
These assessments tend to be lengthy and brittle, taking the form of endless Slack channels or Zoom calls, where each stakeholder is effectively asked to provide a yes/no response on whether the parts of the release they are responsible for are ready. The “checklist” used for this assessment is often kept in an inefficient system of record, like a wiki or spreadsheet, making it difficult to align on ever-changing standards.
The solution is to continuously monitor the same endpoints that are typically checked manually. This automates reporting on the same metrics, providing “at-a-glance” readiness for any stakeholder and, where possible, exposing that status to other systems.
In a setup like the one sketched below, readiness metrics are collected and visually represented with red/green status, where red indicates metrics that are below operational readiness standards. Any service with metrics in a “red” status is not ready for deployment; when the standards are met, the report automatically updates. This makes it significantly easier to integrate readiness checks with deployment workflows, obviating the need for manual assessments and freeing engineers to work on higher value tasks.
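Here is that sketch, assuming hypothetical metric names, thresholds, and hard-coded values standing in for calls to whatever scanners and trackers the platform already exposes:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReadinessCheck:
    name: str
    fetch: Callable            # pulls the current metric value
    threshold: float
    higher_is_better: bool = True

    def status(self) -> str:
        value = self.fetch()
        ok = value >= self.threshold if self.higher_is_better else value <= self.threshold
        return "green" if ok else "red"

# Hypothetical metric sources; swap the lambdas for real API calls.
checks = [
    ReadinessCheck("test coverage (%)", lambda: 82.0, threshold=80.0),
    ReadinessCheck("open critical issues", lambda: 3.0, threshold=0.0,
                   higher_is_better=False),
]

def ready_for_deploy(checks: list) -> bool:
    """Print an at-a-glance report and return True only if everything is green."""
    all_green = True
    for check in checks:
        status = check.status()
        print(f"{check.name:24s} {status}")
        all_green = all_green and status == "green"
    return all_green

if __name__ == "__main__":
    # Exit non-zero when any metric is red so a pipeline can gate on it.
    raise SystemExit(0 if ready_for_deploy(checks) else 1)
```

Because the script exits non-zero when anything is red, the same report humans glance at can gate the deployment step directly.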
Collaborative service maturity metric scorecards
Keeping a software service like Content Validator in a state of continuous improvement is harder than it sounds. Not only do developers need to make iterative improvements to the service, but they must also ensure that existing features stay fresh and functional. Engineers tend to automate much of this through various IDE and CI tools, but keeping track of metrics and data across all those tools introduces significant cognitive load.
An excellent and proven technique for driving all kinds of compliance standards, including maturity standards, across teams is a metric scorecard. Metric scorecards observe and accumulate data from various parts of the platform, and automatically evaluate a service’s level based on domain-specific rules.
In the example below, a “Service Maturity” scorecard has been created that assigns Bronze, Silver, and Gold levels to services based on their compliance with various thresholds and metrics. In this case, two rules have been set for a service to achieve “Bronze” status: the service must have at least two service owners associated with it, and it must have a README file in its repository.
The rules continue upwards through Silver and then Gold status, ultimately requiring metrics like an MTTR of less than an hour and no critical vulnerabilities associated with the service.
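A minimal sketch of how such a scorecard might be evaluated follows; the service data is hypothetical and the Silver threshold is an illustrative stand-in, with a real portal pulling these values from its integrations rather than from hard-coded fields.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    owners: list
    has_readme: bool
    mttr_minutes: float
    critical_vulns: int

# Each level lists (rule description, predicate) pairs; a service must satisfy
# every rule at a level, and every level below it, to earn that level.
LEVELS = [
    ("Bronze", [
        ("at least two service owners", lambda s: len(s.owners) >= 2),
        ("README present in repository", lambda s: s.has_readme),
    ]),
    ("Silver", [
        ("MTTR under four hours", lambda s: s.mttr_minutes < 240),  # illustrative
    ]),
    ("Gold", [
        ("MTTR under one hour", lambda s: s.mttr_minutes < 60),
        ("no critical vulnerabilities", lambda s: s.critical_vulns == 0),
    ]),
]

def maturity(service: Service) -> str:
    """Return the highest level whose rules, and all lower levels' rules, pass."""
    earned = "None"
    for level, rules in LEVELS:
        if all(predicate(service) for _, predicate in rules):
            earned = level
        else:
            break
    return earned

svc = Service("content-validator", owners=["alice", "bob"],
              has_readme=True, mttr_minutes=45, critical_vulns=0)
print(f"{svc.name}: {maturity(svc)}")  # -> content-validator: Gold
```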
Ideally, service owners will see these scores as part of their daily workflow, giving them a clear path to service improvement, and total clarity on what work needs to be performed to move services into a mature state.
Tools and systems that have scorecarding capabilities, such as internal developer portals (IDPs), make this workflow integration much easier. Ideally, the portal will already be integrated with the relevant parts of the platform, such as incident management applications and quality scanners, so evaluation of scorecard data is efficient and continuous. Further, the developer homepage component of an internal developer portal is a natural place to provide service maturity feedback, obviating the need for manual approval gates and other sources of friction.
If Content Validator’s service maturity standards were continuously monitored, including areas such as test code coverage and validation accuracy, it’s possible that the introduction of the bug could have been detected and flagged before being released and triggering the outage.
Continuous quality monitoring of test sets
We rely on automated testing to validate the quality of the software we create, but what happens when our test frameworks are inaccurate, as was the case with Content Validator? Additional layers of trust must be built into the system, to ensure that the testing itself is efficacious.
In this case, workflows could be built that would allow the developers of Content Validator to more easily assess the service’s behavior when it is presented with new and incrementally changing fields and data types. Further, these workflows could be executed in multiple environments, such as Windows environments, to trap unexpected behavior and provide feedback to developers.
It’s OK to increase the complexity of the release pipeline on the back end if we make up for it by simplifying interaction on the front end. So, additional quality tools such as software fuzzers could be introduced, and the data from those systems could be easily evaluated by the portal, since it would already be integrated into the same CI/CD pipelines. That data could be scorecarded in a manner similar to the service maturity scorecards above, making it much easier to maintain continuous and sustainable improvements to Content Validator’s accuracy.
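As a sketch of what that kind of check could look like, the snippet below reuses the hypothetical validator/consumer pair from the earlier sketch and enforces one property: anything the validator accepts must be safe for the consumer to interpret. A real setup would use a property-based testing library or a dedicated fuzzer and feed its findings into the same scorecards.

```python
import random
import string

# Hypothetical validator/consumer pair from the earlier sketch, repeated here
# so the snippet runs on its own.
EXPECTED_FIELDS = 21
def validate_template(fields): return all(isinstance(f, str) and f for f in fields)
def sensor_interpret(fields): [fields[i] for i in range(EXPECTED_FIELDS)]

def random_template() -> list:
    """Generate templates with varying field counts and types, including the
    edge cases that happy-path tests tend to miss."""
    pool = ["", "regex=*", 0, None,
            "".join(random.choices(string.printable, k=8))]
    return [random.choice(pool) for _ in range(random.randint(0, 30))]

def check_agreement(iterations: int = 1000) -> None:
    """Fail if the validator ever accepts a template the consumer cannot handle."""
    for _ in range(iterations):
        template = random_template()
        if not validate_template(template):
            continue  # rejected input never reaches the consumer
        try:
            sensor_interpret(template)
        except Exception as exc:
            raise AssertionError(
                "validator accepted a template the consumer cannot handle: "
                f"{template!r}") from exc

check_agreement()  # flags the field-count mismatch long before a release
```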
Bottom line: lower cognitive load leads to better quality overall
Developers are continuously expected by leadership to balance velocity with quality, with little regard for the opaque or even unknown constraints presented by the developer platform. The industry response to the CrowdStrike outage places software testers directly in the crosshairs, but it’s the state of software testing itself that should be indicted.
Instead of blaming developers for cutting corners on quality, let’s take a look at the underlying systems that force developers to take shortcuts in the first place. Let’s give them tools like IDPs to make it easier for them to stay compliant.
By implementing better collaborative tools and processes, we can lower the cognitive load necessary for developers to adhere to ever-deeper quality standards, and reduce the odds of another incident like the one caused by Content Validator.