When Bad Code Crashes a Billion Windows Computers 🚨

Arjun Vijay Prakash - Jul 21 - Dev Community

Recently, a problematic update to CrowdStrike Falcon, a well-regarded cybersecurity product, took down Windows PCs at many organizations. It was not an attack, just bad code.

This incident caused widespread issues, including business interruptions, delayed flights, and disrupted news broadcasts.

This blog aims to examine the technical aspects of what happened, how it impacted systems, and the measures taken to resolve the issue. Let's get into it:


Where and When did it start?

Troy Hunt, the creator of Have I Been Pwned, first brought attention to the issue on Friday, July 19, 2024.

Imagine waking up on deployment day and seeing the screen full of this shade of #0664e4. I don't really care if it has a name.

Initially, there were concerns about a potential cyberattack from a hacker, but it became clear that the problem came from a faulty update issued by CrowdStrike.

This update resulted in numerous computers experiencing the Blue Screen of Death (BSOD), jokingly turning the day into the International Day of BSOD.


CrowdStrike Falcon: A Technical Overview

Image
Yes, CrowdStrike protects 298 of the Fortune 500 companies!

Let's take a look at Falcon now:

CrowdStrike Falcon is a sophisticated Endpoint Detection and Response (EDR) solution designed to protect enterprise systems from cybersecurity threats.

Unlike traditional antivirus software that operates primarily at the user level, Falcon integrates deeply with the operating system, especially Windows, using kernel-mode drivers to monitor and intercept potential threats at a very low level.

Image
Just mentioning it for the sake of the "statistics."


The Role of Kernel-Mode Drivers

Kernel-mode drivers operate at a privileged level within the operating system, providing them with direct access to hardware and system resources.

This allows them to perform critical tasks efficiently. However, any issues with these drivers can lead to severe system instability, as they interact closely with the core components of the operating system.

And this close coupling with the kernel is why one bad update turned into worldwide drama.


The Faulty Update

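The problematic update from CrowdStrike Falcon shipped a faulty content file (a "channel file" matching C-00000291*.sys, which is not itself a kernel driver despite the .sys extension). Early reports, like the one linked below, described it as filled with zeroes instead of valid data. CrowdStrike later said the crash was not caused by null bytes but by a logic error triggered when the sensor read the file.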

https://x.com/hackerfantastic/status/1814315027911843998

When Falcon's kernel-mode driver processed this file, the system crashed immediately, leading to the BSOD.

This error multiplied across many systems due to the widespread use of CrowdStrike Falcon in business environments.
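
To make the idea concrete, here is a minimal sketch of the kind of sanity check that could reject an obviously malformed content file before privileged code ever parses it. Everything here is an assumption for illustration: the `MAGIC` signature, the minimum size, and the `validate_content_file` helper are hypothetical and do not reflect CrowdStrike's real file format or code.

```python
from dataclasses import dataclass

# Hypothetical 4-byte signature for an illustrative file format.
# This is NOT CrowdStrike's real channel-file layout.
MAGIC = b"CHNL"
MIN_SIZE = len(MAGIC) + 1  # signature plus at least one byte of payload


@dataclass
class ValidationResult:
    ok: bool
    reason: str


def validate_content_file(data: bytes) -> ValidationResult:
    """Reject obviously malformed content before privileged code parses it."""
    if len(data) < MIN_SIZE:
        return ValidationResult(False, "file too short")
    if not any(data):
        # any() over bytes is False only when every byte is zero.
        return ValidationResult(False, "file is all zeroes")
    if not data.startswith(MAGIC):
        return ValidationResult(False, "bad signature")
    return ValidationResult(True, "ok")
```

A check like this only catches gross corruption. It would not catch a well-formed file that triggers a logic bug, which is why it complements testing and staged rollouts rather than replacing them.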


Widespread Impact

Image

The faulty update had far-reaching consequences:

  • Business Operations: Many businesses experienced interruptions, leading to cancelled meetings and halted workflows.

  • News Broadcasts: News networks faced significant disruptions in their broadcasting capabilities.

  • Flight Operations: Airports encountered delays as critical systems used for managing flights were rendered inoperative.

  • Retail Operations: Stores relying on computer systems for sales and inventory management faced operational challenges.

Globally, 5,078 flights (4.6% of those scheduled that day) were cancelled.

Everywhere, this blue screen was found.

But somewhere a red screen was also found:
Image

CrowdStrike's stock tanked by ~20% this month.

Response from Key Figures

Image

George Kurtz, CEO of CrowdStrike, addressed the issue, emphasizing their efforts to rectify the situation. However, from my point of view (of course), his initial communication lacked a direct apology, which some interpreted as a failure to acknowledge the severity of the problem.

Later, in an interview, he did apologize:
Image

Image

In contrast, Satya Nadella, CEO of Microsoft, provided a clear and concise statement, reassuring users that Microsoft was working closely with CrowdStrike to resolve this issue.


Technical Resolution Steps

Image

https://en.wikipedia.org/wiki/2024_CrowdStrike_incident#Remedy
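
The widely published workaround was manual: boot the machine into Safe Mode or the Windows Recovery Environment, go to the `%WINDIR%\System32\drivers\CrowdStrike` directory, delete the file matching `C-00000291*.sys`, and reboot. The sketch below only illustrates that matching-and-deleting logic. The function names are hypothetical, and on a real affected machine the step was done by hand or with a recovery-console command, since Python would not be available there.

```python
from pathlib import Path

# File-name pattern from the publicly documented workaround.
BAD_FILE_PATTERN = "C-00000291*.sys"


def find_faulty_files(driver_dir: Path) -> list[Path]:
    """Return files in driver_dir whose names match the faulty pattern."""
    return sorted(driver_dir.glob(BAD_FILE_PATTERN))


def remove_faulty_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Delete matching files unless dry_run is set; return what matched."""
    matches = find_faulty_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"))` with the default `dry_run=True` only lists what would be deleted, which is a sensible default for anything that touches a drivers directory.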

Lessons Learned and Future Precautions

This incident underscores the importance of rigorous testing for software updates, particularly those involving kernel-mode drivers.

It also highlights the need for clear and empathetic communication from companies when issues arise.

Moving forward, companies can adopt several best practices:

  • Comprehensive Testing: Implement thorough pre-release testing procedures to identify potential issues.

  • Incremental Rollouts: Deploy updates gradually to monitor for issues before widespread distribution.

  • Clear Communication: Provide transparent and empathetic communication to affected users, including detailed steps for resolution.
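
The incremental-rollout practice above can be sketched in a few lines. This is a simplified illustration, not how any particular vendor deploys updates: `staged_rollout`, the stage fractions, and the `deploy` callback (which returns `True` if the host is still healthy after the update) are all assumptions made for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[bool, list[str]]:
    """Deploy in growing waves; halt as soon as a wave fails too often.

    Returns (completed, hosts_that_received_the_update).
    """
    updated: list[str] = []
    done = 0  # number of hosts already covered by earlier waves
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be covered after this wave.
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        if not wave:
            continue
        failures = 0
        for host in wave:
            healthy = deploy(host)
            updated.append(host)
            if not healthy:
                failures += 1
        done = target
        if failures / len(wave) > max_failure_rate:
            # Stop here so later waves never receive the bad update.
            return False, updated
    return True, updated
```

With 100 hosts and an update that breaks every machine, the first wave touches a single host and the rollout stops there, instead of the update reaching all 100 at once.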

Clear communication, the third practice, is the most important. When a significant issue occurs, it's essential for company leaders to publicly apologize.

While an apology alone cannot undo the damage, it shows accountability and a commitment to addressing the problem. Beyond apologizing, they should work towards resolving the issue by engaging with customers through technical meetings.


Conclusion

In conclusion, this situation shows how important it is for companies to test their updates carefully before releasing them.

Mistakes can have huge impacts, as we’ve seen with the problems caused by the bad update.

Companies should fix problems clearly and honestly, and always be ready to help their customers through the mess.

As mentioned earlier, even though they can’t undo what’s happened, they should work hard to make things better and keep everyone informed.

Share your thoughts on this outage drama in the comments.

Connect with me @ Linktree. Follow me on @ Twitter.

Happy Coding! Thanks for 26498!
