CrowdStrike causes the largest IT outage in history, massive questions about testing regime

Yesterday was one of the craziest days in IT. If you know an IT administrator, give them a hug; they probably need one.

CrowdStrike is an American cybersecurity company that provides endpoint security software, used by more than 20,000 companies and installed on millions of PCs.

Yesterday the company released an update to its security software that included a bug causing a virtually instantaneous Blue Screen of Death (BSOD) on Windows PCs. Once machines were in this state, they would reboot and the user would again see a BSOD.

At around 4PM Australian time on Friday 19 July 2024, the issue was initially reported as a Microsoft outage; after further investigation, it was understood to be caused by CrowdStrike.

The world has never seen an impact of this scale before, and thankfully it turned out not to be malicious. The outages caused by CrowdStrike included:

Travel chaos: Thousands of flights were cancelled or delayed worldwide, causing significant disruptions for travellers and airlines during the peak summer season.

Business interruptions: Many businesses and government agencies experienced operational disruptions, affecting productivity and service delivery. This included banks, hospitals, emergency services, and media companies.

Economic impact: Some businesses were forced to close, losing revenue, while others lost productivity as employees were unable to work. The outage undoubtedly caused financial losses for affected businesses, particularly in the travel and service sectors. CrowdStrike itself lost billions of dollars of market cap overnight as a result of the issue.

Public inconvenience: Many individuals were inconvenienced by the disruption of essential services, such as online banking, hospital systems, and emergency communication channels.

I covered the issue on X for more than five hours yesterday, and this post from Sydney Airport, with its flight information displays all showing BSODs, has now accumulated more than 840,000 views.

So what happened?

Zach Vorhies (@Perpetualmaniac) does a great job of detailing the root cause of the issue. CrowdStrike’s code in the new update attempted to read from an invalid memory address (0x9c), so Windows immediately terminated the offending code, which in turn took down the whole operating system.
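
To make the failure mode concrete, here is a minimal user-mode C++ sketch, purely illustrative and not CrowdStrike’s actual code, of what reading from an unmapped low address like the 0x9c cited above does. In a normal application Windows terminates just that process; the same fault inside kernel-mode driver code takes the whole operating system down with it.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // 0x9c mirrors the address cited in the analysis above; for an ordinary
    // process this address is unmapped, so reading it triggers an access violation.
    const auto* bad = reinterpret_cast<const std::uint8_t*>(0x9c);

    std::cout << "Attempting to read from 0x9c...\n";
    std::uint8_t value = *bad;  // access violation: Windows kills this process here
    std::cout << "Never reached: " << static_cast<int>(value) << "\n";
    return 0;
}
```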

It’s important to remember that installing security software establishes a trust relationship with the vendor. Because viruses and exploits will attempt to change the low-level system files that Windows requires to run, we have to trust a security product enough to give it privileged access to those files so it can scan them for malicious activity.

This access is provided to protect the machine, and the ongoing task of protecting it requires frequent updates to respond to new exploits and vulnerabilities found in the wild. This means the update process that caused yesterday’s issue is completely necessary and legitimate, and is used by many security products (including Microsoft’s own Defender).

What we expect from security vendors that have this high-level access to critical files (and access to memory) is that they thoroughly test their code before it gets anywhere near a customer machine.

When programmers write code that addresses memory, they have a responsibility to check for null values. When the code made its invalid attempt to read that region of the computer’s memory, the fault occurred inside a system driver, the privileged layer that allows hardware and software to talk to one another.

Given drivers have privileged access to the computer, the operating system was forced to crash immediately, resulting in the end-user symptom of a BSOD. Generally computers can recover with a simple reboot, but that wasn’t the case here.
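
The defensive pattern being described is simply validating a pointer before dereferencing it. A minimal sketch, with illustrative names rather than anything from the actual driver:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative stand-in for whatever record the update code needed to parse.
struct ChannelRecord {
    std::uint32_t length;
};

// Checks the pointer before touching it. In kernel code, a check like this is
// the difference between handling an error gracefully and crashing the machine.
std::uint32_t safeLength(const ChannelRecord* record) {
    if (record == nullptr) {
        std::cerr << "Refusing to dereference a null record pointer\n";
        return 0;
    }
    return record->length;
}

int main() {
    ChannelRecord ok{42};
    std::cout << safeLength(&ok) << "\n";      // prints 42
    std::cout << safeLength(nullptr) << "\n";  // prints 0 instead of crashing
    return 0;
}
```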

Crowdstrike Analysis:

It was a NULL pointer from the memory unsafe C++ language.

Since I am a professional C++ programmer, let me decode this stack trace dump for you. pic.twitter.com/uUkXB2A8rm

— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024

Why is this bad?

This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.

That is what you see here with this stack dump.

— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024

This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it’s because it’s a crash in the system drivers.

— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024

How is this resolved?

Having broken much of the world, CrowdStrike eventually issued a public statement (available here), hours after the finger was firmly pointed at them. George Kurtz, the CEO of CrowdStrike, is now on an apology tour, which will do little to mitigate the global outrage at the company.

Companies that use CrowdStrike and had the update propagate to their devices needed to perform the following steps to remove it.

The fix seems easy enough, with just four steps to work around it, but the reality was very different.

Step 1 was to boot Windows into Safe Mode. At home you’ll have a decent chance of doing this, but enterprise-deployed devices generally won’t allow regular users to do so.

Most business machines (and hopefully personal machines) use BitLocker drive encryption, so that if a machine is lost or stolen the data on the drive can’t be read without a credential or the BitLocker recovery key generated when the drive was encrypted. When IT admins deploy BitLocker, they are responsible for storing this key; while they may have access to it, end users do not.

The other massive factor here is a logistical one. Entering Safe Mode almost always requires physical access to the machine, so those admins (or third-party IT service providers) had to get to each machine, each protected by its own 48-digit BitLocker recovery key, just to boot into Safe Mode.

Steps 2-4 are simple once you overcome the first step.
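
For reference, the widely reported workaround boiled down to deleting the offending channel file from the CrowdStrike driver folder once you were in Safe Mode. The sketch below shows that logic in C++ purely as an illustration; the folder path and the C-00000291 file prefix come from CrowdStrike's published guidance as I understand it, and in practice admins did this by hand from a Safe Mode command prompt rather than running a program.

```cpp
#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main() {
    // Folder and file prefix as described in the published workaround; treat
    // both as assumptions here rather than verified values.
    const fs::path driverDir = "C:/Windows/System32/drivers/CrowdStrike";
    const std::string prefix = "C-00000291";

    std::error_code ec;
    for (const auto& entry : fs::directory_iterator(driverDir, ec)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(prefix, 0) == 0 && entry.path().extension() == ".sys") {
            std::cout << "Deleting " << entry.path() << "\n";
            fs::remove(entry.path(), ec);  // needs admin rights; errors land in ec
        }
    }
    return 0;
}
```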

The above workaround would get a machine back up and running, and once the cause and workaround were identified, system recovery began.

While this hit on Friday afternoon for Australia, plenty of machines were still waking up and receiving the bad update as the time zones rolled around the world.

CrowdStrike identified the issue and pulled the update from deployment to stop the bleeding. Their next task was to ship a corrected update that addressed what they were trying to fix with the first one. This took place, and any machines subsequently getting updates from CrowdStrike would not be impacted.

What an insane, wild ride that was.

How is this prevented in the future?

As I said before, this access is required by security products to keep your machine safe, and deploying security products is common practice in the enterprise to guard against attacks from motivated actors.

This means simply uninstalling CrowdStrike, or banning them from updates is not the solution.

The fix going forward will certainly be far more rigorous testing regimes at CrowdStrike (and other AV vendors). Not only was the bad code written, but we have to assume it passed whatever automated testing and validation existed, and that was enough to get it out the door.

Software of this importance, with this level of access, really needs to go through phased rollouts: small groups first, then, given success and an acceptable amount of issues and feedback (hopefully zero), progression to the next stage of rollout.

As Microsoft does with its Windows Insider release rings, this would allow an issue like this to be found and addressed while it’s only on a small number of machines, and we could have avoided anything like what we saw yesterday.
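
As a rough illustration of how ring-based gating works (the names and percentages here are mine, not Microsoft’s or CrowdStrike’s): each device hashes to a stable bucket, and an update is only offered once the current rollout stage covers that bucket.

```cpp
#include <functional>
#include <initializer_list>
#include <iostream>
#include <string>

// Returns true if a device falls inside the current rollout percentage.
// Hashing keeps the assignment stable, so a device stays in the same ring
// across stages instead of being re-rolled every time.
bool updateEnabledFor(const std::string& deviceId, int rolloutPercent) {
    const std::size_t bucket = std::hash<std::string>{}(deviceId) % 100;
    return static_cast<int>(bucket) < rolloutPercent;
}

int main() {
    // Hypothetical stages: 1% canary, 10% early ring, then everyone.
    for (int stage : {1, 10, 100}) {
        std::cout << "At " << stage << "% rollout, device-42 gets the update: "
                  << std::boolalpha << updateEnabledFor("device-42", stage) << "\n";
    }
    return 0;
}
```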

Microsoft should also consider what it allows on Windows. While constantly backing up the machine for a rollback point isn’t always practical, a machine should be able to roll back a defective driver with a reboot when it carries a bug like this.

Let us know in the comments how you were impacted by #CrowdStrike.
