Q&A: There are some important lessons to learn from the CrowdStrike outage


As the tech world recovers from the global CrowdStrike outage, there are some key takeaways for all of us.

Associate Professor Toby Murray and Dr Suelette Dreyfus, University of Melbourne


Published 26 July 2024

Last week, the now infamous CrowdStrike update caused a massive IT outage, something that’s been called “the largest IT outage in history”.

It left people without access to healthcare and banking, cancelled flights, took broadcasters off air, and forced businesses to close.

CrowdStrike deployed a software update for its product Falcon on Microsoft Windows computers. Picture: Getty Images

The company itself, as well as everyone affected, is still dealing with the fallout, which left millions of Windows servers and PCs across the globe stuck in an endless reboot cycle.

So, what lessons have we learned? We asked Associate Professor Toby Murray and Dr Suelette Dreyfus from the School of Computing and Information Systems for their take.

Q. As more information comes out about the CrowdStrike outage, what do we now know? 

The outage occurred because on 19 July at 4:09am GMT (2:09pm AEST) the company CrowdStrike deployed a software update for its product Falcon on Microsoft Windows computers.  

Falcon is software that runs on computers and monitors for signs of cyberattack. CrowdStrike have said that the update was designed to help Falcon better detect some new threats.

However, the update triggered a bug in Falcon instead. 

For Falcon to closely monitor computers for attacks, it is tightly integrated into the core of Microsoft Windows. Windows is like the brain of the computers on which it runs. If something bad happens to the core of Windows, the whole computer crashes.  

This is exactly what happened here: the bug in Falcon crashed all of Windows because the two are so tightly integrated.

Worse, when computers tried to restart, the bug was re-triggered causing them to crash again, which meant that these computers couldn’t be rebooted.  

CrowdStrike deployed a fix about an hour after the faulty update. But because the computers that had received the faulty update had already crashed, they were unable to download the fix.

This has meant that IT teams had to manually fix affected computers, which is why the recovery process has taken some time. 

Windows servers and PCs across the globe ended up in an endless reboot cycle. Picture: Getty Images

Q. Are there some key lessons here – both for the tech industry and the general public alike?  

Absolutely. This outage was severe because CrowdStrike’s software tends to be used on critical systems: those that can’t really afford to be victims of cyberattacks. This incident highlights how faulty software that we all rely on can cause major problems for our critical computer systems.

Everyone makes mistakes.  

Software vendors like CrowdStrike need to have processes in place to make sure that when mistakes happen, they cannot cause this kind of global disruption.  

That means carefully testing updates before they are deployed, and then only deploying them to a small fraction of computers to make sure there are no problems before they are rolled out more broadly.   

This testing also needs to happen on computers configured in many different ways, because in the real world the way that computer systems are set up can vary.   

These are basic ideas that, for reasons unknown, were not followed in this instance.
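The staged approach described above, where an update goes to a small fraction of computers first and only spreads further if nothing breaks, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CrowdStrike's actual deployment process: the function name `staged_rollout`, the `apply_update` callback (which is assumed to return True when a host is healthy after updating) and the stage fractions are all hypothetical.

```python
import random
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],  # hypothetical: True if host is healthy after the update
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed cumulative fractions of the fleet
    max_failure_rate: float = 0.0,
    seed: int = 0,
) -> dict:
    """Deploy to progressively larger fractions of hosts, halting if a stage fails."""
    order = list(hosts)
    # Shuffle so that early stages tend to include differently configured machines.
    random.Random(seed).shuffle(order)

    updated, failed = [], []
    done = 0
    for fraction in stages:
        target = max(1, round(len(order) * fraction))
        batch = order[done:target]
        if not batch:
            continue
        failures = [host for host in batch if not apply_update(host)]
        updated.extend(batch)
        failed.extend(failures)
        done = target
        # Stop before the next, larger stage if too many hosts in this batch broke.
        if len(failures) / len(batch) > max_failure_rate:
            return {"halted": True, "updated": updated, "failed": failed}
    return {"halted": False, "updated": updated, "failed": failed}


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # Simulate an update that crashes every machine it touches.
    outcome = staged_rollout(fleet, apply_update=lambda host: False)
    print(outcome["halted"], len(outcome["updated"]))
```

In this simulation, an update that crashes every machine is halted after the first stage, so it reaches 10 of the 1,000 hosts rather than all of them.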

From a technical standpoint, cybersecurity tools like Falcon should not be so tightly integrated into the core of critical systems like Windows.

Keeping them slightly separate would make them far more reliable and reduce the chances that faulty updates could cause this kind of damage, even if they are widely deployed. 
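To illustrate the principle of keeping risky code at arm's length, here is a minimal Python sketch in which the code that reads an update runs in a separate process, so that a crash in that code does not bring down the program that called it. This is only an analogy at the level of ordinary programs and is not how Falcon or Windows are built: the parser snippet, the "MALFORMED" marker and the function name `parse_update_isolated` are all hypothetical.

```python
import subprocess
import sys

# Hypothetical update parser. It crashes hard when it meets malformed input,
# standing in for a bug such as an invalid memory access.
PARSER_SNIPPET = """
import os, sys
data = sys.stdin.read()
if "MALFORMED" in data:
    os.abort()
print(len(data.split()))
"""


def parse_update_isolated(update_text: str, timeout_s: float = 10.0):
    """Parse an update in a child process.

    Returns the number of rules found, or None if the parser crashed or hung.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PARSER_SNIPPET],
            input=update_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # The child process died, but the caller is still running and can
        # fall back to the last known-good update.
        return None
    return int(proc.stdout.strip())


if __name__ == "__main__":
    print(parse_update_isolated("rule_a rule_b rule_c"))
    print(parse_update_isolated("MALFORMED"))
```

The design choice here is that a failure is reported to the caller as a value (None) rather than taking the caller down with it, which leaves room to reject the faulty update and keep running.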

This incident has also highlighted the other downsides of technology like CrowdStrike’s Falcon, including potential privacy risks – these systems monitor everything that is happening on the computers they protect.  

Cybersecurity systems should be engineered to protect user privacy in the first place. 

The outage cancelled flights, took broadcasters off air, and forced businesses to close. Picture: Getty Images

Q. What are the next steps for the various players? 

Microsoft has been working with CrowdStrike to help get systems back up and running. For their part, CrowdStrike have issued some apologies alongside the initial fix for the problem.  

The company has released a preliminary report into the causes of the outage and what they will do to make sure it won’t happen again. This makes it clear that CrowdStrike wrongly assumed that their update was harmless.

Unfortunately, while their plan going forward looks sensible and is the sort of thing they should have been doing already, what they are proposing looks to be insufficient.

It doesn’t include decoupling their software from the core of Windows and won’t guarantee that this sort of incident cannot happen again in future.

CrowdStrike is a market leader and we hope that in time they will go further to improve the reliability of their software, including following the sort of measures we mentioned earlier.

Q. Was this a cybersecurity incident given it wasn’t a malicious attack?  

Even though what happened was not the result of a deliberate attack but rather an error, it still has cybersecurity implications.

Attackers study this sort of event to understand how to make possible future attacks more effective.   

In some cases, where an attacker may already be lurking in a company’s systems waiting for an opportunity to access sensitive data, they would be able to jump at the chance while the security software was offline.  

An event like this can provide a serendipitous moment to steal valuable intellectual property or new product strategy plans that would otherwise be better protected.  

CrowdStrike’s software tends to be used on critical systems. Picture: Getty Images

More sophisticated attackers who are in it for the long game now know who runs CrowdStrike to defend their systems. Security software like CrowdStrike tends to run in the background. That is why the public had mostly never heard of the company before this event. 

For example, potential attackers could see which airlines were hit and what divisions suffered outages.   

Attackers want to know what software a target organisation uses. This specific knowledge allows cyber criminals to use and develop customised tools.  

As a result of the many visible service outages, attackers have just learned the identities of a significant number of CrowdStrike’s clients. If an attacker doesn’t have to spend time on reconnaissance to find what specific software their targets use, they can put more resources into the actual attack. 

There is also the risk that frustrated customers may now remove CrowdStrike from their critical systems. This could leave organisations more vulnerable.  

While there was no security breach in the CrowdStrike software itself, the fault in their software created risk to clients’ systems and service outages.  

Making sure that IT services are always available and functioning well is a big part of cybersecurity’s overall goals.

It’s worth noting that it would have been the IT security teams in most organisations who would have been responsible for solving the problems created by the outage. 

Finally, we should expect sophisticated adversaries (like hostile nation states) to have deduced from this incident that if they want to compromise 8.5 million critical Windows computers in (largely) Western organisations, all they need to do is to hack CrowdStrike and push out a malicious update to Falcon. 

Microsoft has been working with CrowdStrike to help get systems back up. Picture: Getty Images

Q. Could this happen again? 

The unfortunate reality is that this kind of thing could certainly happen again.

There is plenty of other software that is just as tightly integrated into Microsoft Windows as CrowdStrike’s Falcon. It is conceivable that this kind of issue could recur, unless software vendors change their practices.

This is why it’s important for vendors who sell critical software that has the potential to cause this kind of havoc to implement processes to make sure their updates are safe.  

Ideally these vendors work with Microsoft to re-architect their software so that it operates at arm’s length from Windows, to prevent bugs in their software from crashing Windows in the first place.  

 

Q. If prevention is better than cure, what should other consumer companies be taking away from this?   

It is difficult for companies who are consumers of technology like CrowdStrike’s to do much to prevent this kind of failure from occurring.  

These kinds of software updates are automatically deployed by software vendors, and it is hard for organisations to turn off these automated updates. Normally that is a good thing: these updates ensure that systems are protected against the latest security threats.  

But it means organisations are reliant on companies like CrowdStrike to make sure their software is reliable, and their updates do not cause problems.  

At best, companies can implement disaster recovery plans to allow them to continue to operate in the face of widespread IT outages like this one.  

This incident has highlighted the need for these kinds of business continuity plans to be in place.

 
