CrowdStrike: What the 2024 outage reveals about security
The CrowdStrike incident earlier this year had major implications for governments and businesses across the world. Here, we look at what it tells us about the security and robustness of the modern internet.
- Updates, including auto-updates, are a good thing. Don't turn them off.
- More transparency and openness about security vulnerabilities and patches can increase people's trust and the robustness of system architecture.
- Consolidation of power in Big Tech companies creates over-reliance on single points of failure. Diversity creates strength: the internet should not be confined to narrow pathways and walled gardens.
What happened?
On 19 July 2024, American cybersecurity company CrowdStrike released an update to its CrowdStrike Falcon software that ultimately caused 8.5 million computers running Microsoft Windows to crash. The damage done was both deep and wide: deep because the computers affected were unable to recover without direct user intervention, and wide because a whole range of companies - from airlines to healthcare to media - across a whole range of countries - from Sweden to India to New Zealand - were unable to operate properly.
There's no doubt that the cause of the problem was a malformed update for CrowdStrike Falcon rather than hackers, a cyberattack or another security breach. However, the event is instructive for how we understand the privacy and security of the devices that we depend on every day, and the effect that a large-scale outage has on modern life.
Damage to the kernel
The CrowdStrike software that caused the problem, the Falcon sensor, identifies and blocks hacking attempts. The sensor requires privileged access to the Windows operating system because it functions within the operating system's kernel - a layer of programming that sits between the computer's hardware and userspace, where applications run. That means that the Falcon sensor has access to the very core of computers running the software: it can interact with them in a way that bypasses many of Windows' built-in security protections.
What is 'the kernel'?
The kernel is a special part of the core of the operating system that manages communication between software and hardware. It has access to key physical components of a device, such as its processors and memory.
This direct access, right to the core of the operating system, is often required for cybersecurity software to function. But the events of 19 July 2024 clearly demonstrate the risks of anti-virus and cybersecurity tools having nearly unfettered access to these most sensitive areas of a device's software stack. With an apparently routine update, CrowdStrike took down government and business activities across the world. This time it was an error, but what if a malicious actor had got access to CrowdStrike's update services?
It seems unlikely that many owners of the personal devices affected by the outage would have known that a company they probably hadn't heard of could make changes to the very heart of their devices without notice or warning. How would people know - or be expected to know - if a sophisticated attacker were able to make changes to the kernel that took down the whole device, or perhaps worse, that gave them access to someone's files, communications, and camera?
Auto-updates
Predictably, and sadly, some people have responded to this by clamouring against auto-updates. But this is the wrong conclusion to draw. Security updates - including auto-updates - are incredibly important to keep our devices running properly and safely. What is needed is for auto-updates to be properly tested before being implemented. This is especially true for companies that have access to the kernel (or other sensitive parts of a device).
"Out-of-date devices can become privacy and security liabilities, as well as tools of exclusion. Unpatched software leaves people vulnerable to hackers and cyber-attacks, often depriving them of critical services and resulting in significant financial losses and emotional distress. Consumers’ digital data is also at risk."
For most regular users, auto-updates for security patches are best practice. There should be no barriers to timely fixes in security. That's because people have put their trust in companies to provide safe, secure and ongoing services. It should be as easy as possible for everyday users to be as safe as possible.
For large companies and service providers with thousands or millions of users, the situation may be more complex. A more bespoke process of review and oversight may be required. But in any case, we believe that both CrowdStrike and Microsoft could have had a better system in place to allow for the Falcon sensor auto-update to have been rolled out without causing such enormous disruption.
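To make the idea of "a better system" concrete, here is a minimal sketch of a staged rollout, in which an update reaches a small share of devices first and stops spreading if too many of them fail. It is an illustration only, not a description of how CrowdStrike or Microsoft actually deploy updates: the function name, the stage sizes and the failure threshold are all assumptions made for this example.

```python
from typing import Callable, Sequence


def staged_rollout(
    devices: Sequence[str],
    install: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),  # hypothetical stages
    max_failure_rate: float = 0.01,  # hypothetical threshold
) -> dict:
    """Install an update in growing stages, halting if too many devices fail.

    `install(device)` is a stand-in supplied by the caller; it should return
    True if the device is still healthy after the update.
    """
    updated = 0
    failures = 0
    for fraction in stage_fractions:
        # Cumulative number of devices that should be updated after this stage.
        target = max(1, int(len(devices) * fraction)) if devices else 0
        batch = devices[updated:target]
        batch_failures = sum(1 for device in batch if not install(device))
        updated += len(batch)
        failures += batch_failures
        # Stop before the next, larger stage if this one went badly.
        if batch and batch_failures / len(batch) > max_failure_rate:
            return {"halted": True, "updated": updated, "failures": failures}
    return {"halted": False, "updated": updated, "failures": failures}


if __name__ == "__main__":
    fleet = [f"pc-{i}" for i in range(1000)]
    # Simulate an update that crashes every device it touches.
    print(staged_rollout(fleet, install=lambda device: False))
```

With the simulated faulty update above, the rollout halts after the first 1% stage, so 10 of the 1,000 hypothetical devices are affected rather than all of them.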
Automatically applying security updates is another reason for making a clear distinction between security updates and feature updates. The latter improve (or at least change) software functionality rather than fix dangerous problems and should not normally be applied automatically. Too often we see companies bundling together security and feature updates, meaning that users cannot install one without the other. That's a problem, especially if a weaker system for testing feature updates pollutes the process for security updates, or if users are prevented from having the latest security updates installed because they don't want the feature updates or their device does not support them.
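The distinction between the two kinds of update can be sketched in a few lines. This is a hypothetical update policy, not any vendor's real mechanism: the `Update` type, the `UpdateKind` labels and the `plan_updates` function are names invented for this example. Security updates are installed automatically, while feature updates wait for the user's consent.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateKind(Enum):
    SECURITY = "security"
    FEATURE = "feature"


@dataclass(frozen=True)
class Update:
    name: str
    kind: UpdateKind


def plan_updates(pending, user_approved_features=frozenset()):
    """Split pending updates into those to install now and those awaiting consent.

    Hypothetical policy: security updates are always installed; feature
    updates are installed only if the user has approved them by name.
    """
    install_now, needs_consent = [], []
    for update in pending:
        if update.kind is UpdateKind.SECURITY or update.name in user_approved_features:
            install_now.append(update)
        else:
            needs_consent.append(update)
    return install_now, needs_consent


if __name__ == "__main__":
    pending = [
        Update("patch-1", UpdateKind.SECURITY),
        Update("new-ui", UpdateKind.FEATURE),
    ]
    now, later = plan_updates(pending)
    print([u.name for u in now], [u.name for u in later])
```

A policy like this only works if updates are shipped separately in the first place: once a security fix and a feature change are bundled into one package, there is nothing left to split.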
Notification
Earlier this year, the EU passed the Cyber Resilience Act. This law introduces new requirements for manufacturers of digital products to notify ENISA (the EU Agency for Cybersecurity) about “any actively exploited vulnerability” or “any severe incident having an impact on the security of the product”, as well as information about any security updates that can protect against the vulnerability. Security updates must be disseminated without delay or charge. The Act also establishes a single reporting platform in order to make reporting as easy as possible for manufacturers.
These changes are positive - reporting vulnerabilities should be simple, straightforward and cost-free for manufacturers. Portals that are easy to identify and use, like the reporting button on the website of Ireland's National Cyber Security Centre, can help with this.
At PI, we are interested in learning more about how the CrowdStrike incident was reported differently in different jurisdictions around the world (and in fact if it was reported at all). We are also interested to see what difference the Cyber Resilience Act makes in the future.
Corporate power
The range of companies and government bodies affected by the outage is quite staggering. Just some examples include:
Transport
- Australian airline Qantas
- Delta Air Lines
- Public transport tickets in Sweden
- Taiwan's Taoyuan International Airport
Infrastructure and government
- Disruptions to 911 emergency lines in at least 12 US states
- Hospital admissions in Belgium
- Singapore postal system
- The Philippines House of Representatives
Finance
- Paraguayan banks Ueno and Continental
- Most UK banks experienced some disruption
Retail
- Some Amazon services, including internal email, warehouses and AWS health dashboard
- UK supermarket Waitrose
Digital and news
- Video screens in New York's Times Square
- Sky News UK
Additional examples can be found on a Guardian liveblog.
This goes to show just how dependent our global society and economy are on the proper functioning of large tech companies like Microsoft. And while Microsoft may try to argue that the answer is in restricting other companies from being able to access the Windows kernel, we don't believe that further consolidation of power in Big Tech companies is the way to go.
Excessive concentration of power in a small number of big companies creates gatekeepers for accessing online and digital services. It gives too much control over our lives and our communications to companies like Google and Microsoft. Having so much public and quasi-public infrastructure run on the servers and code of large corporate entities means that we are dependent on them to get around, to buy food, and to run our businesses. The small number of players makes us vulnerable to single points of failure, to having those services withdrawn, downgraded or disrupted, and even to monopolistic pricing practices. When Amazon Web Services goes down, it takes a big chunk of the internet with it. How would we cope if CrowdStrike went on strike?
This power imbalance has also led to highly exploitative practices of data collection and centralisation. A vicious cycle is at play: because of their dominance, these companies collect and analyse vast amounts of data. The more data they collect, the better they become at profiling individuals and offering these profiles to advertisers, political parties, and others, as well as using those profiles to improve the attractiveness of their own services. And the more people are drawn into these services, the less any individual user has the power to opt out of the corporate data exploitation model because no equivalent service exists.
Privacy International has produced a number of guides that help guard against this data manipulation by protecting you from online tracking.
What if something like this was done intentionally?
This incident was not a cyberattack. But it does demonstrate how bad actors could seek to manipulate companies and systems for their own ends.
There will always be bad actors looking to exploit opportunities to get hold of data illegally or through force. On the same day that the faulty update went out, CrowdStrike had already identified nearly 30 domains with names similar to its own that were being used, or could be used, for malicious purposes. Scammers will look for opportunities to get into people's devices and wallets off the back of failures like these (e.g. via phishing emails or fake offers to fix the problem).
Cybersecurity is a common good. We'd like to see security researchers being able to test products and services: open and transparent security research identifies what defences are needed and challenges information asymmetry. And where possible, open-source practices should be encouraged to allow consumers to maintain devices themselves (though not at the expense of commercial support).
Who suffers
Finally though, let's not forget who the actual victims are when critical IT systems fail. It's the people who couldn't travel to see loved ones, the sick people who couldn't get into hospital and the low-paid zero-hours contract workers who were sent home without pay. It's also a bad time for those IT staff tasked with travelling around visiting each affected computer to manually reboot: a monotonous but delicate task that might see them away from friends and family for days.
Digitisation of services like ID systems, public services and payment infrastructure is all well and good while it works. However, technological architecture is vulnerable to large systemic failures that can bring things down in an instant. Over-reliance on digital systems, while overlooking more robust systems that can work locally, manually and flexibly, is risky.