CrowdStrike blames buggy testing software for disastrous update
A bug in the Content Validator – a software element CrowdStrike relies on for testing and validating Rapid Response Content updates for its Falcon Sensors – is (partly) why the faulty update wasn’t caught in time, the company said.
Over a period of approximately an hour and 20 minutes on Friday, July 19, 2024, the defective update was delivered to around 8.5 million systems and triggered a massive worldwide outage of Windows-based systems.
CrowdStrike explains what happened
“CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed,” the company explained on Wednesday, in a preliminary post incident review.
Sensor Content is “always part of a sensor release and not dynamically updated from the cloud”.
Rapid Response Content – which “is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine” and comes in a proprietary binary file (and not as code or a kernel driver) – is delivered as content configuration updates to the Falcon sensor and the data is written to the host’s disk.
Sensor Content uses “Template Types” – code that has “pre-defined fields for threat detection engineers to leverage in Rapid Response Content,” CrowdStrike clarified.
“Rapid Response Content is delivered as ‘Template Instances’, which are instantiations of a given Template Type.”
Unfortunately, the problematic content data in the buggy update – i.e., the Template Instance – was not detected by the (also buggy) Content Validator and was deployed in production. There it caused an out-of-bounds memory read that triggered an exception that “could not be gracefully handled” by the host, ultimately resulting in a blue-screen-of-death loop.
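To make the failure mode more concrete, here is a minimal Python sketch of the two safeguards involved: a validation step that rejects an instance whose data does not match what its template expects, and a bounds-checked read that refuses to reach past the end of the data. CrowdStrike has not published the Content Validator's logic or the binary format, so every name here (`TemplateType`, `TemplateInstance`, `validate`, `read_field`) and the idea of comparing field counts are illustrative assumptions, not the company's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateType:
    """Hypothetical stand-in for a Template Type: a named set of pre-defined fields."""
    name: str
    field_names: tuple[str, ...]


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical stand-in for a Template Instance: values for a type's fields."""
    type_name: str
    values: tuple[str, ...]


def validate(instance: TemplateInstance, template: TemplateType) -> list[str]:
    """Return a list of problems; an empty list means the instance passed."""
    problems = []
    if instance.type_name != template.name:
        problems.append(
            f"instance targets {instance.type_name!r}, not {template.name!r}"
        )
    if len(instance.values) != len(template.field_names):
        problems.append(
            f"expected {len(template.field_names)} values, got {len(instance.values)}"
        )
    return problems


def read_field(instance: TemplateInstance, index: int) -> str | None:
    """Bounds-checked read: return None instead of reading past the end."""
    if 0 <= index < len(instance.values):
        return instance.values[index]
    return None
```

In this sketch, an instance that supplies fewer values than its type declares is flagged by `validate` before deployment, and even if it slipped through, `read_field` would return `None` rather than touch memory outside the data. In a kernel-mode component written in a language without automatic bounds checks, neither protection comes for free, which is why both the validator and the interpreter matter.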
CrowdStrike vows to improve update testing and rollout processes
Aside from the bug in the Content Validator, CrowdStrike’s excuse for not catching the defective update is that several Template Instances based on the same (relatively new but stress-tested) Template Type that the defective Instance used had been deployed and had performed as expected in production.
The company has outlined what it intends to do to prevent incidents like these from happening again, and it includes:
- Implementing a variety of testing types for Rapid Response Content
- Adding additional validation checks to the Content Validator for Rapid Response Content
- Improving how the Content Interpreter handles errors
- Implementing a staggered deployment strategy for Rapid Response Content (which will include a canary deployment) and improving monitoring for glitches when the various rollout phases happen
But, equally importantly, it promises to give customers some control over when the Rapid Response Content updates are deployed and to provide release notes for them.
Some of the improvements – such as the staggered rollout of updates – seem like common sense and it’s surprising that this safeguard hasn’t been implemented already. Let’s hope that other cybersecurity / EDR vendors will learn from this incident and will improve their own update delivery processes and protections.
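For readers unfamiliar with the idea, a staggered rollout with a canary phase can be sketched in a few lines of Python. This is a generic illustration, not CrowdStrike's planned design: the function name `staged_rollout`, the phase sizes (1%, 10%, 50%, 100%), and the zero-tolerance failure threshold are all assumptions chosen for the example, and `deploy` stands in for whatever mechanism pushes an update to a host and reports whether it stayed healthy.

```python
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    phase_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to growing slices of `hosts`; halt as soon as a phase is unhealthy.

    Returns the hosts that received the update and whether the rollout completed.
    """
    updated: list[str] = []
    start = 0
    for fraction in phase_fractions:
        # Each phase covers at least one new host; the first, smallest
        # phase acts as the canary.
        end = min(len(hosts), max(start + 1, int(len(hosts) * fraction)))
        batch = hosts[start:end]
        if not batch:
            break
        # deploy() returns True if the host is healthy after the update.
        failures = sum(0 if deploy(host) else 1 for host in batch)
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Halt here: hosts in later phases never receive the update.
            return updated, False
        start = end
    return updated, start == len(hosts)
```

With 100 hosts and an update that crashes every machine, this sketch stops after the single canary host, leaving the other 99 untouched. The trade-off is speed: each phase needs time for monitoring before the next begins, which is in tension with content that is meant to ship "at operational speed".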
What should customers do?
But changes are required on the customers’ side, as well.
“I’ve seen a numerous LinkedIn posts from people at cybersecurity vendors (not CrowdStrike) claiming it is the customer’s fault for not having disaster recovery — which is obviously nonsense, and doesn’t work due to how these tools are assembled,” security researcher Kevin Beaumont noted.
“We the customers, should demand more transparency from our cybersecurity vendors selling us endpoint tooling.”
(That’s not to say that customers should not have business continuity plans in place and be ready for all eventualities.)
The negative impact this faulty update has had on a huge number of organizations and their customers has spurred the US House Committee on Homeland Security to call CrowdStrike CEO George Kurtz in to testify publicly.
In the meantime, both CrowdStrike and Microsoft have offered tools and advice to help organizations speed up the restoration of their affected systems, and have warned them about threat actors taking advantage of the confusion.