Learning from CrowdStrike’s quality assurance failures
CrowdStrike has released a preliminary Post Incident Review (PIR) of how the flawed Falcon Sensor update made its way to millions of Windows systems and pushed them into a “Blue Screen of Death” loop.
The PIR is confusing to parse because it tries to assure readers that the company carefully and comprehensively tests its products – even though its failures on that front are obvious.
Here is the heart of the issue: CrowdStrike claims to test Sensor Content, which ships with its main software product as it goes through revisions. But they also send updates – Rapid Response Content – on a regular basis.
Template Types – instructions on how to gather data and process it – are included in the Sensor Content, and Template Instances – data instructions processed by the Template Types – come as part of the Rapid Response Content.
If this is confusing, think of it this way: You buy Microsoft Office, a software product akin to CrowdStrike Falcon. Within Office, you have individual applications, such as Word and Excel, each of which knows how to process one kind of data. Those are equivalent to the CrowdStrike Template Types. Lastly, you have data files, like a .doc file that is opened by Word, which represents a Template Instance that is interpreted by a Template Type. In other words, Template Instances are the specific operating orders, and each must conform to the Template Type that processes it.
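To make the relationship concrete, here is a minimal sketch in Python. It is not CrowdStrike's actual data model, which the PIR does not publish; the class names, the field names, and the conforms() check are all assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateType:
    """Hypothetical: ships with the sensor release and declares the expected fields."""
    name: str
    fields: tuple[str, ...]


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical: ships as a separate content update with concrete values."""
    type_name: str
    values: dict[str, str]


def conforms(instance: TemplateInstance, template: TemplateType) -> bool:
    """True when the instance targets this type and supplies exactly its fields."""
    return (
        instance.type_name == template.name
        and set(instance.values) == set(template.fields)
    )


# Illustrative data only; these names do not come from the PIR.
rule_type = TemplateType("process_rule", ("path", "action"))
good = TemplateInstance("process_rule", {"path": "C:/tmp/x.exe", "action": "block"})
bad = TemplateInstance("process_rule", {"path": "C:/tmp/x.exe"})  # missing "action"
```

Under these assumptions, conforms(good, rule_type) is true and conforms(bad, rule_type) is false. The point is that an instance only means something relative to the type that interprets it, so a mismatch between the two is exactly the kind of defect that has to be caught before release.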
CrowdStrike states they rigorously test all Sensor Content, including Template Types, before it is deployed to customers. But Sensor Content is not part of what is dynamically updated.
The failure becomes evident
Here is where we must read between the lines, as CrowdStrike does not want to make its failure clear.
What they don’t test rigorously before sending out are the Template Instances, the low-level data that tells the sensor what to do. Interestingly, they do say these are “validated”, but we already know that validation does not include thorough testing, as is evident from the massive outage.
So, in this game of CrowdStrike Clue, it was the Content Validator, as part of the Content Configuration System, which failed to adequately test the flawed Template Instance that killed the Windows clients on the internet.
It appears that CrowdStrike was okay with the lax testing because they believed their Content Interpreter, residing on the client system as part of their software, would “gracefully handle exceptions from potentially problematic content”.
They were wrong.
So, CrowdStrike has implemented an update architecture that only rigorously tests some of the updates sent to clients. The Template Instances are not thoroughly tested before landing on systems; instead, the company relies on endpoint functions to handle any residual problems.
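The difference between the two layers can be sketched as follows. This is an illustration under stated assumptions, not CrowdStrike's implementation: validate(), interpret(), release(), and the "rules"/"pattern" content shape are all hypothetical names invented for this example.

```python
def validate(content: dict) -> list[str]:
    """Release-side gate (hypothetical): return problems; an empty list means shippable."""
    rules = content.get("rules")
    if not isinstance(rules, list):
        return ["rules must be a list"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict) or "pattern" not in rule:
            problems.append(f"rule {i} has no pattern")
    return problems


def interpret(content: dict) -> int:
    """Endpoint-side fallback (hypothetical): count usable rules, or reject the update."""
    try:
        return sum(1 for rule in content["rules"] if rule["pattern"])
    except (KeyError, TypeError):
        # Last line of defense: drop the whole update rather than crash the host.
        return 0


def release(content: dict, deploy) -> list[str]:
    """Ship only content that passes validation; otherwise return the problems found."""
    problems = validate(content)
    if not problems:
        deploy(content)
    return problems
```

In this sketch the endpoint handler still exists, but it is a backstop. Content that fails validate() is never passed to deploy(), so the endpoint is not the first place where a malformed update gets exercised.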
This is a serious process design failure for their product Quality Assurance.
CrowdStrike should be properly testing every piece of code that is sent to client machines.
Just to show how egregious this is, CrowdStrike allows customers the option of selecting which versions of the Sensor engine will be updated in their systems. They can keep current with the latest release (N) or can delay and keep older N-1 or N-2 versions in place. Many customers want time to conduct internal testing and validation on their platforms before committing to move to a new update.
But that option does not apply to the poorly tested Template Instances, which go out to all clients simultaneously, regardless of whether the customer has chosen to remain on an older version of the Sensor engine.
So, customers who chose to remain on N-2 were still affected.
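The asymmetry can be shown in a few lines. The sketch below is a simplified model of the behavior described above, not CrowdStrike's deployment logic; the release labels, the fleet layout, and both function names are assumptions.

```python
# Hypothetical sensor releases, oldest to newest; the last entry is "N".
SENSOR_RELEASES = ["v1", "v2", "v3"]


def sensor_version_for(offset: int) -> str:
    """Return the sensor release for a policy of N (0), N-1 (1), or N-2 (2)."""
    if not 0 <= offset < len(SENSOR_RELEASES):
        raise ValueError("policy offset out of range")
    return SENSOR_RELEASES[-1 - offset]


def content_recipients(fleet: dict[str, int]) -> list[str]:
    """Rapid Response Content ignores the version policy: every host receives it."""
    return list(fleet)


# Host name -> chosen policy offset (illustrative data only).
fleet = {"finance-01": 0, "factory-07": 2}
```

Here factory-07 stays on the oldest sensor release because its owner chose N-2, yet content_recipients(fleet) still returns both hosts. The version policy protects the engine but not the content that the engine interprets.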
Promises
CrowdStrike has committed to more types of validation testing, with an emphasis on the Content Validator and on the Content Interpreter, which resides on customers’ systems. They also want to detect failures on customer systems faster so they can respond in a timelier manner.
That is the absolute wrong strategy – bad code should never find its way to the clients in the first place!
CrowdStrike should commit to conducting more rigorous testing in their in-house Quality Assurance pre-production environment, or in the Content Configuration System that resides in the cloud, before any update reaches client systems.
Never send potentially bad code to clients.
Scrutiny before the House
CrowdStrike’s CEO George Kurtz has been summoned to appear before the US House Committee on Homeland Security. I hope that by the time he arrives to answer for the outage, he will have improved the plan moving forward and will clearly commit that all updates, including Template Instances, will be thoroughly tested before ever being distributed to customers’ endpoint devices.
Quality Assurance is about keeping bad things away from production environments.
Given the deep access that cybersecurity tools possess and the widespread need for security in our critical digital systems, we must set a precedent for how updates are prepared, tested, and deployed.
The CrowdStrike incident of 2024 will be referenced in the future as a major failure, but we can use it as a catalyst for learning and adapting, to make our digital world more secure, private, safe, and reliable.