CrowdStrike engages external experts, details causes of massive outage

CrowdStrike has published a technical root cause analysis of what went wrong when a content update pushed to its Falcon sensors borked over 8.5 million Windows machines around the world on July 19, and has confirmed that it has hired two unnamed third-party software security vendors to review the security and quality assurance of the Falcon sensor code.


CrowdStrike goes into detail

Expanding on its preliminary post-incident review, the company went into more detail about why the faulty Rapid Response Content – delivered as content configuration updates – was not caught before it did damage.

“Rapid Response Content is used to gather telemetry, identify indicators of adversary behavior, and augment novel detections and preventions on the sensor without requiring sensor code changes,” the company explained.

“Rapid Response Content is delivered through Channel Files and interpreted by the sensor’s Content Interpreter, using a regular-expression based engine. Each Rapid Response Content channel file is associated with a specific Template Type built into a sensor release. The Template Type provides the Content Interpreter with activity data and graph context to be matched against the Rapid Response Content.”

The disastrous update was a Template Instance based on a relatively new Template Type, and was delivered via Channel File 291.

But while the Template Type defined 21 input parameter fields, “the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against.”

On July 19, a new version of Channel File 291 was pushed to Falcon sensors, specifying a comparison against the 21st input value. “The Content Interpreter expected only 20 values. Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash,” the company says.
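To make the mismatch concrete, here is a minimal Python sketch of the failure mode and of a bounds check that would turn the crash into a rejected rule. It is not CrowdStrike's code: the names `Criterion`, `InputMismatchError` and `evaluate` are hypothetical, and only the 21-versus-20 field counts come from the company's description.

```python
import re
from dataclasses import dataclass

DECLARED_FIELDS = 21  # fields the Template Type defines (per the analysis)
SUPPLIED_VALUES = 20  # values the integration code actually supplied


@dataclass
class Criterion:
    """Hypothetical matching rule: one input field compared against a regex."""
    field_index: int  # zero-based index into the input values
    pattern: str


class InputMismatchError(Exception):
    """Raised instead of reading past the end of the input array."""


def evaluate(criteria, inputs):
    """Return True if every criterion matches, checking bounds first."""
    for c in criteria:
        if not 0 <= c.field_index < len(inputs):
            raise InputMismatchError(
                f"criterion references field {c.field_index + 1}, "
                f"but only {len(inputs)} values were supplied"
            )
        if re.fullmatch(c.pattern, inputs[c.field_index]) is None:
            return False
    return True


inputs = [f"value{i}" for i in range(SUPPLIED_VALUES)]
older_rule = [Criterion(0, r"value0")]              # stays within the 20 values
newer_rule = [Criterion(DECLARED_FIELDS - 1, r".+")]  # targets the 21st value

print(evaluate(older_rule, inputs))
try:
    evaluate(newer_rule, inputs)
except InputMismatchError as err:
    print("rejected:", err)
```

In a memory-unsafe language the unchecked equivalent of `inputs[c.field_index]` reads whatever lies beyond the array, which is the out-of-bounds read the company describes; the explicit check above simply refuses the rule instead.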

The mismatch between the inputs was just one of the factors that ultimately led to the massive outage. The others were the lack of specific testing that would have caught the mismatch, an out-of-bounds read issue in the Content Interpreter, and the fact that the company pushed the update to all sensors at once.

Also, as security researcher Kevin Beaumont pointed out, “channel updates weren’t tested on a real Windows PC prior to deployment, they relied on automated bespoke code testing.”

The company has outlined the steps already taken (e.g., the ability for customers to choose where and when Rapid Response Content updates are deployed) and those it plans to implement (e.g., the deployment of content updates in several stages) to prevent such an incident from happening again.
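Deploying content updates in several stages generally means updating a small share of machines first and widening the rollout only if they stay healthy. The Python sketch below illustrates that idea under stated assumptions: the function `staged_rollout`, the percentage stages and the failure threshold are all hypothetical and do not describe CrowdStrike's actual process.

```python
def staged_rollout(hosts, stage_percents, deploy, max_failure_rate=0.01):
    """Deploy to growing shares of hosts; halt if a stage fails too often.

    hosts          -- list of host identifiers
    stage_percents -- cumulative integer percentages, e.g. [1, 10, 100]
    deploy         -- callable(host) -> bool, True if the host stayed healthy
    Returns (completed, updated_hosts).
    """
    updated = []
    done = 0
    for percent in stage_percents:
        # Cumulative number of hosts to reach in this stage (rounded up).
        target = min(len(hosts), (len(hosts) * percent + 99) // 100)
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        updated.extend(batch)
        done = target
        if failures / len(batch) > max_failure_rate:
            return False, updated  # stop before widening the rollout
    return True, updated


fleet = list(range(1000))
completed, touched = staged_rollout(fleet, [1, 10, 100], lambda host: False)
print(completed, len(touched))
```

With an update that breaks every host it reaches, this sketch stops after the first stage, so only the initial 1% of the hypothetical fleet is affected rather than all of it.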

On the topic of security sensors needing to leverage kernel drivers, CrowdStrike says that as new versions of Windows add support for performing more security functions in user space, it updates its agent to take advantage of that support and will continue to do so.

(Other endpoint security companies have laid out their software/update release processes and quality assurance practices since the outage, as well as how they use kernel drivers.)

The effects of the outage

The effects of the outage have been felt by CrowdStrike, its customers and, consequently, those organizations’ customers/users.

The price of CrowdStrike shares has fallen considerably since July 19, and the company is getting sued by its shareholders.

Delta Air Lines is looking into suing both CrowdStrike and Microsoft, in hopes of recouping some of the massive losses it experienced because of the outage and (potentially) getting regulators and the US Department of Transportation off its back.

In the wake of the outage, the Electronic Frontier Foundation has called for tougher antitrust enforcement.

“Today’s empires of industry exert more and more influence on our day to day life, building a greater lock-in to their monoculture. When they fail, the scale and impact rival those of a government shutdown,” the EFF says.

“We deserve a more stable and secure digital future, where an error code puts lives at risk. Vital infrastructure cannot be built on a digital monoculture. To do this, antitrust enforcers, including the FTC, the Department of Justice (DOJ), and state attorneys general must increase scrutiny in every corner of the tech industry to prevent dangerous levels of centralization.”
