Two strategies to protect your business from the next large-scale tech failure

The CrowdStrike event in July clearly demonstrated the risks of allowing a software vendor deep access to network infrastructure. It also raised concerns about the concentration of digital services in the hands of a few companies. A prescient Reddit post noted CrowdStrike is a threat vector for many of the world’s largest corporations, as well as a gold mine of data.


Given the worldwide computer shutdowns following CrowdStrike’s failed update on July 19, 2024, prudent executives are asking, “How can I prevent something similar from happening again?”

Given the market concentration in big tech, it is entirely possible that such a widespread outage could happen again. According to Synergy Research Group, the three leading cloud providers (Amazon, Microsoft and Google) account for 67% of the worldwide cloud market. Amazon alone commanded 31% of the market at the end of 2023.

Two strategies could mitigate the effect of similar software failures: diversifying your network infrastructure and practicing for failure. Before we discuss defensive actions, let’s discuss the risks of inviting CrowdStrike or other third-party software suppliers into your business.

CrowdStrike crash was the tip of the iceberg

Granting device access to an outside software or services supplier brings with it the risk of:

  • Losing access to network functionality (as occurred with the CrowdStrike event)
  • Unauthorized access to data (is your IP and customer data safe?)
  • Visibility of your business activities through aggregated data

Further, your data security is now dependent on the security practices of a cybersecurity company or cloud services provider.

Consider “mobile device management” or “device monitoring” tools. Most of these are essentially rootkits that give a third party 100% control over your company’s machines. That seems ill-advised for any company with proprietary intellectual property it wants to keep secret.

Yes, CrowdStrike screwed up and took down several million Windows computers in a spectacular fashion. But crashing Windows computers is just the tip of the iceberg. The larger threat, which we have collectively and conveniently overlooked, is that some other entity holds power over your business operations.

Advanced security software is essential, but you’re giving someone else the keys to your network under the guise of providing security dashboards.

People worry about Facebook tracking and turn off third-party cookies for their private life, but software like CrowdStrike’s can watch, monitor and track every corporate computer, from the lowest intern right up to the CEO. Cookies are the least of your worries.

Now, even if CrowdStrike is reliable and their software works as intended, what happens if someone hacks CrowdStrike? The attacker would theoretically have access to airlines’ networks, banking networks, and a who’s who of global enterprises. This worries me. It must be evaluated as a risk if you grant a supplier such extensive network access.

So, as a CIO or CISO, how do you mitigate the risk of another large-scale failure by these big-tech players?

Prepare for failure: Plan for it, practice it, expect it

The key to mitigating another large-scale system failure is to plan for catastrophic events and practice your response. Make dealing with failure part of normal business practices. When failure is unexpected and rare, the processes to deal with it are untested and may even result in actions which make the failure worse.

Build a network and a team that can adapt and react to failures. Remember when insurance companies ran their own data centres and disaster recovery tests were conducted twice a year? Few companies go that far with contingency planning anymore, but some, like Netflix, are setting a good example with chaos engineering. Netflix’s Chaos Monkey open-source software introduces intentional disruptions to a system, simulating real-world failures to test a system’s resilience.

Be more like Netflix and less like Delta Air Lines, whose critical crew tracking system was offline for the better part of a week following the CrowdStrike update.
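To make the chaos-engineering idea concrete, here is a minimal Python sketch of a failure drill. It is not Chaos Monkey and does not use its API; the Service class, the function names and the replica setup are all hypothetical, chosen only to illustrate the pattern of deliberately breaking one component and checking whether the system still answers.

```python
import random


class Service:
    """Hypothetical stand-in for one replica of a service."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas failed")


def inject_failure(replicas, rng):
    """Chaos step: knock out one randomly chosen replica and return it."""
    victim = rng.choice(replicas)
    victim.healthy = False
    return victim


def run_drill(replica_count=3, seed=0):
    """Break one replica on purpose, then check the system still responds."""
    rng = random.Random(seed)
    replicas = [Service(f"replica-{i}") for i in range(replica_count)]
    victim = inject_failure(replicas, rng)
    try:
        call_with_failover(replicas, "ping")
        survived = True
    except RuntimeError:
        survived = False
    return victim.name, survived
```

Running run_drill() with three replicas should report that the system survived, while run_drill(replica_count=1) should not: a single point of failure shows up in the drill rather than in production. A real drill would target actual infrastructure under controlled conditions, but the habit is the same.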

Diversify your suppliers and systems

The second strategy for minimizing large-scale failures is to avoid the software monoculture created by the concentration of digital tech suppliers. It’s more complex but worth it.

Some corporations have a policy of buying their core networking equipment from three or four different vendors. Yes, it makes day-to-day management a little more difficult, but they have the assurance that if one vendor has a failure, their entire network is not toast. Whether it’s tech or biology, a monoculture is extremely vulnerable to epidemics which can destroy the entire system.

In the CrowdStrike scenario, if corporate networks had been a mix of Windows, Linux and other operating systems, the damage would not have been as widespread.
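A rough way to reason about this is to estimate the share of your fleet that a single platform-specific failure would take down. The Python sketch below does that arithmetic; the blast_radius function and the fleet numbers are invented for illustration and are not drawn from any real company.

```python
def blast_radius(fleet, failed_platform):
    """Return the fraction of machines running the platform that failed.

    fleet maps a platform name to a machine count (illustrative data only).
    """
    total = sum(fleet.values())
    if total == 0:
        return 0.0
    return fleet.get(failed_platform, 0) / total


# Hypothetical fleets: one monoculture, one mixed.
monoculture = {"windows": 1000}
mixed = {"windows": 500, "linux": 300, "macos": 200}

print(blast_radius(monoculture, "windows"))
print(blast_radius(mixed, "windows"))
```

In the monoculture fleet every machine is exposed to a Windows-only fault, while in the mixed fleet half of the machines keep running. The same calculation can be applied to any single dependency, such as a security agent, a cloud region or a network vendor.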

For the “diversify your systems” school of thought, the Rogers Communications outage in Canada in July 2022 stands as a cautionary example. The Canadian telecom provider experienced a major service outage of its cable Internet and cellular networks, affecting more than 12 million users for up to 26 hours.

Recovery efforts were hampered because most Rogers employees were themselves users of the Rogers cellular and internet systems that crashed. Workers who weren’t at the office couldn’t access the internet or even use their cell phones. A third-party review noted that Rogers staff could not access critical error logs detailing the root cause of the outage until 14 hours later.

Conclusion

Third-party software suppliers and cloud services are an integral part of the IT landscape, but if we want to minimize the risk to our businesses, we must resist the temptation to put all our eggs in one basket.

The lessons from CrowdStrike are: Diversify your suppliers and systems, and dust off your contingency plans.
