Resilience over reliance: Preparing for IT failures in an unpredictable digital world

No IT system, no matter how advanced, is completely immune to failure. The promise of a digital ring of steel may sound attractive, but can it protect you against hardware malfunctions? Software bugs? Unexpected environmental conditions? Cybersecurity threats? Human error? And that’s just for starters.

IT failures

As Gartner explained last year in How to Prepare for Cloud Outages: “All systems are subject to failure. We cannot purchase hardware that never breaks, we cannot build software that is entirely bug-free – and, most importantly, we must always live with human error. It is impossible not to make errors that can potentially cause downtime, degrade service, or result in data loss. However, we can try to reduce the impact of failures.”

For me, that should be the starting point for any conversation about cybersecurity, whether it’s in the boardroom or a security operations center, because it is only by being prepared and on our toes that we can have any hope of staying safe. And when the worst happens, as it eventually will, having the right processes and protocols in place means organizations are ready to recover.

Diversification of IT infrastructure

One of the first things to address is the lack of robust recovery mechanisms that do not depend on a single system. Relying on one system, the putting-all-your-eggs-in-one-basket approach, has been a strategy employed by IT departments for decades in the interest of reducing cost and simplifying operations. But thanks to the increasing complexity and scale of modern operations, this once reliable strategy is now starting to show cracks.

This was highlighted earlier this summer when the global CrowdStrike outage hit everything from healthcare to transport infrastructure.

The outage wasn’t the result of a malicious action, but the impact reverberated around the world and showed how easily things can grind to a halt when a problem arises. It exposed a risk inherent in our current IT systems: a single failure can lead to widespread disruption.

That’s why IT teams need to consider diversification, using a “platform of platforms” in which different systems can operate, and be restored, independently while supporting each other during crises.

By embracing multiple vendors and hybrid cloud environments, organizations would be better prepared so that if one platform goes down, the others can pick up the slack. While this strategy increases the complexity of the ecosystem, it reduces the risk an organization has to accept, because it is better prepared to recover from, and more resilient to, widespread outages in complex, hybrid, and cloud-based environments.
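To make the idea of one platform picking up the slack for another concrete, here is a minimal sketch of ordered failover. It is an illustration under stated assumptions, not a production design: the function call_with_failover and the two stand-in platforms are hypothetical names invented for this example, and a real deployment would also need health checks, timeouts, and data synchronization between platforms.

```python
from typing import Callable, Sequence, Tuple


class AllPlatformsFailed(Exception):
    """Raised when every configured platform fails to serve the request."""


def call_with_failover(
    platforms: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each platform in order; return (name, result) from the first that succeeds."""
    errors = []
    for name, handler in platforms:
        try:
            return name, handler()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllPlatformsFailed("; ".join(errors))


# Hypothetical stand-ins for two independently operated platforms.
def primary_platform() -> str:
    raise ConnectionError("primary platform is down")


def secondary_platform() -> str:
    return "served by secondary"


used, result = call_with_failover([
    ("primary", primary_platform),
    ("secondary", secondary_platform),
])
print(used, result)
```

The point of the sketch is that the caller does not depend on any single platform being available; the trade-off, as noted above, is the extra complexity of keeping more than one platform ready to serve.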

Data confidence and real-time monitoring

Taking such action isn’t something you simply bolt on. It requires a shift in thinking and a new strategic approach. It’s an admission that in today’s fast-paced world, IT teams can no longer afford to be reactive. They need complete visibility over their entire IT infrastructure with real-time access to accurate data.

This level of monitoring and foresight is crucial if they want to catch issues before they spiral into larger problems. The last thing an organization wants to do during an outage is burn valuable time collecting data that may be stale or inaccurate in order to triage and plan next steps.

The ability to detect, analyze, and address potential failures in real time is a cornerstone of effective IT management. That’s why IT teams must invest in tools that not only provide visibility but also offer automated alerts and predictive insights.
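As a rough illustration of automated alerting, the sketch below checks a batch of readings and flags both unhealthy systems and stale data. Everything in it is assumed for the example: the Reading fields, the evaluate function, and the thresholds (a 5% error rate and a 60-second freshness limit) are hypothetical, and a real monitoring tool would pull these values from live telemetry.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Reading:
    system: str
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    age_seconds: float  # how long ago the reading was collected


def evaluate(
    readings: Iterable[Reading],
    max_error_rate: float = 0.05,    # assumed threshold
    max_age_seconds: float = 60.0,   # assumed freshness limit
) -> List[str]:
    """Return one alert message per reading that is stale or over the error threshold."""
    alerts = []
    for r in readings:
        if r.age_seconds > max_age_seconds:
            # Stale data cannot be trusted for triage, so flag it first.
            alerts.append(f"{r.system}: data is stale ({r.age_seconds:.0f}s old)")
        elif r.error_rate > max_error_rate:
            alerts.append(
                f"{r.system}: error rate {r.error_rate:.0%} exceeds {max_error_rate:.0%}"
            )
    return alerts


alerts = evaluate([
    Reading("billing", error_rate=0.12, age_seconds=10),
    Reading("auth", error_rate=0.01, age_seconds=300),
    Reading("web", error_rate=0.01, age_seconds=5),
])
for message in alerts:
    print(message)
```

Treating stale data as an alert in its own right reflects the concern raised earlier: during an outage, nobody should have to discover that the numbers they are triaging from are out of date.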

But they also need defense in depth: multiple layers of security and operational controls to safeguard systems, and the resources to run them. It’s a similar approach to using multiple vendors. In this case, if one line of defense fails, others remain intact to protect the system from escalating threats.

Each layer, whether it’s firewalls, encryption, access controls, or incident response mechanisms, works in tandem with the others to ensure system resilience. For example, a breach in one system can be mitigated by other protective measures which, in turn, help to prevent a cascading failure.
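The layering described above can be sketched as a chain of independent checks, where a request must pass every one. This is a deliberately simplified illustration: the three check functions, the blocklist, and the request fields are hypothetical, and real controls live in network devices, identity systems, and response processes rather than in a single script.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, object]
# Each layer is a (name, check) pair; the check returns True to let the request through.
Layer = Tuple[str, Callable[[Request], bool]]

BLOCKED_IPS = {"203.0.113.9"}  # hypothetical blocklist (documentation address range)


def firewall(req: Request) -> bool:
    return req.get("source_ip") not in BLOCKED_IPS


def encryption(req: Request) -> bool:
    return req.get("tls") is True


def access_control(req: Request) -> bool:
    return req.get("role") == "admin"


LAYERS: List[Layer] = [
    ("firewall", firewall),
    ("encryption", encryption),
    ("access control", access_control),
]


def rejected_by(req: Request, layers: List[Layer] = LAYERS) -> List[str]:
    """Return the names of every layer that rejects the request (empty means allowed)."""
    return [name for name, check in layers if not check(req)]


# A request that gets past the firewall is still stopped by access control.
intruder = {"source_ip": "198.51.100.7", "tls": True, "role": "guest"}
print(rejected_by(intruder))
```

Because each check is evaluated independently, getting past one layer gives no advantage against the next, which is the property that helps prevent a single breach from cascading.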

Risk prevention and business continuity

It’s clear that IT failures aren’t just a possibility; they are inevitable. Simply waiting for things to go wrong before reacting is asking for trouble. Instead, organizations must get on the front foot and adopt a strategy that focuses on early detection, continuous monitoring, and risk prevention.

This means planning for worst-case scenarios, but also preparing for recovery. After all, one of the planks of IT infrastructure management is business continuity. It’s about optimal performance when things are going well while ensuring that systems recover quickly and continue operating even in the face of major disruptions.

This requires a holistic approach to IT management, where failures are anticipated, and recovery plans are in place. Investing in resilience now means fewer disruptions in the future, stronger operational stability, and ultimately, a competitive edge in today’s fast-moving digital world.

That means adopting a forward-thinking approach to IT resilience focusing on diversification, real-time monitoring, proactive risk management, and layered security. After all, the risk of failure is not a question of if, but when.
