Surviving and recovering from network interruptions
Ongoing changes to network and security device configuration are unavoidable and necessary for business. But they are also risky. They can have unexpected consequences – from service interruptions to performance degradation and even downtime. How can you reduce the risk associated with configuration changes? Here is a 3-tier strategy:
1. Reduce the likelihood of configuration errors
- Monitor and review changes
- Establish change procedures and processes
- Establish a test plan for all changes
2. Detect problems as early as possible
- Monitor the environment
- Listen to your users
3. Ensure that you can make a fast recovery if something goes wrong
- Maintain accessible, actionable audit information
- Establish standard recovery procedures
Finally, implementing solutions that automate error-prone, repetitive tasks and maintain vigilance 24 hours a day goes a long way toward preventing, and recovering from, human configuration errors.
Monitor and review changes
Even if they look simple, all configuration changes should be monitored and reviewed. For example, suppose you add a host to a network group to grant it access, unaware that the same group is used elsewhere to block traffic. Another pair of eyes will often catch something you missed.
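As a concrete illustration of why review matters, even a small script can list every rule that references an object group before anyone touches it. The sketch below assumes a hypothetical JSON export of the rule base; the format, group names and rule IDs are invented.

```python
import json

# Hypothetical rule-base export: each rule names the object groups it references.
RULEBASE = json.loads("""
[
  {"id": 101, "action": "allow", "source": "Branch-Hosts", "destination": "App-Servers"},
  {"id": 207, "action": "deny",  "source": "Branch-Hosts", "destination": "Finance-DB"}
]
""")

def rules_referencing(group_name, rules):
    """Return every rule that mentions the given object group."""
    return [r for r in rules if group_name in (r["source"], r["destination"])]

# Before adding a host to "Branch-Hosts", list everything the change will touch.
for rule in rules_referencing("Branch-Hosts", RULEBASE):
    print(f"rule {rule['id']}: {rule['action']} {rule['source']} -> {rule['destination']}")
```

Running this surfaces both the allow rule and the deny rule, which is exactly the hidden dependency the example describes.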
Establish change procedures and processes
Change requests must be communicated consistently so that the right people can review them and assess their impact. Many problems can be avoided simply with good communication. Some organizations schedule weekly change review meetings to understand and plan complex changes. But the most effective way to ensure that changes are reviewed and approved is by enforcing a change process workflow.
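One way to enforce a change process workflow is to make the allowed state transitions explicit, so a change cannot be implemented before it has been reviewed and approved. The states and actors in this minimal sketch are assumptions, not a reference to any particular ticketing product.

```python
# Minimal change-workflow sketch: a change may only move forward through these
# states, so "implemented" is unreachable without review and approval.
ALLOWED_TRANSITIONS = {
    "requested":    {"under_review"},
    "under_review": {"approved", "rejected"},
    "approved":     {"implemented"},
    "implemented":  {"verified"},
}

class ChangeRequest:
    def __init__(self, summary):
        self.summary = summary
        self.state = "requested"
        self.history = [("requested", None)]

    def advance(self, new_state, actor):
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append((new_state, actor))

cr = ChangeRequest("Add host 10.1.2.3 to Branch-Hosts")
cr.advance("under_review", "alice")
cr.advance("approved", "security-team")
cr.advance("implemented", "bob")
print(cr.state, cr.history)
```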
Establish a test plan for all changes
It may sound surprising, but many changes are not tested until hours or days after implementation, and some are never tested at all. A test plan for every change is a critical part of the change process. Sometimes this is not as easy as it sounds and involves coordinating end users, business partners, and professional testers, but the work you put in here will give your team a reputation for doing things right.
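A test plan does not have to be elaborate to be useful. The sketch below simply checks that the services a change was supposed to open are reachable and that services it must not touch stay blocked; the host names and ports are placeholders.

```python
import socket

# Placeholder test plan: (host, port, should_connect)
TEST_PLAN = [
    ("app.example.internal", 443, True),            # the change should open this
    ("finance-db.example.internal", 1433, False),   # this must stay blocked
]

def can_connect(host, port, timeout=3):
    """Attempt a TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port, expected in TEST_PLAN:
    actual = can_connect(host, port)
    status = "PASS" if actual == expected else "FAIL"
    print(f"{status} {host}:{port} reachable={actual} expected={expected}")
```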
Monitor the environment
The firewall environment should be continuously monitored, and abnormal behavior should automatically trigger alerts. That environment might include the operating system, the network interfaces, the firewall software and hardware, and the firewall rule base. Alerts and events should be analyzed and correlated and, if necessary, escalated for a closer look.
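Monitoring can start with something as simple as taking a periodic fingerprint of the running configuration and alerting when it changes outside an approved window. In this sketch, fetch_running_config() and send_alert() are placeholders for whatever your environment actually provides.

```python
import hashlib
import time

def fetch_running_config():
    """Placeholder: return the device's running configuration as text
    (in a real environment, via SSH, an API, or a vendor SDK)."""
    return "interface eth0\n rulebase v42\n"

def send_alert(message):
    """Placeholder: hook into email, chat, or an incident tool."""
    print("ALERT:", message)

def watch(interval_seconds=300):
    """Compare periodic config fingerprints and alert on any change."""
    baseline = hashlib.sha256(fetch_running_config().encode()).hexdigest()
    while True:
        time.sleep(interval_seconds)
        current = hashlib.sha256(fetch_running_config().encode()).hexdigest()
        if current != baseline:
            send_alert(f"configuration changed (was {baseline[:12]}, now {current[:12]})")
            baseline = current

# watch() would normally run as a small daemon or scheduled job.
```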
Listen to your users
A helpdesk should be in place so that users can easily report problems. It should be staffed with trained personnel and have clear processes for handling incidents, including a plan for correlating multiple incidents to a single underlying problem. Each support tier should have tools to assist root cause analysis before escalating to the next level.
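Correlating multiple incidents to a single problem can begin with grouping open tickets by the service they report against within a short time window. A rough sketch, using made-up ticket data:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up helpdesk tickets: (opened_at, affected_service)
TICKETS = [
    (datetime(2024, 5, 1, 9, 2),  "vpn"),
    (datetime(2024, 5, 1, 9, 5),  "vpn"),
    (datetime(2024, 5, 1, 9, 7),  "email"),
    (datetime(2024, 5, 1, 9, 11), "vpn"),
]

def correlate(tickets, window=timedelta(minutes=30), threshold=3):
    """Flag services with several reports clustered inside one time window."""
    by_service = defaultdict(list)
    for opened_at, service in tickets:
        by_service[service].append(opened_at)
    suspects = []
    for service, times in by_service.items():
        times.sort()
        if len(times) >= threshold and times[-1] - times[0] <= window:
            suspects.append(service)
    return suspects

print(correlate(TICKETS))   # ['vpn'] -> likely one underlying problem
```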
Maintain accessible, actionable audit information
Each and every change must be documented properly and recorded in an audit trail. A comprehensive audit trail should include the target device, the exact time of the change, the configuration details, the people who were involved (requestor, approvers, implementer), and the change context such as the project or application.
But a detailed audit trail is not enough on its own. The information must also be presented in an easy-to-read format so that you can easily access it when needed. Additionally, you’ll want to have filtering and querying capabilities on top of the data to speed up searches and lookups.
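Storing the audit trail in a structured form is what makes filtering and querying practical. A minimal sketch using SQLite, with column names that simply mirror the fields listed above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_trail (
        changed_at   TEXT,   -- exact time of the change
        device       TEXT,   -- target device
        details      TEXT,   -- configuration details
        requestor    TEXT,
        approver     TEXT,
        implementer  TEXT,
        context      TEXT    -- project or application
    )
""")
conn.execute(
    "INSERT INTO audit_trail VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("2024-05-01T09:00:00", "fw-branch-01",
     "added host 10.1.2.3 to Branch-Hosts",
     "alice", "security-team", "bob", "branch-rollout"),
)

# Filtering and querying on top of the data: recent changes to one device.
rows = conn.execute(
    "SELECT changed_at, details, implementer FROM audit_trail "
    "WHERE device = ? AND changed_at >= ? ORDER BY changed_at DESC",
    ("fw-branch-01", "2024-05-01T00:00:00"),
).fetchall()
print(rows)
```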
Prepare for rapid recovery
Now comes the incident. Despite everything, something bad has happened and you need to respond. You will be judged by the time it took to recover, so you want to be well-prepared with tools, staff and processes to handle this. You want to keep stress down to a minimum.
If you have set up the procedures above, you are already in good shape. Either you caught the problem during the change process or, if it slipped through, you can detect it early, before users and services are affected. Thanks to the audit trail, you know exactly what has changed recently, by whom, and why. Most recovery time is typically spent figuring out what changed, so if you already know, recovery will be much faster: a few quick queries pinpoint the likely suspects, and you can roll back the offending changes quickly.
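If each audit record also carries a snapshot of the configuration taken before the change, rollback largely reduces to retrieving and reapplying that snapshot. The sketch below works under that assumption; apply_config() stands in for whatever mechanism you use to push configuration.

```python
# Assumes each audit record carries a pre-change snapshot of the configuration.
AUDIT_TRAIL = [
    {"changed_at": "2024-05-01T09:00:00", "device": "fw-branch-01",
     "details": "added host 10.1.2.3 to Branch-Hosts",
     "config_before": "rulebase v41 ..."},
]

def apply_config(device, config_text):
    """Placeholder: push a configuration back to the device."""
    print(f"pushing rollback config to {device}:\n{config_text}")

def roll_back_last_change(device, audit_trail):
    """Find the most recent recorded change for a device and restore
    the configuration that was in place before it."""
    changes = [c for c in audit_trail if c["device"] == device]
    if not changes:
        raise LookupError(f"no recorded changes for {device}")
    last = max(changes, key=lambda c: c["changed_at"])
    apply_config(device, last["config_before"])

roll_back_last_change("fw-branch-01", AUDIT_TRAIL)
```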
There are a number of tools on the market that can help you control changes, detect problems, and recover from errors, and they can make your life a whole lot easier. Look for a solution that provides:
- A complete audit trail with full accountability and integration with ticketing systems
- Comprehensive change reports and side-by-side diffs for rule bases, objects and textual configurations (a minimal diff sketch follows this list)
- Real-time change notifications with filtering (by change type, device, affected networks)
- Central console for viewing all recent changes across all devices regardless of vendor and model
- A policy analysis tool for determining which firewalls and rules are blocking services across an environment
- Rule and object change history reports
- Business process automation to manage the change process and integration with existing ticketing systems.
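As an illustration of the diff capability mentioned in the list above, even Python's standard difflib can show what changed between two textual configuration exports (its HtmlDiff class can render them side by side). The configuration snippets here are invented.

```python
import difflib

# Invented "before" and "after" exports of the same object group.
before = """object-group Branch-Hosts
 host 10.1.2.1
 host 10.1.2.2
""".splitlines()

after = """object-group Branch-Hosts
 host 10.1.2.1
 host 10.1.2.2
 host 10.1.2.3
""".splitlines()

for line in difflib.unified_diff(before, after,
                                 fromfile="fw-branch-01 (yesterday)",
                                 tofile="fw-branch-01 (today)",
                                 lineterm=""):
    print(line)
```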
You can recover from configuration mistakes – it’s a case of putting in the right rules and procedures and combining these with the right tools.