Disaster recovery is usually a subset of activities around a larger business continuity initiative in an organization. For this article, we will describe disaster recovery, or DR, as the activities required to restore availability of the computing environment to the organization.
IT teams inside organizations have been practicing DR for years, often by implementing do-it-yourself (DIY) DR in the form of a second stack of equipment located somewhere off site. Often these DIY efforts are missing key elements. One or more of the following 10 steps are commonly left out of the DIY DR process:
- Aligning with an existing business continuity plan from the operations and leadership team: Yes, we are assuming that a business continuity effort exists in the organization. If it does, the IT team’s DR plan needs to align well with the steps and timing of the business continuity plan to bring the business back from an outage.
- Inventorying applications and assigning them to two tiers, critical and non-critical: While many companies define unique recovery times and recovery points for each application, a two-tier approach can simplify the IT recovery effort by simplifying conversations with the business. Additional tiers can be added as the DR plan is implemented, tested and refined.
- Discussing application usage, recovery point objectives (RPOs), and recovery time objectives (RTOs) with the operational team’s individual contributors as well as the leaders: This must be done from a business perspective as well as an IT perspective. Leaders often do not understand the daily usage of critical business applications and the hidden dependencies that the usage patterns create. Individual contributors often have the best insight into what a recovery should look like.
- Understanding system and application dependencies: The DR plan must define the applications that need to be available first, second, third and so on, based on both business and technical requirements. Too many times we have seen hidden dependencies torpedo an actual DR scenario because the dependencies were not defined or not tested up front.
- Clearly understanding the IT staff effort required to recover the top-tier applications: Will it take 36 hours or 360 hours? Do you understand the most critical points in the process? What if the IT staff is unavailable? Staff effort is almost universally underestimated. Make sure your team’s expectations are realistic.
- Building as much automation into the disaster recovery process as your company can afford: Data replication has become easier over the last few years, with dozens of tools available to make sure you have copies of your data assets. But network failover is still very difficult. The location of the primary and secondary facilities and the availability of your network engineers all play a role in the complexity of failing over the network. Automating both the data and the network failover is critical to reliable DR. Depending on heroics from your IT staff in a post-disaster environment is simply bad planning.
- Testing, testing and more testing: The DR environment should be tested multiple times per year with full post-mortems on the outcomes. Lack of testing is a huge contributor to failure in the DR environment.
- Including disaster recovery in all business discussions: Disaster recovery belongs in all application conversations, whether discussing new applications or changes to existing systems. Complexity, risks, and costs can increase if disaster recovery is left out of the planning process.
- Updating the secondary environment EVERY TIME a change is made to the primary compute environment: Your DR plan won’t work without it.
- Ensuring DR process continuity when IT staff changes: Is the DR process documented, or is it all in one person’s head? Is there internal or external continuity of expertise for the DR process? The easiest way to reduce risk is to test after a staff change.
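To make the tiering and dependency steps in the list above more concrete, here is a minimal sketch of how a recovery order could be derived from an application inventory. It is an illustration under stated assumptions, not a prescribed method: the application names, the `INVENTORY` structure, and the `recovery_waves` function are all hypothetical. It uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) to group applications into "waves" that can be restored once everything they depend on is back, with critical applications listed first inside each wave.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical inventory: application -> (tier, applications it depends on).
# Names and tiers are invented for illustration only.
INVENTORY = {
    "dns": ("critical", set()),
    "database": ("critical", {"dns"}),
    "auth": ("critical", {"dns", "database"}),
    "claims-portal": ("critical", {"auth", "database"}),
    "reporting": ("non-critical", {"database"}),
    "intranet-wiki": ("non-critical", {"auth"}),
}


def recovery_waves(inventory):
    """Return a list of waves; each wave lists apps whose dependencies
    were all restored in earlier waves, critical tier first."""
    # Catch "hidden" dependencies that were never inventoried.
    known = set(inventory)
    unknown = {d for _, deps in inventory.values() for d in deps} - known
    if unknown:
        raise ValueError(f"dependencies missing from inventory: {sorted(unknown)}")

    sorter = TopologicalSorter({app: deps for app, (_, deps) in inventory.items()})
    sorter.prepare()  # raises CycleError if the dependencies are circular

    waves = []
    while sorter.is_active():
        # Sort key: critical apps first (False sorts before True), then by name.
        ready = sorted(
            sorter.get_ready(),
            key=lambda app: (inventory[app][0] != "critical", app),
        )
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(INVENTORY), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

A check like this does not replace a real DR test, but running it whenever the inventory changes is one inexpensive way to surface undefined or circular dependencies before an actual outage does.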
How are companies addressing these overlooked disaster recovery steps? By using industry-recognized Disaster Recovery as a Service (DRaaS) to automate the DR process, drive complexity out, and facilitate easier testing multiple times per year to ensure successful recoveries. Read about how Bouchard Insurance used DRaaS to preemptively avoid the ravages of a hurricane and keep their critical systems up and running here. If you want to know more about how DRaaS might help your company reduce risk, contact me.