Disaster Recovery, or How to Get Your Business Back on Track

Restoring business operations quickly after unplanned IT downtime keeps financial and reputational losses to a minimum. However, a relatively painless return to normality is possible only if the company has an adequately prepared Disaster Recovery plan.

Disaster Recovery Plan

According to Vertiv and the Ponemon Institute research agency, the average company loses around $505,000 due to a 90-minute outage. The causes of downtime can range from natural disasters, hardware or software problems, sabotage, employee mistakes, and blackouts to malicious attacks carried out, for example, by ransomware.

According to a survey conducted by LogicMonitor, as many as 53% of IT decision-makers believe that there is a high probability that their company will experience a failure severe enough to make the headlines. Many companies are introducing solutions to offset the effects of cyber-attacks, natural disasters, or simply human carelessness. The systems and procedures involved are constantly evolving. Until recently, backup and recovery efforts focused mainly on backup, while data recovery tests were performed once a year or not at all. As a result, although administrators ticked off backup in their schedule, restoring the environment after a disaster was not always possible.

However, growing cyber gang activity is prompting businesses to become more involved in Disaster Recovery (DR) processes. According to ESG, 75% of the companies surveyed have a Recovery Time Objective (RTO) of 4 hours or less. When determining the RTO, it is crucial to establish how long the organization can afford to have its software unavailable before the downtime leads to serious losses. For e-commerce activities, even an RTO of a few hours is difficult to accept. So what can be done to tailor a DR plan to an organization’s individual needs?

  • The first step is an initial risk assessment and evaluation of the impact of incidents on the business. At this stage, the weakest points in the infrastructure are identified, and the likely risks are analyzed.
  • The next move is to establish a DR strategy and a detailed recovery plan for different levels of downtime.
  • Finally, companies should train staff on disaster recovery procedures and manage external relationships.
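
When the recovery plan is drafted, one way to arrive at an RTO is to work backwards from the cost of downtime. The Python sketch below is only a rough illustration and not a method taken from the surveys cited above. The function name, the hourly loss, and the loss ceiling are hypothetical inputs that each business would have to supply, and the calculation assumes that losses grow linearly with downtime.

```python
def max_tolerable_downtime_hours(loss_per_hour: float, max_acceptable_loss: float) -> float:
    """Return the longest outage (in hours) that keeps losses under the ceiling.

    Assumption: losses grow linearly with downtime, which is a simplification.
    """
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return max_acceptable_loss / loss_per_hour


# Hypothetical example: an online shop that loses 20,000 per hour of downtime
# and can absorb at most 50,000 in losses per incident.
rto_ceiling = max_tolerable_downtime_hours(20_000, 50_000)
print(f"RTO should not exceed {rto_ceiling:.1f} hours")
```

With these made-up figures the ceiling is 2.5 hours, so a 4-hour RTO would be too loose for this business.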

Developing a DR plan is not a one-off process and requires continuous improvement. In the event of a disaster, outage, ransomware alert, or other disruption, you need to be able to get your organization back online quickly. Given the rapid changes in today’s IT environments, reviewing preparedness only once a year or once a quarter may not be enough. Meanwhile, according to ESG, 45% of organizations assess their data protection strategy once a year and modify it if necessary. Experts recommend that the final steps in the DR planning process include audit and review plans and the establishment of a continuous improvement process for the entire IT DR program.

Business dependence on new technologies

Security experts recommend that the most important business functions performed by an enterprise and the IT systems that support them should be identified before risk assessment. This analysis makes it easier to identify the consequences of disrupting core business processes:

  • loss of customers,
  • deterioration in financial performance,
  • a reduction in employee productivity,
  • disruption of supply chains,
  • loss of reputation.

A risk analysis should be carried out only after identifying the most important business activities and the systems and data that support them. In coordination with IT departments, managers should pay particular attention to several issues:

  • understanding the operation of each business unit;
  • identification of critical business unit processes that depend on IT;
  • the financial value of critical business processes (for example, revenue generated per hour);
  • dependence on external organizations, especially cloud services;
  • data requirements;
  • system requirements;
  • the minimum time needed to restore data to its previous state of use;
  • the minimum time required to restore IT operations to normal or near-normal mode;
  • the minimum number of staff required to carry out the activity.
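
The items in the list above can be captured in a simple worksheet, one record per business process. The following Python sketch shows one possible layout. The class name, field names, and sample figures are hypothetical and serve only to illustrate how the financial value of each process could drive the order in which processes are analyzed.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessProcess:
    """One row of a hypothetical business impact analysis worksheet."""
    name: str
    business_unit: str
    revenue_per_hour: float        # financial value of the process
    data_restore_hours: float      # minimum time to restore data
    service_restore_hours: float   # minimum time to restore IT operations
    min_staff: int                 # minimum staff needed to run the process
    external_dependencies: list = field(default_factory=list)  # e.g. cloud services


def rank_by_financial_value(processes):
    """Return the processes ordered from most to least valuable per hour."""
    return sorted(processes, key=lambda p: p.revenue_per_hour, reverse=True)


# Made-up sample data for illustration only.
processes = [
    BusinessProcess("Payroll", "Finance", 500.0, 4.0, 8.0, 2),
    BusinessProcess("Online checkout", "Sales", 12_000.0, 0.5, 1.0, 3, ["payment gateway"]),
    BusinessProcess("Order fulfilment", "Logistics", 3_000.0, 2.0, 4.0, 5, ["courier API"]),
]

for p in rank_by_financial_value(processes):
    print(f"{p.name:<18} {p.revenue_per_hour:>10,.0f} per hour")
```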

How to assess the risk of an outage

The risk assessment examines internal and external threats and vulnerabilities that adversely affect IT assets. The IT department most often focuses on one or more of the following risk scenarios, each of which would harm the organization’s ability to do business:

  • loss of access,
  • data loss,
  • loss of function,
  • loss of skills,
  • loss of control.

The key issue is whether the company uses its own server room and resources or cloud services. In the first option, DR planning and management can be handled comprehensively on site. In contrast, the cloud model entrusts control of many functions to a third party. As a result, IT bosses need to decide whether the risk of using the cloud is worth taking. While there are many ways to conduct a risk analysis, it is best to start with the simple approach presented in the table below.

Table 1 shows a simple approach that can be easily implemented. The challenge is to review assumptions about risk factors with senior management.

Table 1. A simple risk assessment

Event                                                      Likelihood (a)  Impact (b)  Risk weight factor (a × b)
Fire in datacenter                                         0.4             0.9         0.36
Loss of power                                              0.5             0.8         0.40
Loss of staff                                              0.8             0.5         0.40
Severe weather                                             0.4             0.9         0.36
Water leak in datacenter                                   0.3             0.6         0.18
Theft of user data caused by rogue cloud service employee  0.4             0.9         0.36
Employee forgot to log off workstation                     0.6             0.3         0.18
Cloud vendor had a major security breach                   0.7             0.9         0.63

Table 1 shows realistic examples of risk events in local and cloud environments. Based on experience and available statistics, e.g., from insurance companies, it is possible to estimate the probability of each event occurring on a scale from 0 to 1. The same can be done for the event’s impact, again on a scale from 0 to 1 (0.0 = no impact and 1.0 = total loss of operation). The last column contains the probability multiplied by the impact. This product is the risk weight factor. The events with the highest risk weight factors are the ones that DR plans should address.
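
The calculation behind Table 1 is easy to automate. The short Python sketch below multiplies likelihood by impact for each event in the table and sorts the results from highest to lowest. The likelihood and impact values come from Table 1, while the variable and function names are illustrative only.

```python
# (event, likelihood, impact) as listed in Table 1
RISK_EVENTS = [
    ("Fire in datacenter", 0.4, 0.9),
    ("Loss of power", 0.5, 0.8),
    ("Loss of staff", 0.8, 0.5),
    ("Severe weather", 0.4, 0.9),
    ("Water leak in datacenter", 0.3, 0.6),
    ("Theft of user data caused by rogue cloud service employee", 0.4, 0.9),
    ("Employee forgot to log off workstation", 0.6, 0.3),
    ("Cloud vendor had a major security breach", 0.7, 0.9),
]


def rank_risks(events):
    """Return (event, likelihood * impact) pairs, highest product first."""
    # Rounding to two decimals hides floating-point noise such as 0.36000000000000004.
    scored = [(name, round(likelihood * impact, 2)) for name, likelihood, impact in events]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, weight in rank_risks(RISK_EVENTS):
    print(f"{weight:.2f}  {name}")
```

With the values from Table 1, the cloud vendor security breach comes out on top (0.63), followed by loss of power and loss of staff (0.40 each).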

Backing up

An essential part of any contingency plan is backing up critical data. This plan should take into account, among other things, the resources to be backed up, the frequency of backups, and where they are stored. The Disaster Recovery Plan document should include two indicators – Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

The RPO indicates the maximum acceptable period between the last backup (or data transfer to the Disaster Recovery Center) and the failure, that is, how much recent data the organization can afford to lose. The RPO should be just a few minutes if significant changes are made to the application every few minutes. If updates occur less frequently or are of lesser importance, the RPO may be a few hours.

Recovery Time Objective (RTO) is a measurement that indicates the maximum time for recovering from a disaster – that is, the maximum amount of time a system can be offline. The RTO is often a Service Level Agreement (SLA) component. The lower the Recovery Point Objective and Recovery Time Objective, the higher the cost of disaster recovery.
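
Both indicators can be checked against what actually happened during an incident or a recovery test. The Python sketch below is a minimal illustration. The function names, timestamps, and target values are hypothetical and are not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def rpo_met(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data lost (time since the last backup) is within the RPO."""
    return failure - last_backup <= rpo


def rto_met(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if the system was back online within the RTO."""
    return restored - failure <= rto


# Hypothetical incident: last backup at 11:50, failure at 12:00, service restored at 17:30.
last_backup = datetime(2024, 1, 10, 11, 50)
failure = datetime(2024, 1, 10, 12, 0)
restored = datetime(2024, 1, 10, 17, 30)

print(rpo_met(last_backup, failure, timedelta(minutes=15)))  # 10 minutes of data lost
print(rto_met(failure, restored, timedelta(hours=4)))        # 5.5 hours offline
```

In this made-up incident the 15-minute RPO is met, but the 4-hour RTO is missed, which is exactly the kind of gap that regular recovery tests are meant to reveal.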

The three biggest obstacles to Disaster Recovery planning

Skillful implementation of DR in an organization minimizes downtime and reduces the risk of data loss when infrastructure fails. The time invested is therefore well spent, but there are various obstacles along the way. Special attention should be paid to three issues:

  • implementation costs,
  • the location of the data to be recovered, and
  • vendor lock-in risk.

The hard part is not setting an ambitious RTO and RPO but adapting them to the company’s current needs and the business value of the data. Security experts recommend creating multiple layers of DR to set SLAs effectively. On the other hand, they advise against combining several point technologies and consider a single solution the better choice. It is also worth ensuring that critical data can be recovered locally, in the cloud, or from the cloud. Finally, limiting disaster recovery to a specific facility, hardware, or hypervisor can only increase the cost of the overall DR process.
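
One way to picture multiple layers of DR is as a small set of tiers, each with its own RTO and RPO, to which systems are assigned according to their business value. The Python sketch below illustrates the idea. The tier names, thresholds, and objectives are entirely hypothetical and would need to be replaced with figures agreed with the business.

```python
from datetime import timedelta

# Hypothetical tiers: (name, minimum business value per hour, RTO, RPO),
# ordered from the most to the least demanding.
DR_TIERS = [
    ("tier 1", 10_000, timedelta(minutes=15), timedelta(minutes=5)),
    ("tier 2", 1_000, timedelta(hours=4), timedelta(hours=1)),
    ("tier 3", 0, timedelta(hours=24), timedelta(hours=12)),
]


def assign_tier(value_per_hour: float):
    """Return (tier name, RTO, RPO) for a system with the given hourly business value."""
    for name, threshold, rto, rpo in DR_TIERS:
        if value_per_hour >= threshold:
            return name, rto, rpo
    raise ValueError("value_per_hour must not be negative")


for system, value in [("online checkout", 25_000), ("reporting", 1_000), ("intranet wiki", 50)]:
    tier, rto, rpo = assign_tier(value)
    print(f"{system}: {tier}, RTO {rto}, RPO {rpo}")
```

Keeping the number of tiers small keeps the cost in check, since the lowest RTO and RPO values are paid for only where the business value justifies them.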

What a good DR plan should be like

As the importance and value of data increase, so does the need to protect and recover it. An organization needs to recover data quickly and with as little disruption as possible. A solution that delivers full disaster recovery readiness must have several key features:

  • simplicity: easy management and automated reporting through a single platform and an intuitive user interface;
  • flexibility: fast replication of virtual machines, applications, and snapshots;
  • cost-effectiveness: possibility of autoscaling up as well as down to control expenditure;
  • security: protection against ransomware and comprehensive security with encryption for data both at rest and in transfer;
  • versatility: flexible copy data management to verify recovery and DR replicas and to support DevOps, testing, and analysis.

No one has the recipe for a perfect DR plan, especially as it is difficult to estimate the ingenuity of hackers, the stupidity of employees, or the scale of natural disasters. However, sticking to specific guidelines and rules ensures that you can get your business back on its feet relatively quickly and efficiently when disaster strikes.

Text written by: Paweł Mączka, CTO at Storware