4 Factors to Consider in Disaster Recovery

Firms are becoming ever more concerned about IT infrastructure outages due to faults or cyberattacks. For this reason, processes, policies and procedures related to restarts and maintaining critical ICT infrastructure are of growing importance.

Disaster Recovery – how to keep losses to a minimum?

Ransomware is causing havoc in networks used by companies, hospitals, colleges of higher education and government institutions. We covered the global cost of ransomware attacks in an article Cyber Attacks – The Plague of the 21st Century. One of the most drastic effects of this type of attack is the shutdown of ICT infrastructure. For example, Colonial Pipeline came to a standstill for six days due to such an attack, while CompuCom (a service management provider) got back on its feet only after two weeks.

There are cases when the victims of such attacks pay the ransom immediately so as not to delay work on important projects. One such example is the University of California, San Francisco, which paid the blackmailers 1.14 million dollars. According to Bloomberg, the university was conducting advanced tests on coronavirus antibodies. Examples like this can be recounted almost endlessly.

It is not surprising that organizations look at cyberthreats not only in terms of data leaks and the amount of ransom paid, but also from the perspective of the costs of a shutdown. Based on the calculations of the Uptime Institute, in 40% of cases these costs run to 100,000 dollars or more. As many as 95% of firms that took part in a Veeam survey had experienced an unplanned shutdown due to a fault or a cyberattack. According to the majority of respondents, the acceptable downtime for a business due to such problems should not be longer than 2 hours.

Providers of data protection solutions have been receiving many worrying indications from IT specialists who are finding it ever more difficult to meet the requirements of business departments. The principal reason for this is increasing digitization and the growth in the number of critical applications used in companies. What’s more, firms are acutely aware that the unavailability of services brings not only direct losses, but also carries additional consequences in the form of damage to their image.

There are many studies on Disaster Recovery Planning on the web. On our pages you will also find a lot of additional information on this topic. Below you will find a description of 4 factors that should be taken into account when creating a disaster recovery plan for a ransomware attack.

1. RTO – a race against time

Creating a backup is only half of the solution; complete success also requires working data recovery mechanisms. In today’s world, when clients are exceptionally impatient and want everything here and now, the time it takes to get infrastructure back online is extremely important. It is not enough to back up, for example, a whole SQL database; its smooth recovery together with the application must also be ensured.

Experts draw attention to the fact that restoring a 1 TB virtual machine takes several minutes and is not especially problematic. The real trouble begins when many systems have to be restored simultaneously following a ransomware attack. Surprisingly, there are relatively few products on the market that have effective mechanisms for data reduction and which ensure the rapid start-up of many virtual machines with the required high efficiency. The latest solutions offer a kind of orchestration of the data recovery process.

Organizations must plan the sequence in which data for different applications is restored so as to achieve the recovery time objective (RTO) and bring the most critical applications online as quickly as possible. This plan must take into account not only how critical an application is, but also the amount of data involved for each application.

Table. Optimizing Restore Order

Source: Gartner (January 2021)

In the example in the table, it’s clear that the restore process for Application No. 2 and Application No. 7 must start sooner than many of the applications with more demanding RTOs, simply because it will take a lot longer to restore all the data. This is relatively simple to work out when you have a handful of applications to restore, but much more challenging when there are hundreds of applications. Modern backup programs can orchestrate the process to schedule the restores in optimum order, including cycling them through the data-cleaning process, so that applications are recovered in time to meet the varying RTOs. In some environments though, even the most carefully orchestrated recovery process will not be able to meet all the RTOs if the backup platform can’t restore and clean data fast enough.
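The scheduling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the application names, RTOs, data volumes and the throughput of 2 TB per hour are all hypothetical (they are not the values from the Gartner table), and each restore stream is assumed to get that throughput independently. The latest possible start time for each application is its RTO minus its estimated restore duration, and restores are started in that order.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    rto_hours: float  # hours after recovery begins by which the app must be online
    data_tb: float    # volume of data to restore, in TB


def latest_start(app, tb_per_hour):
    """Latest hour at which the restore can begin and still finish within the RTO."""
    return app.rto_hours - app.data_tb / tb_per_hour


def restore_order(apps, tb_per_hour):
    """Sort apps so that the one that must start soonest comes first."""
    return sorted(apps, key=lambda app: latest_start(app, tb_per_hour))


# Hypothetical applications, not taken from the table in the article.
apps = [
    App("web-shop", rto_hours=2, data_tb=1),
    App("archive", rto_hours=8, data_tb=14),
    App("crm", rto_hours=4, data_tb=2),
]

for app in restore_order(apps, tb_per_hour=2.0):
    print(app.name, latest_start(app, tb_per_hour=2.0))
```

In this made-up example the archive has the most relaxed RTO, yet its restore has to start first because of the amount of data involved. A negative latest start time would indicate that the RTO cannot be met at the assumed throughput.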

2. Continuous Data Protection – when every second counts

The second key parameter that determines Disaster Recovery processes is the RPO (recovery point objective), which defines how frequently backups should be made so that an outage does not have a significant effect on the firm’s operations. Let’s say that the system carries out backups every thirty minutes. For most firms this is a sufficient interval, but from the perspective of a bank conducting hundreds of transactions, such an RPO is completely unacceptable. That’s why some backup and DR tools provide CDP (Continuous Data Protection), often described as continuous replication. Thanks to this solution, virtual machines can be backed up as often as every 15 seconds or so. With Continuous Data Protection, a system fault costs only a dozen or so seconds of work. An RPO of between one and a few hours is a solution for semi-critical processes.

Semi-critical here means data that is important, but whose loss the firm can live with. Data that is updated only every several hours or less often, meanwhile, should not be as important to the development of the firm. Experts agree, however, that the pressure for IT systems to work continuously without breaks and for data to be constantly available is greater than ever. And all the indications are that this pressure will continue to grow.
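The relationship between backup frequency and RPO can be illustrated with a short Python sketch. It assumes the simplest possible model, in which the worst-case data loss equals the interval between backups; the one-minute RPO for the bank is a hypothetical target chosen for the example, not a figure from any survey.

```python
from datetime import timedelta


def worst_case_loss(backup_interval):
    """A fault just before the next backup loses everything since the previous one."""
    return backup_interval


def meets_rpo(backup_interval, rpo):
    """True if the worst-case loss stays within the recovery point objective."""
    return worst_case_loss(backup_interval) <= rpo


bank_rpo = timedelta(minutes=1)  # hypothetical target for a transaction system

print(meets_rpo(timedelta(minutes=30), bank_rpo))  # False: half-hourly backups
print(meets_rpo(timedelta(seconds=15), bank_rpo))  # True: CDP-style 15 s interval
```

The same check can be run per application to decide which systems need continuous protection and which can be served by hourly or less frequent backups.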

3. Scale out better than scale up

Backup system performance is a key part of a successful restore process. Some providers are slowly moving away from the scale up model that has been used for many years in favor of scale out. The former makes it easy to increase capacity so as to fit more backup copies, but restore speed stays roughly the same. This is because the device’s computing and input/output performance remains the limiting factor.

Meanwhile, instead of adding disks to individual devices in order to increase capacity, scale out systems add devices to the cluster. Thanks to this, every new device improves the computing and input/output performance available for restore operations, and also increases the cluster storage capacity. Improving all critical aspects of backup platform performance in step with capacity has become even more important as more and more functions are added to backup and restore processes.
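A rough back-of-the-envelope calculation shows why this matters for restores. The sketch below assumes idealized linear scaling (every device contributes the same restore throughput) and uses hypothetical figures of 40 TB of data and 2 TB per hour per device; real clusters will fall somewhat short of this.

```python
def restore_hours(data_tb, nodes, tb_per_hour_per_node):
    """Estimated restore time when throughput grows linearly with device count."""
    return data_tb / (nodes * tb_per_hour_per_node)


# Adding disks to one device: more capacity, but still a single device restoring.
print(restore_hours(40, nodes=1, tb_per_hour_per_node=2))  # 20.0

# Adding devices to the cluster: four devices share the same restore.
print(restore_hours(40, nodes=4, tb_per_hour_per_node=2))  # 5.0
```

Under these assumptions, adding devices shortens the restore from 20 hours to 5, whereas adding disks to a single device leaves it unchanged.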

4. Isolated data recovery environment

In the article on How to Reduce the Damage Caused by a Ransomware Attack, we listed a number of factors that can help limit the effects of a ransomware attack. This time we will consider why an isolated environment is important in the event of an attack. An organization can never be 100% sure that its backups are not infected with malicious software. In most cases, firms do not know when the attackers first penetrated their network. As a result, they have no idea how far back they need to go to find a clean backup. What’s more, data restored from backups made several weeks or months ago is of dubious value, especially if it relates to sales or stock control. This is why some experts recommend starting with the assumption that the backup is infected with malware and that it must be cleaned during the data restore process. There are two elements to this solution.

  1. The IT department must create an isolated restore environment (IRE), in which applications and data can be restored while remaining separated from the attackers trying to infect the system. This solution uses various scanners which detect malicious software on the basis of signatures, or by using more advanced methods of behavioral detection using artificial intelligence.
  2. The backup software must have the option to restore to the IRE and to organize the scanning process in order to move applications and data through the various stages of recovery before they are returned to production.
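The two elements above can be sketched as a small pipeline. This is a toy model only: restored data is represented as bytes, the scanner is a naive signature match, and every name here (`signature_scanner`, `recover_in_ire`, the `EVIL_PAYLOAD` marker) is hypothetical and does not correspond to any real backup product or antivirus API.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    CLEAN = "scanned clean, may return to production"
    QUARANTINED = "malware detected, stays in the IRE"


# A scanner takes restored data and returns True if it looks malicious.
Scanner = Callable[[bytes], bool]


def signature_scanner(signatures: List[bytes]) -> Scanner:
    """Build a naive scanner that flags data containing any known signature."""
    def scan(data: bytes) -> bool:
        return any(sig in data for sig in signatures)
    return scan


def recover_in_ire(backups: Dict[str, bytes], scanners: List[Scanner]) -> Dict[str, Stage]:
    """Restore each backup into the isolated environment and scan it there."""
    result = {}
    for name, data in backups.items():
        infected = any(scan(data) for scan in scanners)
        result[name] = Stage.QUARANTINED if infected else Stage.CLEAN
    return result


backups = {"crm": b"customer records", "erp": b"orders EVIL_PAYLOAD"}
scanners = [signature_scanner([b"EVIL_PAYLOAD"])]

for name, stage in recover_in_ire(backups, scanners).items():
    print(name, stage.name)
```

Only items that reach the `CLEAN` stage would be returned to production. A behavioral detector could be added as another entry in the `scanners` list without changing the pipeline itself.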

Finally, we recommend our post on how backups can be more effective than antivirus software for malware.

Text written by: Pawel Maczka, CTO at Storware