Database protection – how to back up and restore databases and applications

Although structural data constitutes a decidedly minor part of company resources, their importance can be invaluable. For this reason, database protection requires particular attention.

Nowadays, data is a more valuable resource than a fleet of cars, although the bosses of many firms still pay more attention to looking after cars than digital resources. But this is gradually starting to change. Cyberattacks, equipment outages, and other chance events that lead to data loss make entrepreneurs more willing than ever to invest in data protection tools.

The vast majority of data – over 80% – is non-structural data: video and text files, emails, and materials in PDF format. At the same time, such resources are indeed tempting morsels for hackers. Equally exciting data can be found in databases, which contain so-called structural data. Although this data does not take up a significant proportion of the terabytes stored on servers, in this case, it is not the quantity but the quality that matters. The lion’s share of information stored in databases is of critical importance for the functioning of organizations. As a result, it requires special protection and the use of suitable backup tools.

Note that application backup is a generic mechanizm in Storware vProtect. Applications such as databases: MS SQL, MySQL/MariaDB, Oracle DB, PostgreSQL, Relax & Recovery, Git repository usually have their own mechanisms that guarantee consistent backups. As we are aware, in many situations you need to have the option to customize the backup process – LEARN MORE.

Database backup specifics

Selection of the correct backup and structural data recovery software requires understanding their unique structure and operations. For several reasons, structural data cannot be archived in the same way as non-structural data.

Firstly, databases are usually stored in data files, which are constantly changing (as long as the database is updating). This means that we cannot simply make backup copies as we do for every other file. Secondly, most databases have a type of log, which can be used to restore transactions from a specific point in time or to withdraw partly completed transactions after an outage.

Thirdly, the possible recovery point objective (RPO) in the typical single-stage recovery of most backup systems – probably over 24 hours – is too long for databases of critical importance. The task is also not made any easier by the existence of at least 13 database models. Knowing how to carry out backup starts from what type of database the backup copy is being made for.

Relational and non-relational databases

One of the most common ways of differentiating between databases in terms of the data structure is the division into relational and non-relational. The former is characterized by ACID (atomicity, consistency, isolation, durability), a group of properties that guarantee correct transaction processing. Examples of relational databases are Oracle, SQL Server, MySQL, and PostgreSQL. They are often called SQL databases due to the query language they use. The relational model ensures high-speed reading and data consistency. The built-in transaction systems are ideal for banking, retail, and personnel files.

The creators of non-relational databases, also known as NoSQL, forgo consistency in favor of write speed. In the relational model, data is presented in the form of two-dimensional tables. The columns contain the table attributes, and the rows the values for the given attributes. A key identifies every row. Non-relational databases, meanwhile, can have a highly varied structure depending on the implementation, e.g., tables, documents, graphs. Most commonly, however, this is a list of key-value objects.

In the case of non-relational models, objects do not have to be homogeneous in structure, which means there is no pre-defined way to save data. The most popular NoSQL databases include DataStax, Apache Cassandra, MongoDB, Amazon DynamoDB, Google BigTable, and Redis. Non-relational databases are useful for high-volume data that will be relatively easy to scale horizontally – in clusters and on many servers.

According to DATAINTELO, in 2019, the global NoSQL database market was 3.5 billion USD and is expected to reach 21 billion USD by 2026. In the years 2020-2026, the yearly rate of increase will be 31%. So why are businesses betting on NoSQL? This type of database provides simple horizontal scaling, which is faster and more efficient than SQL databases. The increase in the need for NoSQL database solutions is driving cloud services and the growing penetration of the Internet of Things. However, it must be emphasized that despite the upward trend in the popularity of non-relational models, relational databases still have a dominating position in the current market, and shortly, this is unlikely to change.

Backup for non-relational databases

For organizations using NoSQL databases, creating a backup is a complex process. As we mentioned earlier, non-relational databases are ideal for Big Data environments. Although command lines can be used to create backup and recovery, CLI interfaces themselves are insufficient to ensure automated, error-free backup creation. In such scenarios, developers must write wrapper scripts for each data store to automate the process of creating a backup on each node, manage the space on each node, and clean up older backup copies that are no longer needed. One of the most significant challenges when implementing NoSQL databases in companies is the availability of comprehensive solutions for backup and recovery.

It is worth going back over a decade to when virtual machines had to deal with a similar problem. At that time, Veeam came to the rescue by releasing special software for backing up virtual machines. However, in the case of NoSQL, part of the difficulty is that every backup provider that uses SAN arrays and storage-based snapshots becomes of no use. Non-relational databases only use standard servers directly connected to mass storage, not SAN systems. What is more, the dynamic character of NoSQL databases means that existing backup solutions are insufficient. For example, adding additional NoSQL compute nodes and removing existing ones means moving data between nodes. When such topology changes occur, the historical backup becomes unimportant as the data distribution in the nodes has changed.

A precursor to non-relational database backup solutions is the American start-up Datos IO. Five years ago, the firm presented the Recover X solution based on their CODR engine (Consistent Orchestrated Distributed Recovery). This works by removing duplicated data blocks from the version store, thus significantly reducing the need to create a full backup on a large scale. Thanks to CODR, versioning, events, notifications, time signatures and alerts are tracked. All this data is used to minimize the amount of replicated data stored in cheaper objects or files. It is worth noting that in 2018, Datos IO was taken over by the company Rubrik.

Consistency models

There are two ways in which databases ensure that the views of inserted or updated database data are the same for all viewers. These are called consistency models and have a crucial influence on backup creation and recovery. The first is immediate consistency, also known as solid consistency, which ensures that all users see identical data simultaneously, irrespective of where they are viewing the data from. Most traditional relational databases follow this model. The second model is an eventually consistent or weakly consistent database, which ensures that a given attribute will ultimately be constant for all viewers, but it may take some time. A perfect example of eventual consistency is the DNS system, which must wait until the DNS record lifetime has expired before updating information about domain names. This process can last up to 72 hours.

Summary

As can be seen, creating database backup requires a good dose of knowledge. Of key importance is the place where the data is stored and its structure and how it operates. It is essential to understand where to store data files, whether to use complex or simple transactions and where to store the log for these transactions. In addition, you need to know how to achieve consistent backup for the stored data and the transaction log.It is also of crucial importance how distributed a database is. Is it divided into partitions but with everything under one host, or is it divided and spread over dozens or hundreds of hosts? If it is the latter, we are most probably dealing with an eventually consistent database. Achieving a consistent snapshot for a database distributed over hundreds of nodes is pretty tricky, let alone restoring such a database. Some may think that an eventually consistent database that uses replication on many nodes does not need to be archived, but it most certainly does. Although it may protect against node failure, it does not protect against human error.

text written by:

Pawel Maczka, CTO at Storware