
Backup for Structured and Unstructured Data

Data protection requires administrators to consider several important issues. The type of data, its location, and growing capacity requirements are of key importance.

The division of data into structured and unstructured has existed for many years. Interestingly, as early as 1958, computer scientists were showing particular interest in the extraction and classification of unstructured text, although at the time these were purely academic debates. Unstructured data entered the mainstream a dozen or so years ago, when analysts at IDC began to warn of an impending avalanche of unstructured data. Their predictions proved accurate: it is estimated that unstructured data currently accounts for around 80% of all data, and as much as 95% in the case of Big Data sets, with its volume doubling every 18-20 months.

Structured and Unstructured Data

Mohit Aron, founder of Cohesity, compared data to a large iceberg: structured data is the tip protruding above the surface of the water, and the rest is what remains hidden below. Unstructured data is found almost everywhere: in local server rooms, in the public cloud, and on end devices. It has no predefined structure or schema, exists in various formats, and often occurs in a raw, unorganized state, which makes it difficult to manage. The lack of structure and a standardized format also makes it difficult to analyze. Examples of unstructured data include text such as emails, chat messages, and written documents, as well as multimedia content such as images, audio recordings, and videos.

Somewhat in the shadow of unstructured data sits structured data. As the name suggests, it is organized into rows and columns. This structured format allows it to be searched and used quickly and enables high-performance operations. Although structured data represents only the tip of the iceberg, its role in business remains invaluable. It is commonly found in financial documentation in the form of transaction records, stock market data, and financial reports. Structured datasets are crucial for analyzing market trends, assessing investment risk, and facilitating financial modeling. They also play a significant role in healthcare: organized patient documentation, diagnostic reports, and medical histories help ensure continuity of patient care and support medical research. Among e-commerce companies, structured data includes product catalogs, customer purchase histories, and inventory databases. With this information, marketers can implement personalized marketing strategies and better manage customer relationships.

Protecting Unstructured Data

Staying with Mohit Aron's parallel, unstructured data is the invisible part of the iceberg, hiding many surprises. It includes many different types of information, such as Word documents, Excel spreadsheets, PowerPoint presentations, emails, photos, videos, audio files, social media content, logs, sensor data, and IoT data. Unfortunately, the iceberg continues to grow, and it is precisely this avalanche-like growth of data, along with its dispersal, that poses considerable challenges for those responsible for protecting it.

On NAS servers, in addition to valuable resources, there is a lot of unnecessary information, sometimes referred to as “zombie data”. Storing such files reduces system performance and unnecessarily generates costs, which translates into the need for more arrays or wider use of mass storage in the public cloud. According to Komprise, companies spend over 30% of their IT budget on storage.
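
As a rough illustration of how such zombie data might be identified, the Python sketch below walks a mounted NAS share and flags files untouched for several years. The mount point and age threshold are assumptions, and access times are unreliable on filesystems mounted with noatime.

```python
import os
import time

# Hypothetical mount point and age threshold -- adjust for your environment.
NAS_MOUNT = "/mnt/nas-share"
MAX_AGE_YEARS = 3
cutoff = time.time() - MAX_AGE_YEARS * 365 * 24 * 3600

stale = []
for root, _dirs, files in os.walk(NAS_MOUNT):
    for name in files:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # skip files that vanish or are unreadable mid-scan
        # Use the later of access and modification time as "last touched".
        if max(st.st_atime, st.st_mtime) < cutoff:
            stale.append((path, st.st_size))

total_gib = sum(size for _, size in stale) / 1024**3
print(f"{len(stale)} candidate zombie files, ~{total_gib:.1f} GiB reclaimable")
```

A report like this is only a starting point: candidates still need review against retention policies before anything is deleted or moved to tape.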

Unnecessary files should be destroyed or, where regulations require their retention, archived, for example on tape. This has never been an easy task, and with the boom in artificial intelligence it has become even more difficult: organizations are collecting more and more data on the assumption that it may prove useful for training and improving AI models.

It should also be borne in mind that unstructured data sometimes contains sensitive information, for example about health, or information that allows specific individuals to be identified. Because of the loose format, finding such information is more labor-intensive than with structured data. Nevertheless, an organization must know what its files contain so that it can locate this information quickly when necessary, as the sketch below illustrates.
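
As a minimal sketch of such discovery, the snippet below scans exported text files for a couple of illustrative patterns. The directory and the regular expressions are assumptions; real PII discovery requires far more robust rules and dedicated tooling.

```python
import re
from pathlib import Path

# Illustrative patterns only -- production PII scanners are far more thorough.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "card_like": re.compile(r"\b(?:\d[ -]?){15}\d\b"),  # 16-digit sequences
}

def scan_file(path: Path) -> dict:
    """Count pattern hits in one text file, ignoring undecodable bytes."""
    text = path.read_text(errors="ignore")
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

# Hypothetical directory of exported unstructured documents.
for path in Path("/data/exports").rglob("*.txt"):
    hits = scan_file(path)
    if any(hits.values()):
        print(path, hits)
```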

A separate issue is the progressive adoption of the SaaS model. Here, service providers do not guarantee full protection of the data processed by cloud applications, so service users must invest in dedicated tools to protect SaaS data. As one might guess, vendors provide solutions for the most popular products, such as Microsoft 365. But according to the “State of SaaSOps 2023” report, the average company used 130 cloud applications last year. It is easy to imagine the chaos, and therefore the costs, if an organization had to implement a separate tool for even half of the SaaS applications it uses.

Protecting Structured Data

At first glance, everything seems simple, but the devil is in the details. The choice of an appropriate methodology usually depends on two factors: the required backup frequency and the amount of data to be protected. Regarding the first, critical databases typically require multiple backups per day, while for less critical ones a backup performed every 24 hours, or even once a week, may suffice.
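
A minimal sketch of how such tiering might be encoded; the tier names and intervals below are assumptions for illustration, not a standard.

```python
from datetime import datetime, timedelta

# Hypothetical criticality tiers mapped to backup intervals.
BACKUP_INTERVALS = {
    "critical": timedelta(hours=4),   # several backups per day
    "standard": timedelta(days=1),    # one backup every 24 hours
    "archive": timedelta(weeks=1),    # weekly backup may suffice
}

def backup_due(tier: str, last_backup: datetime, now: datetime) -> bool:
    """Return True when the tier's interval has elapsed since the last backup."""
    return now - last_backup >= BACKUP_INTERVALS[tier]

# A critical database backed up at 08:00 is due again by 13:00.
print(backup_due("critical", datetime(2024, 6, 1, 8), datetime(2024, 6, 1, 13)))  # True
```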

Another issue is the amount of data. The administrator balances three options to avoid overloading the network or filling up server disks. The most common method involves creating a full copy of the entire database, including all data files, database objects, and system metadata. In case of loss or damage, a full backup allows for easy restoration, providing comprehensive protection. This method has two drawbacks: it generates large files, and both creating the copies and restoring the database after a failure take a considerable amount of time.
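
To make the idea concrete, the sketch below uses Python's built-in sqlite3 backup API to produce a complete, self-contained copy of a database; the file names are placeholders, and a production system would use its engine's native backup tooling.

```python
import sqlite3

# Hypothetical paths for this example.
SOURCE_DB = "app.db"
BACKUP_DB = "app-full-backup.db"

src = sqlite3.connect(SOURCE_DB)
dst = sqlite3.connect(BACKUP_DB)
try:
    # backup() copies every page of the live database -- the essence of a
    # full backup: the result restores on its own, with no other files needed.
    src.backup(dst)
finally:
    src.close()
    dst.close()
```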

Therefore, for backing up large databases, the incremental option seems better. This method saves only the changes made since the most recent backup, whether full or incremental. It requires little disk space and is fast compared to creating full backups. However, recovery is more complex, because it requires the full backup plus the chain of subsequent incremental backups.
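
A toy file-level sketch of the incremental idea: copy only what changed since the previous run. The directory layout and timestamp file are assumptions; real database engines track changes internally rather than by file modification time.

```python
import shutil
from pathlib import Path

# Hypothetical layout: data files in DATA_DIR, increments written to INC_DIR.
DATA_DIR = Path("/var/lib/appdata")
INC_DIR = Path("/backups/incremental")
STAMP = Path("/backups/last-backup.stamp")

# On the first run there is no stamp, so everything is copied (the baseline).
last_backup = STAMP.stat().st_mtime if STAMP.exists() else 0.0

for src in DATA_DIR.rglob("*"):
    if src.is_file() and src.stat().st_mtime > last_backup:
        dst = INC_DIR / src.relative_to(DATA_DIR)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 preserves timestamps

STAMP.touch()  # record this run as the new baseline
```

Note the restore-side cost the paragraph mentions: rebuilding the current state requires the full baseline plus every increment after it, applied in order.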

Another option is transaction log backup. This process records all changes made to the database since the last transaction log backup. It allows the database to be restored to the exact moment before a problem occurred, minimizing data loss. Its disadvantage is that the backup copies are relatively difficult to manage, and restoration requires a full backup plus an unbroken chain of transaction log backups.
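
The point-in-time idea can be shown with a toy replay: apply logged changes on top of a full-backup state, stopping just before the failure. The one-JSON-object-per-line log format here is invented purely for illustration.

```python
import json
from datetime import datetime

def restore_point_in_time(base_state: dict, log_path: str, stop_at: datetime) -> dict:
    """Replay logged changes onto a full-backup state, up to but not past stop_at."""
    state = dict(base_state)  # start from the full backup
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)  # e.g. {"ts": "...", "key": "...", "value": ...}
            if datetime.fromisoformat(entry["ts"]) >= stop_at:
                break  # stop just before the moment the problem occurred
            state[entry["key"]] = entry["value"]
    return state
```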

Nowadays, when everything needs to be available on demand, companies are moving away from archaic methods that require shutting down the database engine during backup. Modern solutions can back up all of a database's files, including tablespaces, partitions, the main database, transaction logs, and other files related to the instance, without shutting down the database engine.
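
One real-world example of such a hot backup is PostgreSQL's pg_basebackup utility, which copies a running cluster while streaming the write-ahead log alongside the data so the copy is self-consistent. The sketch below merely wraps it; the host, user, and target directory are placeholders.

```python
import subprocess

# Hypothetical connection details and target directory.
cmd = [
    "pg_basebackup",
    "--host", "db.example.internal",
    "--username", "backup_user",
    "--pgdata", "/backups/pg/base-2024-06-01",
    "--format", "tar",         # one tar archive per tablespace
    "--wal-method", "stream",  # stream WAL so the copy needs no extra files
    "--gzip",
    "--progress",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the backup fails
```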

Protecting NoSQL Databases

In recent years, NoSQL databases have grown in popularity. As the name suggests, they do not use Structured Query Language (SQL), the standard for most commercial databases such as Microsoft SQL Server, Oracle, IBM DB2, and MySQL.

The biggest advantages of NoSQL, such as horizontal scalability and high performance, make these databases suitable for web applications and applications handling large amounts of data. However, the same advantages translate into difficulties in protecting them. A typical NoSQL instance supports applications with very large amounts of rapidly changing data, and in such cases a traditional snapshot is not suitable; moreover, if the data is corrupted, the snapshot will simply restore the corrupted data. Another serious problem is that NoSQL databases do not comply with the ACID principles (Atomicity, Consistency, Isolation, Durability) that conventional backup tools assume. As a result, creating an accurate point-in-time backup copy of a NoSQL database is extremely difficult.
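
That said, some NoSQL tools can approximate a consistent moment-in-time copy. As a hedged example, MongoDB's mongodump can record oplog entries that occur during the dump, so that mongorestore with --oplogReplay brings the restored data to a single point in time; the connection string and output path below are placeholders.

```python
import subprocess

# Hypothetical connection string and output directory.
cmd = [
    "mongodump",
    "--uri", "mongodb://backup_user@db.example.internal:27017",
    "--oplog",   # capture operations made during the dump for consistency
    "--gzip",
    "--out", "/backups/mongo/2024-06-01",
]
subprocess.run(cmd, check=True)
```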

Conclusion

Point solutions with disparate interfaces and siloed operations make it impossible to obtain a unified view of the backup infrastructure or to manage all data located on-premises, in public clouds, and at the network edge. There are strong indications that the future of data protection and recovery will be dominated by solutions that consolidate many point products into a single platform managed through one user interface. Customers will increasingly look for systems that offer scalability and support a comprehensive set of workloads, including virtual, physical, and cloud-native applications, traditional and modern databases, and storage.

For those seeking a comprehensive backup and recovery solution for both structured and unstructured data, Storware Backup and Recovery stands out as a top choice. Its versatility goes beyond basic file backups, offering features like agent-based file-level protection for granular control, hot database backups to minimize downtime, and virtual machine support for a holistic data protection strategy. This flexibility ensures your critical business information, whether neatly organized databases or creative multimedia files, is always secured with reliable backups and efficient recovery options.


text written by:

Pawel Maczka, CTO at Storware