
Data Protection: The Era of Petabytes is Coming

IDC analysts predict that the volume of data worldwide will reach 175 zettabytes by 2025. Most of it will be unstructured data that requires adequate protection.

Storage system vendors are already fielding inquiries about solutions designed to store exabytes of data, and importantly, such questions are not coming only from hyperscalers. The ongoing technology race keeps producing data in the form of emails, documents, social media posts, and other materials, transforming business communication and operational processes and generating a sea of unstructured information in companies and institutions.

Petabytes Are the New Normal

While small and medium-sized companies still work with terabytes of data, larger organizations increasingly cross the petabyte threshold. Over half of large enterprises manage at least 5 PB of data, 80% of which is unstructured, and 89% of it resides in cloud environments (hybrid, public, and multi-cloud). Data growth itself is no surprise; it has been talked about for years. What is surprising is the pace, driven recently by the Internet of Things, high-performance computing, machine learning, and artificial intelligence.

Protecting petabytes of data is becoming a major challenge. One difficulty with conventional backup systems based on the Network Data Management Protocol (NDMP) is the time it takes to create full backups: the process can take days or, in extreme cases, even weeks because of network overload. NDMP is slow and simply does not hold up at petabyte scale. Another difficulty is the scan that must run before each backup to determine what has changed.
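
To put the backup window in perspective, here is a back-of-the-envelope estimate in Python. The 10 Gbit/s link speed is an illustrative assumption, not a figure from the article, and real-world throughput under contention will be lower:

```python
# Rough estimate of a full-backup window for 1 PB over an ideal,
# uncontended 10 Gbit/s link (assumed values for illustration only).
DATASET_BYTES = 1 * 10**15          # 1 PB (decimal)
LINK_BITS_PER_SEC = 10 * 10**9      # 10 Gbit/s

seconds = DATASET_BYTES * 8 / LINK_BITS_PER_SEC
print(f"{seconds / 86400:.1f} days")  # ~9.3 days for a single full pass
```

Even under these generous assumptions, a single full pass takes more than nine days, which is why full backups over the network break down at this scale.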

Backup Complexity at Scale

Incremental backups are a critical optimization strategy, but at petabyte scale, identifying which files have been modified can be extremely time-consuming and resource-intensive. After the backup completes, most companies, often driven by regulatory requirements, also test it, which adds even more days to the process.
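
As a minimal sketch of what that change-identification step involves, the following assumes a naive modification-time walk over a hypothetical /data tree; enterprise tools typically rely on snapshot diffs or change journals rather than crawling the filesystem:

```python
import time
from pathlib import Path

def changed_since(root: str, last_backup_epoch: float):
    """Yield files modified after the previous backup run.

    A naive mtime walk: at petabyte scale this scan alone can take
    hours, which is why backup products prefer snapshot diffs or
    changed-block tracking over crawling every file.
    """
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime > last_backup_epoch:
            yield path

# Example: list everything changed in the last 24 hours under /data
for f in changed_since("/data", time.time() - 86400):
    print(f)
```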

It may take some time before repositories with storage capacities in the petabyte or exabyte range become commonplace, but for smaller entities, managing even tens of terabytes can pose real challenges. To make matters worse, as the saying goes, “troubles come in pairs.” Rapidly filling disks and tapes are not the only challenges facing storage system providers and their users. Customer demands and IT’s vital role across almost every industry mean that backup and disaster recovery (DR) requirements are evolving quickly. It’s no longer enough to create and encrypt backups. Organizations are focusing on other aspects like continuous data protection (CDP), security and compliance, bare-metal recovery (complete servers with operating systems, files, and configurations), reducing backup windows, and faster file recovery.

Backups Under Scrutiny

Until recently, petabyte-scale backups were uncommon. However, data growth and relatively new trends, such as advanced analytics and AI modeling, are making data increasingly valuable and therefore worth protecting. It is also worth noting the emerging trend of building small language models. As experts rightly point out, a CEO does not need Pink Floyd's discography or descriptions of every Robert De Niro movie, but valuable insights for running the business effectively. Hence the growing interest in smaller language models trained on less data. They are cheaper to run than ChatGPT or Claude and can be deployed on local devices, but building them requires organizations to gather and curate their own data.

Backup is the last line of defense against attacks, sabotage, or hardware failures. For petabyte-scale data sets, even a minor data loss can be catastrophic for a company. However, storage administrators are not defenseless. Data is on their side. Some backup and DR tools provide insights into backup performance, capacity usage, and error trends. Predictive analytics with machine learning can forecast storage needs and potential failures, while reporting dashboards help visualize trends, assess compliance, and streamline recovery planning.
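
As a sketch of the forecasting idea, the snippet below fits a least-squares trend to synthetic capacity readings (invented numbers, not output from any particular product) and projects when a hypothetical 1 PB pool would fill up:

```python
import numpy as np

# Hypothetical daily capacity-used readings (TB) for a backup pool.
days = np.arange(30)
used_tb = 400 + 2.5 * days + np.random.normal(0, 3, 30)  # synthetic data

# Fit a linear trend and project when a 1 PB (1000 TB) pool fills up.
slope, intercept = np.polyfit(days, used_tb, 1)
days_until_full = (1000 - intercept) / slope
print(f"Pool projected to fill in ~{days_until_full:.0f} days")
```

Commercial tools wrap far more sophisticated models around the same principle, but even a simple trend line turns raw capacity data into an early warning.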

The Art of Data Management

A company with a few terabytes of data usually does not place much importance on managing it, because storage costs are low. As digital assets grow, however, managers start to notice the associated costs. The clear rise in unstructured data therefore requires steps that not only reduce costs but also enhance information security. Some vendors have already recognized these new data management needs: in recent years, new product categories such as Data Security Posture Management (DSPM), AI Enablement, and Governance, Risk, and Compliance (GRC) have emerged. These are currently niche products, but their role is expected to grow over time.

The Art of Managing Growing Digital Assets

For now, many companies struggle to answer seemingly simple questions: How many snapshots did you generate last year? How many of them are still in the environment? When did you last access files created five years ago? Organizations approaching the petabyte boundary may start seeking answers to these questions, and they will then see more clearly the savings that rational data management brings. It is worth the effort, and the more data, the greater the savings. This is not only about money spent on new storage devices but also about the losses that follow cyberattacks and the penalties for regulatory non-compliance.

Limiting bad practices that lead to unnecessary data accumulation is the first step to clearing out archives. The second step involves organizing files on appropriate “shelves,” or implementing tiered storage solutions. Categorizing data based on its importance and access frequency allows for optimized storage costs. For example, some data can be moved to cheaper storage for six months. If it turns out within this period that someone frequently accesses these files, they can be returned to a more efficient disk. However, data unused for longer periods—such as 24 months, if not subject to specific archival regulations—can be permanently deleted.
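
A minimal sketch of such a tiering policy, assuming last-access time is a sufficient signal and reusing the illustrative 6- and 24-month thresholds from the example above:

```python
import time
from pathlib import Path

# Illustrative thresholds from the example above; real policies come from
# the organization's retention rules and any archival regulations.
HOT_DAYS = 180      # accessed within ~6 months: keep on fast storage
DELETE_DAYS = 730   # untouched for 24 months: candidate for deletion

def classify(path: Path) -> str:
    """Assign a storage tier based on the file's last access time."""
    idle_days = (time.time() - path.stat().st_atime) / 86400
    if idle_days < HOT_DAYS:
        return "hot"      # stays on primary, high-performance storage
    if idle_days < DELETE_DAYS:
        return "cold"     # move to a cheaper object or tape tier
    return "delete"       # review for permanent deletion

for f in Path("/data").rglob("*"):
    if f.is_file():
        print(classify(f), f)
```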

Cutting Down on Redundant Data

Another way to eliminate unnecessary data is through deduplication and compression. Deduplication removes duplicate copies of repeated data, significantly reducing the amount of data that must be stored and backed up and thus lowering storage costs. There are two types of deduplication: inline, where data is deduplicated "on the fly" before it is written to the device, and post-process, where deduplication occurs after the data has been saved to storage.
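
To make the mechanism concrete, here is a minimal sketch of chunk-level deduplication, assuming fixed-size chunks and a hypothetical backup.img input; each unique chunk is stored once and repeats become references:

```python
import hashlib

def dedup_chunks(stream, chunk_size=4 * 1024 * 1024):
    """Store each 4 MiB chunk once, keyed by its SHA-256 digest.

    Fixed-size chunking keeps the example short; production systems
    usually use variable, content-defined chunking for better ratios.
    """
    store = {}    # digest -> unique chunk bytes
    recipe = []   # ordered digests needed to rebuild the stream
    while chunk := stream.read(chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe

with open("backup.img", "rb") as f:
    unique, recipe = dedup_chunks(f)
print(f"{len(recipe)} chunks referenced, {len(unique)} stored")
```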

Compression, in turn, reduces file sizes. It can be lossless, which is ideal for critical business information because the original data is fully recoverable, or lossy, which shrinks files further by discarding some of the data.
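
A short illustration of the lossless case, using Python's built-in zlib module on a hypothetical report.csv file; the assertion confirms the original data is recovered bit for bit:

```python
import zlib

# Lossless round-trip: the original bytes are fully recoverable, which is
# why this class of compression is safe for business records.
original = open("report.csv", "rb").read()   # hypothetical input file
compressed = zlib.compress(original, level=6)
assert zlib.decompress(compressed) == original

print(f"ratio: {len(original) / len(compressed):.1f}x "
      f"({len(original)} -> {len(compressed)} bytes)")
```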

Managing backups that contain massive data sets on a limited budget is a significant challenge for companies today. However, strategies such as tiered storage, data deduplication, and compression allow organizations to keep storage and backup costs under control.

text written by:

Grzegorz Pytel, Presales Engineer at Storware