
The Hidden Cost of Unprotected AI Infrastructure

There is a line item that does not exist in most IT budgets: Lost AI Training Compute.

It is not there because nobody has lost a 96-hour training run yet. Or rather, because when it happens, it gets absorbed into “infrastructure incident,” “compute overage,” or — my personal favourite — “lessons learned.” The actual cost lands somewhere between a finance report and a postmortem document and then disappears.

This article is an attempt to make that cost visible before it happens to you. Not in abstract risk language. In numbers. Because CFOs understand numbers, and the numbers for unprotected AI infrastructure on OpenStack are genuinely alarming once you run them through to their conclusion.

Spoiler: the backup is not the expensive part. It never was.

The Invisible Ledger of AI Infrastructure Loss

When a production database goes down and data is lost, the cost is visible and immediate. Transactions fail. Customers complain. Revenue is traceable. The incident is documented with a financial impact figure attached.

When an AI training job loses its data, the loss is slower and subtler. The training run fails. The team restarts from the last good checkpoint — if one exists. If not, they restart from scratch. The GPU cluster continues running. The salaries continue accruing. The lost compute time gets folded into the next budget cycle as slightly higher-than-expected infrastructure spend. Nobody writes a headline about it.

This is dangerous. The absence of a visible loss event does not mean the cost is zero. It means the cost is distributed across four categories that rarely appear on the same spreadsheet:

  • Direct compute loss — GPU hours that produced work now destroyed
  • Reconstruction cost — the human and compute cost of rebuilding what was lost
  • Opportunity cost — delayed model deployment, delayed product launch, delayed competitive advantage
  • Compliance exposure — fines and remediation costs if the loss involves regulated data or regulated AI systems

Let us run each category through actual numbers.

Category 1: Direct Compute Loss — The GPU Math

GPU compute has a market price. On-demand H100 pricing varies significantly across hyperscalers and specialist GPU providers, depending on provider, region, availability model, and configuration. For budgeting purposes, planning on $3–6 per H100-hour for on-demand capacity is reasonable, with boutique specialized clouds at the lower end and hyperscalers at the high end of that range. On private OpenStack infrastructure, the equivalent cost is the amortized capex of the GPU hardware plus operational overhead — but the compute time is no less real and no less lost.

Run the scenario:

A financial services team is training a risk assessment model on an 8-GPU H100 cluster on their private OpenStack infrastructure. The training run is expected to take 72 hours. At hour 58, a Ceph storage node fails. The checkpoint mechanism was writing to a Cinder volume that was also on the affected Ceph pool. The last recoverable checkpoint is from hour 12. The team loses 46 hours of training progress.

The direct compute cost of those 46 hours: 8 GPUs × 46 hours = 368 GPU-hours. On equivalent cloud pricing at a conservative $3 per GPU-hour: $1,104 in destroyed compute time. On a private cluster where the total hardware acquisition cost was €1.2M over a 3-year depreciation cycle, the cost per GPU-hour may work out lower in absolute terms — but the amortized capex loss is still real, and it still appears nowhere in the incident report.
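
To keep this arithmetic on hand rather than rebuilding it in a spreadsheet each time, here is a minimal Python sketch of the scenario above. The $3 rate is the conservative cloud figure from the text; the amortization helper is a simplifying assumption (straight-line depreciation) to populate with your own capex and utilization numbers.

```python
def lost_compute_cost(gpus: int, hours_lost: float, rate_per_gpu_hour: float) -> float:
    """Direct cost of destroyed training progress."""
    return gpus * hours_lost * rate_per_gpu_hour

def amortized_gpu_hour_rate(capex: float, gpu_count: int, years: float,
                            utilization: float = 1.0) -> float:
    """Straight-line capex per GPU-hour on a private cluster.

    Idle hours still depreciate, so lower utilization raises the
    effective rate. All inputs are environment-specific assumptions.
    """
    usable_hours = years * 365 * 24 * utilization
    return capex / (gpu_count * usable_hours)

# Scenario above: 8x H100, last good checkpoint at hour 12, failure at hour 58.
print(lost_compute_cost(gpus=8, hours_lost=58 - 12, rate_per_gpu_hour=3.0))
# -> 1104.0  ($1,104 at a conservative $3/GPU-hour)
```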

That is the small scenario. Large-scale language model training and fine-tuning workloads on multi-GPU H100 infrastructure can consume thousands of GPU-hours depending on model size, optimization strategy, and training objective. At cloud reference rates, that is $10,000 to $50,000 per full training run. If you lose the final checkpoint of a 600-hour run and have to restart from hour 200, you have just destroyed the equivalent of several months of an ML engineer’s salary in compute time. That is not a minor incident. It is a budget event that should have a number attached to it.

Here is the calculation I recommend keeping on hand:

Cluster Size | Training Duration | Lost Progress (worst case) | Compute Cost Destroyed*
4x A100 | 48 hours | 36 hours (75%) | ~$270 – $450
8x H100 | 72 hours | 60 hours (83%) | ~$1,440 – $2,880
8x H100 | 500 hours | Full run | ~$12,000 – $24,000
32x H100 | 1,000 hours | Full run | ~$96,000 – $192,000

*Illustrative estimates only. Real-world GPU infrastructure costs vary significantly depending on provider pricing, reserved capacity, infrastructure utilization, optimization efficiency, and workload characteristics.

The uncomfortable observation: the annual cost of a production-ready backup solution for your OpenStack AI infrastructure is almost certainly less than a single row in the middle of that table. The ROI calculation is not complicated. It is just rarely performed in advance.

Category 2: Dataset Reconstruction — The Cost Nobody Budgets

GPU compute is quantifiable because it has a market rate. Dataset reconstruction is harder to price, which is exactly why it never appears in pre-incident risk assessments.

Let me give you a framework.

A training dataset is not a file. It is the output of a data pipeline that includes: raw data acquisition (purchased, scraped, or generated), cleaning and deduplication, domain-expert annotation, quality assurance review, augmentation and preprocessing, and versioning and governance documentation. Each stage has a cost in human time. The annotation stage alone — where human labelers apply labels to raw data — has well-documented market pricing.

Simple labeling tasks such as basic bounding boxes generally start around $0.03–$0.05 per label. Complex tasks like semantic segmentation can cost significantly more per annotation, depending on annotator skill requirements and QA standards. For a computer vision dataset with 500,000 images, each annotated with multiple bounding boxes, annotation costs alone can reach hundreds of thousands of dollars. That is before you account for the data engineering time to build the pipeline, the domain experts who reviewed ambiguous cases, and the QA cycles that achieved acceptable inter-annotator agreement.
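
To make the order of magnitude concrete, here is the same arithmetic as a short sketch. The per-label rates come from the range above; the boxes-per-image figure and the QA overhead multiplier are illustrative assumptions.

```python
# Illustrative reconstruction estimate for the dataset described above.
images = 500_000
boxes_per_image = 5                  # assumed average
rate_low, rate_high = 0.03, 0.05     # $/label, simple bounding boxes
qa_overhead = 1.3                    # assumed: review and re-annotation cycles

labels = images * boxes_per_image
print(f"annotation alone: ${labels * rate_low * qa_overhead:,.0f}"
      f" - ${labels * rate_high * qa_overhead:,.0f}")
# -> annotation alone: $97,500 - $162,500
# Pipeline engineering, domain-expert review, and governance come on top.
```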

Here is the part that genuinely surprises CIOs: data annotation cannot be rushed. Quality annotation throughput rarely scales linearly with spend or staffing, particularly for specialist or domain-sensitive datasets. A medical imaging dataset that required months of annotation effort may need a similarly long reconstruction timeline if lost and rebuilt from scratch. The calendar cost is not negotiable.

This means dataset loss is not just a financial cost. It is a project delay measured in months, attached to a product roadmap that was already committed to stakeholders.

My suggestion — and this is the unconventional one — is to treat your training datasets the way software engineering treats source code. Version control, not just backup. A dataset is a living artifact that evolves through cleaning, annotation, augmentation, and correction cycles. Each version is the product of work that cannot be trivially reproduced. Git for code. Immutable, versioned backup with retention history for datasets. The mental model shift matters: a dataset is not “data to be protected.” It is “an artifact with a provenance history that must be preserved.”
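
As a minimal illustration of that mental model (not a replacement for purpose-built dataset versioning tools), a content-addressed manifest makes every dataset version identifiable and tamper-evident. A sketch in Python:

```python
import datetime
import hashlib
import json
import pathlib

def dataset_manifest(root: str) -> dict:
    """Content-addressed snapshot of a dataset directory.

    Two versions with identical manifests are identical datasets; any
    cleaning, correction, or re-annotation pass yields a new version_id.
    """
    files = {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(pathlib.Path(root).rglob("*")) if p.is_file()
    }
    return {
        "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "version_id": hashlib.sha256(
            json.dumps(files, sort_keys=True).encode()).hexdigest()[:16],
        "files": files,
    }
```

Stored alongside each immutable backup copy, the manifest turns the backup into a provenance record rather than just a restorable blob.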

Category 3: Opportunity Cost — The Cost That Never Appears on Any Invoice

This is the category that CFOs understand intuitively and engineers underestimate systematically.

An AI model that is six weeks late to production is not just six weeks late. It is six weeks of unrealized competitive advantage or deferred operational improvement. Six weeks of your competitor’s model in the hands of customers who might have been yours. Six weeks of decisions being made by the old rules-based system the AI was supposed to replace — with whatever error rate that implies in your specific domain.

For a fraud detection model in financial services, six weeks of delayed deployment against a baseline system has a calculable cost in undetected fraud. For a predictive maintenance model in manufacturing, six weeks means six more weeks of unplanned downtime. For a customer churn prediction model in telecoms, six weeks means six more weeks of customers leaving that could have been retained.

None of these costs appear on the incident report. They are counterfactual — they describe what did not happen because the model was not ready. But counterfactual costs are real costs. They are just invisible ones.

The honest framing for a CFO conversation about AI backup investment is not: “What is the cost of losing our training data?” It is: “What is the cost of our AI product being delayed by eight weeks?” That number is usually significantly larger, and it is the number that justifies the backup budget in a single sentence.

Category 4: Compliance Exposure — The Cost That Actually Has Numbers Attached

The previous three categories are operational costs. This one is regulatory. And it is the one that has made data protection a board-level conversation rather than an IT operations discussion.

The EU AI Act enters full enforcement on August 2, 2026, for high-risk AI systems — which covers AI operating in financial services, healthcare, critical infrastructure, employment, and law enforcement contexts. Penalties under the Act reach up to €35 million or 7% of global annual turnover, whichever is higher, for the most serious violations. For a mid-sized European enterprise with €500 million in annual revenue, 7% is €35 million. That is not a line item that fits in a quarterly infrastructure budget.

What does this have to do with backup? More than most people realise.

The EU AI Act’s data governance requirements for high-risk AI systems include: documentation of training data sources and characteristics, evidence that training data met quality criteria, traceability of the data pipeline from raw source to training-ready dataset, and the ability to demonstrate to an auditor that the documented data was actually used. These requirements are difficult to satisfy retroactively through documentation alone. They call for an immutable, auditable record — the kind produced by properly configured backup, retention, and audit systems with immutable storage and comprehensive logging.

In other words: backup infrastructure for AI training data is not just disaster recovery. For organizations subject to the EU AI Act, it is a compliance artifact. The same system that recovers from a storage failure also produces the evidence trail that satisfies a regulatory audit. The backup is doing two jobs simultaneously. Its cost should be evaluated against both.

GDPR compounds this. Training datasets that include personal data — which covers most enterprise AI applications — must be processed under conditions where the organization can demonstrate control, access governance, and the ability to honour data subject rights including erasure. An AI training dataset that cannot be located, versioned, or audited is a GDPR liability. Versioned backup infrastructure with retention controls, encryption, and auditable access management can significantly support GDPR governance and accountability requirements.

→ For the full regulatory breakdown and specific configuration requirements, see: EU AI Act, GDPR, and DORA: What OpenStack Operators Must Do Before August 2026.

What the Insurance Actuaries Would Say

Insurance companies price risk by multiplying probability by impact. It is worth applying this framework to AI infrastructure protection, because the numbers produce an interesting result.

Probability of a storage event significant enough to corrupt or destroy training data over a three-year period on a production OpenStack cluster: not zero. Storage failures, operator errors, ransomware, and configuration mistakes are documented categories of events with documented frequencies. Organizations operating production-scale infrastructure should assume that meaningful storage, operational, or recovery-impacting incidents are a matter of when rather than if over multi-year operational periods.

Impact of that event, using the four categories above: direct compute loss (quantified in the table above), reconstruction cost (weeks to months of data engineering work), opportunity cost (delayed product timeline), and compliance exposure (regulatory risk for regulated AI systems).

For a mid-sized enterprise running two or three significant AI training workloads per year on OpenStack, illustrative full-impact estimates for a major AI infrastructure recovery event may range from hundreds of thousands to more than a million euros depending on compute scale, personnel impact, operational delay, and regulatory exposure. That range is wide because it depends heavily on whether the affected model is a high-risk AI system under EU regulation (which dramatically increases the compliance tail).
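
The expected-value arithmetic is short enough to sketch. Every input below is an assumption chosen for illustration; the point is the shape of the comparison, not the specific figures.

```python
# Actuarial framing: expected annual loss vs. annual protection cost.
p_major_incident = 0.10                        # assumed: ~1 event per 10 cluster-years
impact_low, impact_high = 300_000, 1_200_000   # EUR, per the range above
protection_cost = 30_000                       # EUR/year, assumed backup spend

print(f"expected annual loss: EUR {p_major_incident * impact_low:,.0f}"
      f" - {p_major_incident * impact_high:,.0f}")
print(f"annual protection:    EUR {protection_cost:,.0f}")
# Expected loss matches or exceeds the protection cost across the whole
# range, and this ignores the compliance tail entirely.
```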

The annual cost of a properly configured Storware deployment for an OpenStack AI platform is a fraction of the lower bound of that range.

The insurance mathematics are not close. This is not a marginal call. It is one of the clearest risk/cost ratios in enterprise IT, and it is systematically under-evaluated because the risk is invisible until it materialises.

How to Have the Budget Conversation

If you are an IT director or infrastructure lead who needs to justify AI backup investment to a CFO, here is the framework that works:

Do not lead with “we could lose data.” Lead with “what is our AI product’s revenue contribution, and what is the cost of that product being delayed by eight weeks?” Get a number. Then ask: “What are we spending on the infrastructure that could cause that delay?” Then ask: “What are we spending to protect it?”

The gap between the first answer and the third is usually the budget conversation you were trying to have. The protection cost is almost always a small fraction of the at-risk value — which means the argument for investment practically makes itself, once the right numbers are on the table.
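
Put numbers on the three questions and the gap shows itself. All figures below are assumed for illustration.

```python
product_annual_revenue = 4_000_000   # EUR, assumed AI product contribution
delay_weeks = 8
at_risk = product_annual_revenue * delay_weeks / 52   # cost of the delay

protection_spend = 30_000            # EUR/year, assumed
print(f"at-risk value: EUR {at_risk:,.0f}")           # ~EUR 615,385
print(f"protection:    EUR {protection_spend:,.0f}"
      f" ({protection_spend / at_risk:.1%} of at-risk value)")  # ~4.9%
```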

The reason this conversation does not happen enough is that IT teams frame backup as an operational necessity (“we should have this”), rather than a risk management instrument (“here is the expected value of not having this, expressed in money”). CFOs are trained to respond to the latter framing. They may be somewhat immune to the former.

What “Adequate Protection” Actually Means for AI Infrastructure

Given the cost profile above, “adequate protection” for AI infrastructure on OpenStack has a specific shape. It is not the same as adequate protection for general enterprise workloads:

  • Checkpoint-aware backup cadence — short-cycle incremental backup of training checkpoint locations, not just nightly VM snapshots. If a 100-hour training run saves checkpoints every 4 hours, your backup recovery point should be within one checkpoint window (see the sketch after this list).
  • Storage-layer coverage — Ceph RBD backup independent of VM backup. In many OpenStack AI environments, datasets reside primarily in Ceph-backed storage rather than solely inside VM disks. Protect them at the Ceph layer directly.
  • Immutable destinations — WORM-configured backup storage for production model artifacts and training datasets. Commonly implemented in regulated AI environments to support resilience, retention, and tamper-evidence requirements.
  • Air-gap protection — at least one backup copy that is not reachable from the production network. For AI infrastructure specifically, where the data density and value per gigabyte is high, many organizations adopt air-gapped or logically isolated backup strategies to improve ransomware resilience.
  • Tested recovery — not “we ran the backup job.” Tested means “we restored the checkpoint and the training job continued from that point.” These are different statements. Only the second one is evidence of actual recoverability.
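
On the first point, the cadence rule is easy to quantify. A simplified worst-case model, assuming a failure can destroy both the live checkpoint location and any checkpoints not yet backed up:

```python
def worst_case_lost_hours(checkpoint_interval_h: float,
                          backup_interval_h: float) -> float:
    """Worst-case training progress lost when primary storage is destroyed:
    work since the last checkpoint, plus checkpoints written since the
    last completed backup."""
    return checkpoint_interval_h + backup_interval_h

def worst_case_cost(gpus: int, checkpoint_h: float,
                    backup_h: float, rate: float) -> float:
    return gpus * worst_case_lost_hours(checkpoint_h, backup_h) * rate

# 8x H100 at an assumed $3/GPU-hour, checkpoints every 4 hours:
print(worst_case_cost(8, 4, 4, 3.0))    # backup each checkpoint window -> 192.0
print(worst_case_cost(8, 4, 24, 3.0))   # nightly VM snapshots only -> 672.0
```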

Storware Backup and Recovery implements all of the above for OpenStack: agentless incremental backup optimized for OpenStack environments, native Ceph RBD integration at the storage layer, WORM-immutable backup destinations, IsoLayer air-gap protection, and Recovery Plans with schedulable automated testing. It covers both the OpenStack VM layer and Kubernetes container workloads under a single policy framework.

→ For the technical architecture detail on implementing these protections, see: How to Protect GPU Workloads on OpenStack: Backup Architecture for AI Training Infrastructure.

Run the Numbers on Your Environment

The calculation in this article is generic. The calculation for your environment is specific — it depends on your GPU cluster size, your training job durations, your dataset construction investment, and your regulatory classification under the EU AI Act.

A 30-minute technical consultation with a Storware architect will produce a specific protection architecture for your OpenStack AI platform and a cost-benefit analysis that uses your actual numbers rather than industry averages.

Book the consultation →

Or start a 60-day free trial and see what is and is not protected in your current environment before the numbers become real.

text written by:

Paweł Piskorz, Presales Engineer at Storware