Backup & Data Protection for AI Workloads on OpenStack | Complete Guide
Table of contents
- Why OpenStack Has Become the Infrastructure of Choice for Private AI
- What Makes AI Workload Data Protection Different
- Five Backup Requirements for AI Infrastructure on OpenStack
- How Storware Backup and Recovery Addresses Each Requirement
- Supported OpenStack Distributions
- Reference Architecture: Protecting an OpenStack AI Platform
- Compliance Framework: EU AI Act, GDPR, and DORA
- OpenStack Backup Solutions: Key Capability Comparison
- Go Deeper: Articles in This Series
- Frequently Asked Questions
- Protect Your AI Infrastructure — Before the Next Training Job Runs
There is a pattern I have seen play out more than once over the past two years. An enterprise moves its AI workloads off a public hyperscaler — driven by GDPR concerns, the cost of GPU time, or a desire to keep training data under its own jurisdiction. It lands on OpenStack, which is the right call. The private cloud infrastructure is solid. The GPU scheduling works. The training runs start. And then someone asks: how are we backing this up? Silence. Because no one thought about it.
AI infrastructure and backup infrastructure grew up in separate teams, with separate budgets, and separate problem owners. This guide exists to close that gap — technically and operationally. It covers why AI workloads on OpenStack have specific data protection requirements that standard backup approaches do not address, what a production-ready protection architecture looks like, and what you need in place before the first compliance audit arrives.
If you are running AI or ML workloads on OpenStack — or planning to — this is the reference you need before you learn these lessons the expensive way.
Why OpenStack Has Become the Infrastructure of Choice for Private AI
OpenStack is not new. What is new is the reason enterprises are choosing it. For most of its first decade, OpenStack was chosen for cost reasons or ideological commitment to open source. Today, it is being chosen for control.
AI workloads are, by definition, data-intensive. The training datasets that feed large language models, computer vision systems, and predictive analytics pipelines are some of the most strategically sensitive assets an organization holds. Handing them to a public cloud provider — even with contractual data residency commitments — creates a jurisdiction problem that no service-level agreement can fully resolve. The US CLOUD Act means any American-headquartered provider can be compelled to produce data stored on infrastructure it operates, regardless of where that data physically sits. For European enterprises subject to GDPR, DORA, and the approaching EU AI Act enforcement deadlines, this is not an abstract legal concern. It is a board-level risk.
OpenStack addresses this structurally. It is open source, runs on hardware you own, in data centers you control, and under jurisdictions you choose. Its APIs are mature and well documented.
It provides production-ready infrastructure capabilities for GPU-accelerated workloads, including PCI passthrough, SR-IOV, and support for NVIDIA vGPU and MIG configurations. Recent releases such as Caracal continue to improve GPU scheduling and workload mobility, including limited live migration scenarios depending on configuration.
Adoption is also visible at scale. The Dawn Supercomputer uses OpenStack as part of its cloud management layer. NVIDIA leverages OpenStack Swift for large-scale data ingestion pipelines. FPT Smart Cloud has built a large OpenStack-based environment for AI and HPC workloads. Meanwhile, the OpenInfra Foundation has established an AI Working Group to further accelerate development in this area.
This is the context in which backup strategy needs to be understood. OpenStack has become infrastructure for some of the most valuable compute workloads in the enterprise. The protection posture needs to match.
→ For a deeper analysis of the sovereign AI driver and what it means for infrastructure decisions, see our article: Why Enterprises Are Running AI on OpenStack Private Cloud — and Why It Changes Your Backup Strategy.
What Makes AI Workload Data Protection Different
The question I hear most often is: “We already back up our VMs. Why is this different?”
Because AI workloads are not general-purpose compute workloads. They combine several characteristics that standard backup architectures are not designed to handle well:
1. Training datasets are not application data
Standard application backup is designed around protecting the state of a running system: the database, the configuration, the user data. AI training datasets are something else. They are large, often immutable at rest but actively read during training, and they may be stored across Cinder volumes, Swift object storage, and Ceph RBD clusters simultaneously. Protecting them requires a strategy that spans storage layers, not just VM snapshots.
More importantly: a training dataset is often the product of months of data engineering work — cleaning, labeling, augmenting, versioning. Losing it is not like losing a database that can be restored from a transaction log. There is no transaction log for a labeled image dataset in most cases. You lose it, you rebuild it from scratch.
2. GPU compute time has a direct monetary cost
When a training job fails midway through a 72-hour run on a cluster of A100 instances, the cost is not just the lost output. It is the GPU time already consumed. High-end GPU nodes cost hundreds of dollars per hour to operate. A failed job that cannot be checkpointed and resumed represents burned capital with no recovery path. Backup architecture for AI infrastructure must include checkpoint protection — the ability to recover a training job’s intermediate state, not just the final model artifact.
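The checkpoint-and-resume pattern described above can be sketched in a few lines of Python. This is an illustrative model, not Storware functionality: the file path, the checkpoint format, and the `train` loop are hypothetical stand-ins for a real framework's checkpointing (for example, `torch.save` in PyTorch). The key point is that the checkpoint file is the artifact your backup policy must capture.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; a real job would write to a Cinder
# volume that the backup policy covers.
CHECKPOINT = os.path.join(tempfile.mkdtemp(), "train_ckpt.json")

def save_checkpoint(step, state):
    """Atomically persist intermediate training state."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename: a crash never leaves a torn file

def load_checkpoint():
    """Return (last completed step, state), or (0, {}) if none exists."""
    if not os.path.exists(CHECKPOINT):
        return 0, {}
    with open(CHECKPOINT) as f:
        data = json.load(f)
    return data["step"], data["state"]

def train(total_steps, checkpoint_every=10):
    step, state = load_checkpoint()       # resume rather than restart from step 0
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step        # stand-in for one real training step
        if step % checkpoint_every == 0:
            save_checkpoint(step, state)  # the file a backup run must capture
    return step, state

final_step, final_state = train(100)
```

If the job dies at step 73, the next invocation of `train` resumes from step 70 instead of step 1, and only the GPU hours between the last checkpoint and the failure are lost.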
3. The regulatory landscape for AI data is different
Standard enterprise compliance frameworks require you to protect business data from loss, unauthorized access, and ransomware. The EU AI Act adds requirements that are specific to AI systems: data lineage documentation, traceability of training data, auditability of model versions. For high-risk AI systems — those operating in regulated domains such as financial services, healthcare, and critical infrastructure — there are specific documentation and retention requirements that translate directly into backup architecture decisions. You need to prove, to an auditor, that you know exactly what data trained your model, when, and that you can recover it.
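One way to make that proof concrete is a content-hashed dataset manifest captured alongside each backup. The sketch below uses a hypothetical manifest format, not one mandated by the EU AI Act or implemented by Storware: it records a SHA-256 hash per dataset file against a model version, so an auditor can verify that an archived dataset is byte-for-byte the data that trained the model.

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def sha256_of(path, chunk=1 << 20):
    """Content hash of one dataset file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(dataset_dir, model_version):
    """Lineage record: which exact files, by content hash, trained this model."""
    files = {}
    for root, _, names in os.walk(dataset_dir):
        for name in sorted(names):
            p = os.path.join(root, name)
            files[os.path.relpath(p, dataset_dir)] = sha256_of(p)
    return {
        "model_version": model_version,
        "created": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }
```

Stored on a WORM-immutable destination next to the backup itself, a manifest like this answers the auditor's question directly: re-hash the restored files and compare.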
4. AI infrastructure is heterogeneous
A production AI platform on OpenStack typically consists of multiple components that need to be protected together: Nova instances running training jobs, Cinder volumes holding datasets, Swift or Ceph object storage holding model artifacts, and — increasingly — Kubernetes clusters running inference workloads alongside the OpenStack VMs. A backup solution that protects VMs but not containers, or containers but not block storage, does not protect the workload. It protects a subset of it.
Five Backup Requirements for AI Infrastructure on OpenStack
Given the above, a complete data protection architecture for AI workloads on OpenStack needs to address five distinct requirements:
Requirement 1: Training dataset protection across storage layers
Training data lives in multiple places. Cinder block volumes attached to Nova instances during active training runs. Swift or S3-compatible object storage for long-term dataset archives. Ceph RBD clusters that may serve both block and object storage simultaneously. Your backup solution needs native integration with all of these — not a generic “mount and copy” approach, but direct integration with the storage APIs to capture consistent point-in-time states without interrupting running jobs.
Requirement 2: VM-level protection for GPU instances
Nova instances with GPU passthrough or vGPU configurations require special handling. Standard snapshot-based approaches may not, depending on configuration, correctly capture the state of GPU-attached VMs, and the large disk images typical of AI training environments (100GB+ QCOW2 images with model weights and intermediate outputs) demand efficient incremental backup via Change Block Tracking (CBT) to keep backup windows manageable. Backing up a 500GB image every night is not a strategy — it is a storage bill.
Requirement 3: Kubernetes and container workload protection
Modern AI platforms are not purely VM-based. Inference workloads, MLOps pipelines, and model-serving endpoints increasingly run in Kubernetes or OpenShift containers. A complete protection strategy covers both the OpenStack VM layer and the container layer with consistent policy management. Managing two separate backup solutions for two infrastructure layers doubles operational overhead and creates gaps at the boundary.
Requirement 4: Immutable, air-gapped backup destinations
AI training infrastructure is a high-value ransomware target. A training cluster with months of labeled data and trained model weights represents enormous value, which makes it an attractive encryption target. Backup copies that are accessible from the production network provide no protection if ransomware reaches the backup server. Air-gapped backup destinations — physically or logically isolated from the production environment — are a hard requirement for AI infrastructure at any meaningful scale.
Requirement 5: Compliance-ready audit trail and retention controls
For organizations subject to the EU AI Act, DORA, or sector-specific regulation (banking, healthcare, critical infrastructure), backup is not just operational insurance. It is a compliance artifact. You need configurable retention policies, WORM (Write Once Read Many) immutability for backup data, full audit logging of who accessed what and when, and the ability to demonstrate that backup data has not been tampered with. These are not nice-to-haves. They are audit checklist items.
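WORM semantics are simple to state but worth seeing precisely. The toy model below is illustrative only; real WORM enforcement lives in the storage layer, not in application code. It captures the two rules an auditor checks: an object can never be overwritten, and it can never be deleted before its retention lock expires.

```python
from datetime import datetime, timedelta, timezone

class WormStore:
    """Minimal model of Write Once Read Many semantics (illustrative only)."""

    def __init__(self, retention_days):
        self.retention = timedelta(days=retention_days)
        self._objects = {}  # name -> (payload, locked_until)

    def write(self, name, payload, now=None):
        now = now or datetime.now(timezone.utc)
        if name in self._objects:
            # Rule 1: a WORM object can never be modified after it is written.
            raise PermissionError(f"{name}: WORM object cannot be overwritten")
        self._objects[name] = (payload, now + self.retention)

    def delete(self, name, now=None):
        now = now or datetime.now(timezone.utc)
        _, locked_until = self._objects[name]
        if now < locked_until:
            # Rule 2: deletion is refused until the retention lock expires.
            raise PermissionError(f"{name}: retention lock active until {locked_until}")
        del self._objects[name]

store = WormStore(retention_days=30)
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
store.write("backup-001", b"...", now=t0)
```

Any backup copy a ransomware operator (or a careless admin) cannot modify or delete is a backup copy you can put in front of an auditor.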
How Storware Backup and Recovery Addresses Each Requirement
Storware Backup and Recovery is an enterprise-grade, agentless data protection platform that has supported OpenStack natively since 2019 — before most of the current wave of interest in OpenStack-as-AI-infrastructure. The architecture was designed for heterogeneous environments typical of OpenStack deployments, which is precisely what AI infrastructure on OpenStack represents.
Native OpenStack API integration — no agents on compute nodes
Storware uses the OpenStack API surface directly: Nova for instance management, Cinder for block volume backup, Glance for image handling, and native Ceph RBD integration for direct storage-layer protection. There are no agents deployed on hypervisor nodes or inside VMs — this matters for GPU instances, where agent overhead introduces latency and complexity that has no place in a high-throughput training environment.
Backup operations use the Libvirt strategy for direct KVM hypervisor interaction and the disk attachment method using Cinder volumes, providing full and incremental backups for both QCOW2 and RAW disk images. The native Ceph RBD integration uses snapshot differencing — meaning incremental backups capture only changed blocks, not the full disk image. For a 500GB training volume where 20GB changed in the last 24 hours, you back up 20GB. Not 500GB.
Change Block Tracking for large-volume AI datasets
CBT (Change Block Tracking) is the mechanism that makes daily incremental backup of large AI volumes operationally viable. Storware implements CBT natively for OpenStack environments, tracking block-level changes between backup cycles. Combined with synthetic full backup generation — which constructs a full backup image on the destination without transferring the full dataset again — this provides complete recoverability with a fraction of the storage and network overhead of traditional full backups.
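The mechanics can be sketched with a toy volume modeled as numbered blocks. Names and sizes here are illustrative, not Storware internals; the point is the flow: CBT selects only the blocks flagged as changed for the increment, and the synthetic full layers increments over the last full on the destination side, without re-reading the source.

```python
def incremental_backup(volume, changed_blocks):
    """CBT-style increment: ship only blocks changed since the last cycle.
    `volume` is modeled as a dict of block_index -> bytes."""
    return {i: volume[i] for i in changed_blocks}

def synthetic_full(base_full, increments):
    """Reconstruct a full restore point on the destination by layering
    increments (oldest to newest) over the last full backup."""
    image = dict(base_full)
    for inc in increments:
        image.update(inc)
    return image

# Day 0: full backup of a 5-block volume.
volume = {i: f"block{i}-v0".encode() for i in range(5)}
full = dict(volume)

# Day 1: the training job rewrites blocks 1 and 3; CBT records just those.
volume[1], volume[3] = b"block1-v1", b"block3-v1"
inc1 = incremental_backup(volume, changed_blocks={1, 3})

restore_point = synthetic_full(full, [inc1])
assert restore_point == volume  # identical to production, at a fraction of the transfer
```

Two of five blocks crossed the network on day 1, yet the destination can serve a complete, current restore point.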
Unified protection for VMs and Kubernetes containers
Storware protects both OpenStack instances and Kubernetes/OpenShift container workloads under a single policy framework. This is particularly relevant for organizations running hybrid AI architectures — for example, GPU Nova instances for training jobs and Kubernetes pods for model-serving inference endpoints. One platform, one console, one retention policy, one compliance report.
IsoLayer air-gap protection
IsoLayer is Storware’s air-gap backup mechanism. It provides a logically isolated backup destination that is not accessible from the production network during normal operations, significantly reducing the ransomware attack surface. For AI infrastructure — which represents concentrated, high-value data — air-gap protection is not optional. IsoLayer integrates with the existing backup destination stack without requiring additional hardware or complex network segmentation.
WORM immutability, AES encryption, and full audit logging
Immutable backup destinations prevent modification or deletion of backup data once written — a hard requirement for EU AI Act data lineage compliance and DORA operational resilience documentation. Storware supports WORM-configured backup destinations with configurable retention locks. All backup data can be encrypted at rest using AES. Access is governed by RBAC (Role-Based Access Control) with full audit logging of all administrative actions. Multi-factor authentication via Keycloak is supported natively.
OpenStack Horizon / Skyline integration
Storware provides an OpenStack Horizon plugin — with support for Skyline integration — meaning backup management is accessible directly from the OpenStack dashboard. For cloud platform teams who manage their infrastructure through Horizon, this eliminates the context-switch to a separate management console and enables self-service backup policy management by OpenStack project owners, without requiring them to access the Storware management portal directly.
→ For the technical architecture deep dive, including how to configure CBT and Ceph RBD backup for GPU workloads specifically, see: How to Protect GPU Workloads on OpenStack: Backup Architecture for AI Training Infrastructure.
Supported OpenStack Distributions
OpenStack is not a single product — it is a framework implemented by multiple distributions, each with different deployment tooling, support models, and release cadences. Storware supports the full range of production OpenStack distributions used in enterprise AI deployments:
| Distribution | Deployment Model | Notes |
|---|---|---|
| OpenStack (Vanilla / Upstream) | Self-managed / community | Full support including Caracal (2024.1) and Dalmatian (2024.2) releases |
| Red Hat OpenStack Platform (RHOSP) | Enterprise / supported | Native integration; commonly used in regulated enterprise environments |
| Canonical OpenStack (Charmed OpenStack) | Enterprise / Ubuntu-based | Ubuntu 22.04 through 2025.1; KVM hypervisor and Ceph RBD native integration |
| Virtuozzo Hybrid Infrastructure (VHI) | Managed / OpenStack-compatible | Supported as OpenStack-compatible platform |
| OpenMetal | Hosted private cloud | Recommended as the preferred backup solution by the OpenMetal community |
| Platform9 Private Cloud Director | Managed / SaaS-delivered OpenStack | Supported as of Storware 7.5 |
| Sardina FishOS | Managed / OpenStack-compatible | Supported as OpenStack-compatible platform |
Storware’s single-license model covers all supported OpenStack distributions. There is no separate SKU for Red Hat OpenStack versus Canonical OpenStack versus vanilla upstream. One license covers your entire OpenStack footprint, regardless of which distribution it runs.
Reference Architecture: Protecting an OpenStack AI Platform
A production AI platform on OpenStack typically has four layers that require backup coverage. The following table maps each infrastructure layer to the corresponding Storware protection mechanism:
| Infrastructure Layer | What It Contains | Storware Protection Mechanism | Key Capability |
|---|---|---|---|
| Nova Instances (VMs) | Training environments, inference VMs, GPU-attached instances | Agentless VM backup via OpenStack API / Libvirt | CBT incremental, instant restore |
| Cinder Volumes | Training datasets, model checkpoints, intermediate outputs | Disk attachment method; native Ceph RBD snapshot differencing | Block-level incremental, no agent required |
| Object Storage (Swift / Ceph) | Model artifacts, dataset archives, experiment logs | Ceph RBD direct integration; S3-compatible destination support | Storage-layer protection without VM overhead |
| Kubernetes / OpenShift | Inference pods, MLOps pipelines, model-serving endpoints | Kubernetes/OpenShift container backup | Unified policy with VM layer; consistent retention |
The backup destination layer is independent and configurable. Options include local filesystem (XFS/NFS/ZFS), S3-compatible object storage (including Amazon S3, Impossible Cloud, Google Cloud Storage, Azure Blob), enterprise backup providers (IBM Spectrum Protect, Dell EMC Networker, Dell EMC Avamar), tape, and IsoLayer air-gap. Most AI deployments use a tiered approach: fast local storage for short-term recovery, object storage for long-term retention, and IsoLayer for ransomware-resilient copies.
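A tiering policy like that reduces to a simple age-based rule. The sketch below is illustrative; the thresholds and destination names are assumptions for the example, not Storware defaults:

```python
def destination_tiers(age_days):
    """Which destinations should hold a copy of a backup of a given age."""
    tiers = ["isolayer-airgap"]          # every backup gets an air-gapped copy
    if age_days <= 365:
        tiers.insert(0, "s3-archive")    # long-term retention tier
    if age_days <= 7:
        tiers.insert(0, "local-xfs")     # short-term, instant-restore tier
    return tiers
```

Recent backups sit on all three tiers for fast recovery; older ones age out of the fast tier, then out of object storage, while the air-gapped copy persists for the full retention window.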
Compliance Framework: EU AI Act, GDPR, and DORA
European enterprises running AI workloads on OpenStack face a converging set of regulatory requirements that directly affect backup architecture. Understanding which regulation imposes which specific requirement prevents the common mistake of designing a backup system that covers operational recovery but fails a compliance audit.
→ For the full regulatory analysis and specific configuration requirements, see: EU AI Act, GDPR, and DORA: What OpenStack Operators Must Do Before August 2026.
Regulatory requirements mapped to backup capabilities
| Regulation | Relevant Requirement | Backup Capability Required | Storware Feature |
|---|---|---|---|
| EU AI Act (enforcement Aug 2026) | Training data traceability and documentation for high-risk AI systems | Immutable backup with version history; data lineage support | WORM backup destinations, policy-based retention |
| GDPR | Data-at-rest protection; access controls; right to erasure compliance | AES encryption, RBAC, audit logging, configurable retention | AES encryption, RBAC, full audit log, retention policies |
| DORA (in force Jan 2025) | Tested failover; defined RPO/RTO; ICT risk register; third-party dependency documentation | Recovery plans with schedulable testing; RPO/RTO configuration | Recovery Plans, schedulable DR testing, SLA-based policy management |
| NIS2 | Incident reporting; business continuity measures for operators of essential services | Rapid recovery capability; SIEM integration for backup events | Instant restore, external SIEM support via audit log API |
The combination of WORM-immutable backup destinations, full audit logging, Keycloak MFA, and policy-based retention covers the core compliance requirements across all four frameworks. The key is configuration: these capabilities exist in Storware out of the box, but they need to be deliberately enabled and tested. Compliance is not a feature you turn on — it is a posture you demonstrate.
OpenStack Backup Solutions: Key Capability Comparison
Several solutions exist for OpenStack backup. The following comparison focuses on the capabilities most relevant to AI workload protection:
| Capability | Storware Backup and Recovery | Generic Snapshot Approach | Legacy Enterprise Backup |
|---|---|---|---|
| Agentless OpenStack backup | ✓ Native API integration | Partial (snapshot only) | Typically requires agents |
| Change Block Tracking (CBT) | ✓ Native | ✗ Full snapshots only | Varies by platform |
| Ceph RBD direct integration | ✓ Snapshot differencing | Partial | Rarely native |
| Kubernetes/OpenShift protection | ✓ Unified with VM policy | ✗ VM layer only | Separate product typically required |
| IsoLayer air-gap | ✓ Native | ✗ | Rarely native |
| WORM immutability | ✓ Configurable | ✗ | Varies |
| Horizon / Skyline plugin | ✓ Native | ✗ | ✗ |
| Multi-distribution OpenStack support | ✓ All major distributions | Distribution-specific | Limited |
| Single license for all sources | ✓ | N/A | Typically per-source pricing |
Go Deeper: Articles in This Series
This pillar page covers the full landscape of AI workload backup on OpenStack. Each article in the series goes deeper on a specific dimension:
- Why Enterprises Are Running AI on OpenStack — and Why It Changes Your Backup Strategy
  For CIOs evaluating private AI infrastructure: the sovereign AI driver, EU AI Act implications, and why the infrastructure choice determines your compliance posture.
- How to Protect GPU Workloads on OpenStack: Backup Architecture for AI Training Infrastructure
  For IT architects: technical deep dive into Nova GPU instance backup, CBT configuration, Ceph RBD snapshot differencing, and checkpoint protection for long-running training jobs.
- The Hidden Cost of Unprotected AI Infrastructure
  For infrastructure leaders with budget responsibility: GPU hour loss calculations, training data reconstruction costs, and the full financial case for AI-specific backup investment.
- Backing Up OpenStack + Kubernetes Hybrid AI/ML Infrastructure
  For DevOps and cloud engineers: protecting VMs and containers under a single policy framework, handling the VM-container boundary in AI platforms, and operational patterns for hybrid infrastructure.
- EU AI Act, GDPR, and DORA: What OpenStack Operators Must Do Before August 2026
  For compliance officers and IT directors in regulated industries: specific configuration requirements for audit-readiness, data lineage, and operational resilience under EU regulation.
Frequently Asked Questions
What makes backing up AI workloads on OpenStack different from standard VM backup?
AI workloads introduce several differences that standard VM backup architectures are not designed for. First, the data assets are heterogeneous: training datasets may span Cinder block volumes, Ceph RBD storage, and Swift object storage simultaneously — protecting only the VM disk misses the majority of valuable data. Second, GPU-attached Nova instances require specific handling that snapshot-only approaches often do not cover correctly. Third, the regulatory requirements for AI data (EU AI Act traceability, DORA resilience testing) impose specific backup architecture requirements — WORM immutability, audit logging, configurable retention — that go beyond basic operational recovery.
Does Storware support all major OpenStack distributions?
Yes. Storware Backup and Recovery supports OpenStack vanilla (upstream), Red Hat OpenStack Platform, Canonical Charmed OpenStack (Ubuntu 22.04 through 2025.1 and compatible releases), Virtuozzo Hybrid Infrastructure, OpenMetal, Platform9 Private Cloud Director (as of v7.5), and Sardina FishOS, as well as other OpenStack-compatible distributions. A single Storware license covers all supported distributions — there is no per-distribution pricing.
Can Storware back up both OpenStack VMs and Kubernetes containers?
Yes. Storware protects OpenStack instances, Ceph RBD storage, and Kubernetes/OpenShift container workloads under a unified policy framework. For AI platforms where training runs on GPU Nova instances and inference runs in Kubernetes pods, this means a single backup platform, single retention policy, and single compliance reporting surface for both infrastructure layers.
What is IsoLayer and why does it matter for AI infrastructure?
IsoLayer is Storware’s air-gap backup protection mechanism. It provides a backup destination that is logically isolated from the production network, meaning ransomware that reaches the production environment cannot encrypt or destroy the backup copies. AI training environments are high-value ransomware targets — months of labeled training data and trained model weights concentrated in one place. Air-gap protection is the primary defense against backup destruction in a ransomware event.
How does Change Block Tracking work for large AI training volumes?
Change Block Tracking (CBT) records which blocks on a disk have changed since the last backup. Instead of backing up an entire 500GB training volume every night, CBT enables Storware to back up only the blocks that actually changed — which, for a dataset that is being read during training but not written, may be a small fraction of the total size. Combined with synthetic full backup generation (which reconstructs a full backup image on the destination without re-transferring unchanged data), CBT makes daily backup of large AI volumes operationally and economically viable.
What does DORA require from backup architecture for AI infrastructure?
DORA (Digital Operational Resilience Act), which entered full enforcement in January 2025, requires financial entities — and their critical ICT service providers — to maintain tested failover capabilities, defined and documented Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), and accurate registers of ICT dependencies. For organizations running AI workloads in financial services contexts, this means every GPU cluster, training pipeline, and inference endpoint must be backed up against documented SLAs, with regular recovery tests that produce audit evidence. Storware’s Recovery Plans feature supports schedulable DR testing, and the audit log provides the evidence trail DORA requires.
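The RPO half of that obligation reduces to a check any monitoring job can run: the time since the last successful backup is the worst-case data loss right now, and it must stay under the documented RPO. A minimal sketch (the function name and the four-hour RPO are illustrative):

```python
from datetime import datetime, timedelta, timezone

def rpo_compliant(last_backup_completed, rpo_hours, now=None):
    """True if worst-case data loss (time since last successful backup)
    is within the documented Recovery Point Objective."""
    now = now or datetime.now(timezone.utc)
    return (now - last_backup_completed) <= timedelta(hours=rpo_hours)

# Fixed "now" so the example is deterministic.
now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
ok = rpo_compliant(now - timedelta(hours=3), rpo_hours=4, now=now)
stale = rpo_compliant(now - timedelta(hours=6), rpo_hours=4, now=now)
```

A check like this, fed from backup job completion events and wired into alerting, turns the documented RPO from a statement in a risk register into something continuously verified, which is the evidence posture DORA audits look for.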
How does Storware handle backup for OpenStack environments with Ceph storage?
Storware integrates directly with Ceph RBD at the storage layer, using snapshot differencing to capture block-level changes without requiring agent installation on Ceph nodes or VM guests. This approach is significantly more efficient than image-level backup for large volumes typical in AI environments. The native Ceph integration also supports the Libvirt backup strategy for KVM hypervisors, providing two complementary protection paths depending on the specific workload and recovery requirements.
Is Storware a member of the OpenInfra Foundation?
Yes. Storware is a member of the OpenInfra Foundation, the organization that stewards the OpenStack project and the broader open infrastructure ecosystem. This membership reflects both our technical commitment to the OpenStack community and our position as a native OpenStack data protection partner — not a solution that was adapted for OpenStack after the fact.
Protect Your AI Infrastructure — Before the Next Training Job Runs
The time to design your backup architecture for AI workloads on OpenStack is not after a production incident. GPU compute time does not come back. Labeled training datasets do not rebuild themselves. And EU AI Act audits do not wait for you to finish your compliance roadmap.
If you are running AI workloads on OpenStack today — or planning to — the right conversation starts with understanding your specific infrastructure topology and what it will take to protect it. That conversation takes 30 minutes and produces an architecture recommendation, not a sales pitch.
Book a technical consultation with a Storware architect →
Or start a 60-day free trial and connect Storware to your OpenStack environment today. No agents to install. No production impact. Results within hours.
