
How to Back Up GPU Workloads on OpenStack: AI Training Infrastructure Guide

Let me tell you about a failure mode I have seen more than once.

An infrastructure team sets up a beautiful OpenStack cluster for AI workloads. GPUs, Ceph, the whole stack. Training jobs start running. Models start improving. Everyone is happy. Then, six weeks in, someone asks the backup question — and the answer turns out to be: “We have OpenStack snapshots.”

OpenStack snapshots are not backup. I will explain why in a moment. But the more interesting question is why so many otherwise competent engineering teams make this mistake. The answer is that GPU compute infrastructure was historically the domain of HPC teams who thought about job scheduling and utilization, not data protection. And backup infrastructure was the domain of storage teams who thought about VMware and SQL Server, not QCOW2 disk images with 300GB model checkpoints.

AI infrastructure on OpenStack sits exactly at the intersection of these two worlds, and falls into the gap between them. This article is about closing that gap — with a specific focus on what makes GPU workload protection technically different, and what a production-ready backup architecture actually looks like.

First: Why OpenStack Snapshots Are Not Backup

This point deserves its own section because the misconception is pervasive and the consequences are serious.

OpenStack supports snapshot creation for both Nova instances (via Glance) and Cinder volumes. Snapshots are fast, they are native to the platform, and they create a point-in-time copy of the resource. It is very easy to look at this capability and conclude: “We have backup.”

You do not. Here is why:

Snapshots live in the same failure domain as the source data. A Nova instance snapshot is stored in Glance, which lives on your OpenStack storage infrastructure. If your Ceph cluster loses quorum or a pool is corrupted — both of which happen — the running instance and its snapshots go down together. A snapshot is a copy; it is not a backup unless it is stored independently of the system it protects.

Snapshots capture disk state, not application state. A snapshot of a running training VM is a crash-consistent image — the equivalent of pulling the power cord and copying the disk. For a training job with open file handles, in-flight writes to a checkpoint file, and a PyTorch process mid-epoch, the resulting snapshot may be completely unrecoverable. Application-consistent backup requires coordination: quiescing writes, flushing buffers, then taking the snapshot. OpenStack does not do this for you automatically.
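To make the coordination concrete, here is a minimal sketch of what application-consistent snapshotting involves on a libvirt-backed compute node, assuming the qemu-guest-agent is running inside the instance. The domain and server names are hypothetical, and a production tool would wrap this with error handling and scheduling:

```python
import subprocess

DOMAIN = "instance-0000002a"   # hypothetical libvirt domain of the Nova instance
SERVER = "gpu-train-01"        # hypothetical Nova server name

def run(cmd):
    subprocess.run(cmd, check=True)  # fail loudly on any error

# 1. Quiesce: freeze guest filesystems via qemu-guest-agent so that
#    buffers are flushed and no writes are in flight.
run(["virsh", "domfsfreeze", DOMAIN])
try:
    # 2. Cut the point-in-time copy while writes are paused.
    run(["openstack", "server", "image", "create",
         "--name", f"{SERVER}-consistent-snap", SERVER])
finally:
    # 3. Thaw immediately, even if the snapshot call failed,
    #    so training I/O can resume.
    run(["virsh", "domfsthaw", DOMAIN])
```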

Snapshots are not retained, versioned, or lifecycle-managed. Backup requires retention policies: keep daily for 30 days, keep weekly for 12 weeks, keep monthly for a year. Snapshots accumulate indefinitely unless someone manages them manually — which no one does, until the storage is full. I have seen production environments with thousands of orphaned snapshots consuming terabytes of capacity that the team did not know existed.

Snapshots do not cover storage outside the instance disk. If your training datasets live on attached Ceph RBD volumes or in Swift object storage — which they should, for performance reasons — a Nova instance snapshot does not capture them. At all. You may successfully restore your training VM and discover that the 400GB dataset it was training on is gone.

Use snapshots for rapid rollback of specific changes. Use backup for actual data protection. These are different tools for different problems.

What Makes GPU Instances Different to Back Up

Backing up a GPU-attached Nova instance is not the same as backing up a standard VM. The differences are meaningful enough to warrant a dedicated section.

Disk image sizes are large — often very large

A standard enterprise VM might have a 50-80GB OS disk. A GPU training instance typically carries significantly more: the base OS, CUDA toolkit, deep learning framework (PyTorch, TensorFlow, JAX), model weights from previous runs, dataset subsets for the current job, and intermediate checkpoint files. It is not unusual to see GPU instance root volumes of 200-500GB, with attached Cinder data volumes that are larger still.

Backing up a 400GB disk image nightly with a full copy approach is not a strategy — it is an infrastructure bill and a backup window problem. The only viable approach at this scale is block-level incremental backup with Change Block Tracking (CBT). CBT records which blocks changed since the last backup cycle, and transfers only those blocks. For a 400GB training volume where the epoch completed and 15GB of checkpoint files changed, you back up 15GB — not 400GB. This is not an optimization. At AI data scale, it is the difference between a backup system that works and one that falls over during the first backup window.
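A hypervisor's CBT driver maintains the dirty-block map itself, so nothing has to be rescanned; still, the contract it implements fits in a few lines. A toy sketch, with a file-based comparison and block size chosen purely for illustration:

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks, purely illustrative

def changed_blocks(previous_path, current_path):
    """Yield (offset, data) for blocks that differ between two image files.
    A real CBT driver gets this map from the hypervisor instead of rescanning."""
    with open(previous_path, "rb") as prev, open(current_path, "rb") as cur:
        offset = 0
        while True:
            old, new = prev.read(BLOCK_SIZE), cur.read(BLOCK_SIZE)
            if not new:
                break
            if old != new:
                yield offset, new
            offset += BLOCK_SIZE

# Transfer only the changed extents: for a 400GB volume where 15GB of
# checkpoint files changed, roughly 15GB goes over the wire, not 400GB.
```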

GPU passthrough and vGPU instances require careful handling

PCI passthrough and vGPU configurations attach physical or virtual GPU devices to Nova instances in ways that snapshot-based approaches can struggle with. The GPU device state is not part of the VM disk image — it exists in hardware registers and driver memory. This means a backup taken during active GPU computation captures everything except the in-flight GPU state, which is fine for recovery purposes (GPU jobs are restartable) but means you need to understand exactly what your backup captures and what it does not.

The practical implication: back up GPU instances between training jobs, not during them. A training job is a stateless compute operation from a backup perspective — the valuable state is the checkpoint files and the dataset, not the in-flight computation. Design your backup schedule around job boundaries, not arbitrary time windows.
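One way to express "job boundaries, not time windows" is a wrapper that launches the training process and triggers the backup only after it exits. A minimal sketch; the training entrypoint and volume name are hypothetical, and the Cinder-native backup command stands in for whatever starts the equivalent job in your environment:

```python
import subprocess
import sys

TRAIN_CMD = ["python", "train.py"]  # hypothetical training entrypoint
BACKUP_CMD = ["openstack", "volume", "backup", "create",
              "--incremental", "checkpoint-vol-01"]  # hypothetical volume name

def main():
    # Run the training job to completion first: afterwards the volume is
    # quiescent, so the backup captures a stable checkpoint state.
    train = subprocess.run(TRAIN_CMD)
    if train.returncode != 0:
        sys.exit(train.returncode)  # failed job: investigate before backing up
    subprocess.run(BACKUP_CMD, check=True)

if __name__ == "__main__":
    main()
```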

Here is the slightly counterintuitive insight: the GPU instance itself is often not your most critical backup target. The CUDA environment and the deep learning framework can be rebuilt from a container image or an Ansible playbook. What cannot be rebuilt cheaply is the training dataset and the model checkpoints. If you are prioritizing backup capacity, prioritize Cinder volumes and Ceph storage over OS disk images.

Large QCOW2 images and the synthetic full backup advantage

Storware generates synthetic full backup images on the destination side — meaning it constructs a complete, recoverable full backup from a previous full backup plus incremental changes, without transferring the full disk image again. For large QCOW2 images, this is significant: you get the recoverability of a full backup with the network transfer cost of an incremental. This matters most when your backup window is constrained by network bandwidth, which is almost always the case for AI training infrastructure with large dataset volumes.
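In the abstract, the mechanism works like this: the destination replays each incremental's changed-block map on top of the previous full, materializing a new full from data it already holds. A toy sketch of the idea, not Storware's implementation:

```python
def synthesize_full(base_image, incrementals):
    """base_image: bytes of the last full backup.
    incrementals: changed-block maps ({offset: bytes}), oldest first.
    Returns a new full image assembled entirely on the destination."""
    image = bytearray(base_image)
    for diff in incrementals:
        for offset, data in diff.items():
            image[offset:offset + len(data)] = data
    return bytes(image)

# Only the incrementals ever crossed the network; the new full backup
# is constructed from blocks already sitting on the backup destination.
```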

Ceph RBD: The Most Important Thing You Are Probably Not Backing Up

Here is an unpopular opinion: most organizations running AI on OpenStack are not adequately protecting their Ceph storage. They are backing up VMs. They have forgotten that their training data lives somewhere else.

Ceph RBD (RADOS Block Device) is the dominant storage backend for production OpenStack deployments. It provides the block storage that Cinder volumes are built on, and the same Ceph cluster frequently serves object storage through its Swift-compatible gateway. In an AI training environment, Ceph is where your data actually lives: the labeled datasets, the preprocessing outputs, the experiment artifacts, the model checkpoints between training runs.

The good news is that Ceph has excellent native snapshot capabilities at the RBD layer. The problem is that most backup tools treat Ceph as an opaque block device and back it up through the VM layer — mounting the Cinder volume to the VM, reading it through the guest OS, transferring the data over the network. This is the worst possible approach for large AI datasets. You are traversing three additional layers (hypervisor, guest OS, network) for data that could be read directly from the Ceph cluster.

Storware integrates directly with Ceph RBD at the storage layer, using RBD snapshot differencing. This means: take an RBD snapshot, compare it to the previous snapshot at the block level, transfer only the changed extents to the backup destination. No VM layer. No guest OS involvement. No agent. For a 2TB Ceph pool backing AI training data, the difference between storage-layer backup and VM-mediated backup is the difference between hours and days for the initial full backup, and minutes versus hours for incremental jobs.
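At the command-line level, the primitive Ceph exposes for this is snapshot differencing via rbd export-diff. A sketch of one incremental cycle; the pool, image, and snapshot names are hypothetical:

```python
import subprocess

POOL, IMAGE = "ai-training", "dataset-vol"          # hypothetical pool and image
PREV_SNAP, NEW_SNAP = "backup-0041", "backup-0042"  # hypothetical snapshot names

# 1. Cut a new point-in-time snapshot at the storage layer.
subprocess.run(["rbd", "snap", "create", f"{POOL}/{IMAGE}@{NEW_SNAP}"], check=True)

# 2. Export only the extents that changed since the previous snapshot.
#    No hypervisor, no guest OS, no agent in the data path.
subprocess.run(["rbd", "export-diff", "--from-snap", PREV_SNAP,
                f"{POOL}/{IMAGE}@{NEW_SNAP}",
                f"/backup/{IMAGE}-{NEW_SNAP}.diff"], check=True)

# The diff can later be replayed onto a base image with `rbd import-diff`.
```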

The practical recommendation: treat Ceph RBD backup as a first-class backup job in its own right, separate from and in addition to VM-level backup. Your VM backup captures the compute environment. Your Ceph backup captures the data. You need both.

Model Checkpoints: The Asset Nobody Thinks to Protect

Training datasets are the obvious valuable asset. But model checkpoints are arguably more valuable, and they are systematically under-protected.

A model checkpoint is the serialized state of a neural network at a specific point in training: the weights, the optimizer state, the learning rate schedule. For a large model trained for days or weeks, the checkpoint represents hundreds or thousands of GPU-hours of irreversible computation. Losing the final checkpoint — or losing the intermediate checkpoints that let you resume a failed run — is not a storage event. It is a budget event.

The specific risk is more subtle than total loss. Training jobs save checkpoints incrementally: every N epochs, write a checkpoint file to a Cinder volume or object storage path. If that path is on a Ceph pool that is not in your backup scope, and your Ceph cluster has a corruption event, you lose checkpoints without losing the VM. Your backup system reports success — the VM was backed up — and your checkpoints are gone.

The protection architecture for checkpoints should include: short-cycle backup of the checkpoint storage location (hourly or per-epoch if the job duration justifies it), object storage mirroring for final model artifacts, and WORM-immutable storage for production model versions that have been deployed to inference. The last point matters for EU AI Act compliance: if a model version is in production serving decisions that affect people, the training data and the model weights need to be auditable and tamper-evident. Immutable storage is the technical implementation of that requirement.
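One detail on the producer side is worth adding: checkpoints are only recoverable if they are written atomically, because a backup that catches a half-written file is crash-inconsistent by construction. A common pattern, sketched with PyTorch and an illustrative path:

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, path="/mnt/ckpt/model.pt"):
    """Write to a temp file, then atomically rename, so a backup job
    never sees a partially written checkpoint."""
    state = {
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
```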

Kubernetes: The Other Half of Your AI Platform

A common architecture pattern for enterprise AI on OpenStack looks like this: training runs on GPU Nova instances, inference runs in Kubernetes or OpenShift containers. The training infrastructure is OpenStack. The serving infrastructure is Kubernetes. The boundary between them is an artifact repository — a model registry, an S3-compatible bucket, a Helm chart that describes the inference deployment.

Most backup strategies cover the OpenStack half. They forget the Kubernetes half. This is understandable — Kubernetes backup is a separate discipline with its own complexity (PersistentVolumeClaims, namespaces, Helm releases, ConfigMaps, secrets). But it creates a specific recovery failure: you can restore your training environment perfectly and discover that your inference infrastructure — which was serving production traffic — cannot be recovered because nobody was backing it up.
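A quick way to gauge how much of that surface exists in your own cluster is to enumerate it. A minimal sketch with the official Kubernetes Python client; the namespace name is hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "inference"  # hypothetical namespace serving production models

# Everything printed here is state that a VM-level backup does not see.
for pvc in v1.list_namespaced_persistent_volume_claim(NAMESPACE).items:
    print("PVC:", pvc.metadata.name, pvc.spec.resources.requests.get("storage"))
for cm in v1.list_namespaced_config_map(NAMESPACE).items:
    print("ConfigMap:", cm.metadata.name)
for secret in v1.list_namespaced_secret(NAMESPACE).items:
    print("Secret:", secret.metadata.name)
```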

The practical argument for unified VM and container backup under a single platform is not about elegance. It is about what happens during an actual incident. When something goes wrong at 2 AM and you are trying to restore a production AI platform, having two separate backup tools with separate restoration procedures, separate credentials, and separate retention policies is a problem you do not want.

Storware covers both OpenStack instances and Kubernetes/OpenShift workloads under a single policy framework. This is one of those architectural decisions that seems like a preference until the first incident, at which point it looks like obvious engineering foresight.

Air-Gap Protection for AI Infrastructure: Not Optional

The standard advice for ransomware protection is the 3-2-1 rule: three copies of data, on two different media types, with one copy offsite. This is good advice for general enterprise data. For AI training infrastructure, it is not sufficient.

Here is why: AI training environments are among the highest-value ransomware targets in the enterprise. A labeled training dataset represents months of data engineering work, potentially millions of dollars in data acquisition and annotation costs. A trained model weights file represents hundreds of GPU-hours that cannot be recovered from a backup if the backup itself is encrypted.

Ransomware groups are aware of this. Modern ransomware does not just encrypt the primary data — it specifically targets backup infrastructure. It scans the network for backup repositories, backup agents, and backup management consoles, and encrypts or corrupts those first. A backup that is accessible from the production network provides no protection against a ransomware variant that reaches the backup server before encrypting the primary data.

Air-gap protection — backup destinations that are logically or physically isolated from the production network during normal operations — is the only reliable defense against this attack pattern. IsoLayer, Storware’s air-gap mechanism, provides this isolation without requiring physical media rotation or complex network segmentation. The backup destination is accessible to the backup system during backup windows, and inaccessible to everything else during normal operations. A ransomware process that compromises the production OpenStack cluster cannot reach the backup data.

For AI infrastructure specifically, the air-gap question should not be “do we need this?” It should be “what is the acceptable recovery point if ransomware reaches our training data?” If the answer is “we need to recover the last 24 hours of training checkpoints and the full dataset,” you need an air-gapped backup that was written in the last 24 hours. Work backward from that requirement to the backup schedule.

A Practical Backup Configuration for OpenStack AI Infrastructure

Rather than abstract principles, here is a concrete starting point for backup configuration on a production AI platform. Adjust the frequency and retention based on the cost of your GPU compute and the regulatory requirements of your specific environment:

| Target | Backup Type | Frequency | Retention | Destination |
|---|---|---|---|---|
| GPU Nova instances (OS + frameworks) | Agentless VM, incremental CBT | Daily | 14 days daily, 3 months weekly | Local fast storage + object storage |
| Cinder training data volumes | Ceph RBD snapshot diff | Daily + after major preprocessing | 30 days daily, 6 months weekly | Object storage (S3-compatible) |
| Active model checkpoints | Ceph RBD / Cinder incremental | Every 4-6 hours or per-epoch | 7 days continuous | Local fast storage + IsoLayer air-gap |
| Production model artifacts | Object storage backup | On deploy + daily | WORM-immutable, compliance-defined | IsoLayer air-gap + tape (long-term) |
| Kubernetes inference workloads | Container backup (namespaces, PVCs) | Daily | 14 days daily | Object storage |


A note on the checkpoint frequency: the right interval depends on your GPU cost per hour and your tolerance for restarting training jobs. If you are running A100 clusters at €8/hour per GPU and your training jobs take 48 hours, losing a 20-hour checkpoint is not a storage event — it is a €160+ compute bill per GPU, multiplied by cluster size. Price the backup frequency against the recovery cost, not the storage cost.
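The trade-off is easy to price. A small sketch with the numbers from above as placeholder inputs:

```python
def checkpoint_loss_cost(gpu_cost_per_hour, gpus, hours_since_last_checkpoint):
    """Worst-case compute bill for losing the most recent checkpoint:
    every GPU re-runs the interval since the last recoverable save."""
    return gpu_cost_per_hour * gpus * hours_since_last_checkpoint

# Example from the text: A100s at 8 EUR/hour, 20 hours since the last
# usable checkpoint -> 160 EUR per GPU, or 1280 EUR for an 8-GPU node.
print(checkpoint_loss_cost(8, 1, 20))   # 160
print(checkpoint_loss_cost(8, 8, 20))   # 1280
```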

The Thing Nobody Tests: Recovery

Backup without tested recovery is a liability, not an asset. This sounds obvious. It is apparently not obvious enough, because the most common gap I see in backup architectures for AI infrastructure is exactly this: regular backup jobs that have never been restored.

For AI infrastructure, recovery testing has a specific shape that differs from general VM recovery. The test is not just "can we restore this VM?" The test is "can we resume training from this checkpoint, and does the resumed run produce results consistent with the pre-failure run?" The second question requires actually running the training job against the restored data, which is more expensive and more operationally complex than restoring a web server and checking that the home page loads.
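In practice, the checkpoint half of that test reduces to: restore the file, load it, resume for at least one step, and compare metrics against the live run. A minimal sketch of the load step, assuming the checkpoint layout from the save example earlier:

```python
import torch

def resume_from_checkpoint(model, optimizer, path="/mnt/restore/model.pt"):
    """Load a restored checkpoint and return the epoch to resume from.
    If this raises, the backup was never recoverable to begin with."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1

# The full test then runs at least one training step and checks that the
# loss trajectory is consistent with the pre-failure run.
```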

Storware supports schedulable recovery plan execution — automated DR testing that runs against a defined recovery procedure and produces audit evidence of the result. For organizations subject to DORA, this is not optional functionality: DORA requires tested failover with documented results. For everyone else, it is just good engineering practice.

Test your recovery. Actually run the training job. The one time you discover that your checkpoint backup was always crash-inconsistent is not the time you want to be discovering it.

Start With Your Current Environment

The architecture described here is not exotic or expensive to implement. It requires the right tooling, the right configuration, and — frankly — the decision to treat AI data with the same protection priority you would give to a production database.

If you want to understand what this looks like against your specific OpenStack deployment — distribution, storage topology, GPU configuration, Kubernetes integration — a 30-minute technical conversation with a Storware architect will get you to a concrete architecture recommendation, not a generic sales deck.

Book a technical consultation →

Or start the 60-day trial: storware.eu. Connect Storware to your OpenStack environment and see what is and is not protected — within hours of deployment.

→ For the strategic context on why enterprises are moving AI to OpenStack private cloud in the first place, see: Why Enterprises Are Running AI on OpenStack Private Cloud — and Why It Changes Your Backup Strategy.

→ For the financial case — GPU hours, dataset reconstruction costs, and the full ROI of AI-specific backup investment — see: The Hidden Cost of Unprotected AI Infrastructure.

Text written by:

Paweł Piskorz, Presales Engineer at Storware