Backing Up OpenStack + Kubernetes Hybrid AI Infrastructure
Table of contents
- How Production AI Platforms on OpenStack Actually Look
- The Boundary Problem: Where Things Actually Go Wrong
- Two Tools Is Not a Strategy
- Four Patterns Worth Implementing
- What “Unified” Actually Means in Practice
- Write the Runbook Before the Incident. Actually Write It.
- The Next Level: Backup Policy as Infrastructure Code
It is 2:17 AM. Your on-call phone goes off. Storage node down, Ceph cluster degraded, three training jobs aborted. You SSH in, assess the damage, and start the recovery playbook.
- Step one: restore the training VMs.
- Step two: restore the Kubernetes inference deployments that were writing predictions to a volume on the same Ceph pool.
- Step three: figure out why your Kubernetes backup tool does not have credentials to the new namespace you deployed last week.
- Step four: start a Slack thread with the ML team asking which checkpoint was the last clean one, because apparently nobody documented that either.
- Step five: promise yourself that this will never happen again.
It will happen again. Not because your team is careless, but because hybrid AI infrastructure on OpenStack was assembled by people solving different problems at different times with different tools, and nobody sat down to think about what happens when it all needs to come back at once. The OpenStack team owns the VMs. The platform team owns Kubernetes. The data team owns the Ceph storage that both depend on. And backup — the thing that touches all three — belongs to whoever is on call when something breaks.
This article is about designing that away before the next 2 AM call, not during it.
How Production AI Platforms on OpenStack Actually Look
Before talking about backup architecture, it helps to be honest about the actual topology — not the architecture diagram that lives in Confluence, but the thing that actually runs in production after two years of organic growth.
The OpenStack layer hosts long-running, compute-intensive work: GPU Nova instances for training runs, large-memory instances for data preprocessing, Cinder volumes for active training datasets and checkpoint files. This layer is managed by the infrastructure team. It changes slowly. Instances live for days or weeks. The GPU scheduling is Nova’s domain and the infrastructure team understands it well.
The Kubernetes layer sits on top, managed by the platform team — sometimes the same team, often not. It runs the fast-moving work: Kubeflow pipelines, KServe inference endpoints, MLflow experiment tracking, Jupyter notebook servers, model registry services. Pods spin up and down, deployments roll out multiple times per day, and the operational model is declarative GitOps that the platform team is proud of and the infrastructure team regards with mild suspicion.
The Ceph layer is shared infrastructure that both teams depend on and neither team fully owns. The OpenStack layer uses it through Cinder block volumes. The Kubernetes layer uses it through PersistentVolumeClaims backed by Ceph RBD CSI. The storage team — if there is a storage team — manages it. If there is no storage team, it is managed by whoever set it up and has been managing it since by virtue of knowing the most about it.
Here is the uncomfortable observation: in this three-layer, three-team architecture, there is almost never a single person who understands the complete data flow from a GPU Nova instance through shared Ceph storage into a Kubernetes inference endpoint. And there is almost never a backup strategy that was designed for the system as a whole, rather than for each layer independently.
That is not a criticism. It is how software systems grow. It is also exactly why the 2 AM call goes the way it does.
The Boundary Problem: Where Things Actually Go Wrong
In hybrid OpenStack + Kubernetes AI platforms, backup failures concentrate at the boundaries between layers. Not in the middle of OpenStack — that part usually works. Not in the middle of Kubernetes — that part usually works too. At the transitions. Here are the specific failures I have seen, with enough frequency that they deserve names:
The PVC-Cinder mismatch
A Kubernetes PersistentVolumeClaim backed by Ceph RBD may, at the storage layer, map to either a Cinder volume or a direct RBD image. Your VM backup tool sees it as an OpenStack resource. Your Kubernetes backup tool sees it as a PVC in a namespace. Neither tool has the full picture, and each typically operates independently.
The failure scenario: you restore the training Nova instance from backup. The instance comes up. The Ceph volumes are there. The model weights are there. Then you realise the MLflow tracking database — which runs as a Kubernetes Deployment with a PVC — was backed up separately, three hours earlier, by a different tool, against a different schedule. Your experiment metadata from the last three hours is gone. The model weights you just recovered reference experiments that MLflow has never heard of. You have two internally consistent backups that are mutually incoherent. Neither is wrong. Together they are useless.
I am not inventing this scenario. It is the kind of thing that makes experienced engineers go very quiet for a few seconds when you describe it to them, because they have either seen it or they have just realised they are one incident away from it.
The namespace drift problem
Kubernetes backup tools authenticate against the cluster and enumerate namespaces to protect, and that list is typically captured at configuration time. When a new namespace is deployed — a new model version, a new team’s experiment environment, a new inference endpoint — it does not automatically enter the backup scope unless your tooling and configuration are set up to discover it.
The pattern: a team deploys a new production inference namespace on a Thursday. Someone means to update the backup configuration. They get pulled into a sprint review. Monday comes. The namespace has been running in production since Thursday with no backup. On Tuesday, something goes wrong. The namespace is not in backup scope. The backup success dashboard has been green all weekend — the tool was successfully backing up everything it knew about. It just did not know about the new namespace.
The insidious part is that green backup metrics feel like safety. They are not safety. They are evidence that the system is successfully protecting whatever it was told to protect, which may or may not be what actually matters.
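If you want to know whether this has already happened to you, the check is short. Below is a minimal sketch using the official Kubernetes Python client; the protected-namespace list is a stand-in for whatever scope your backup tool was actually configured with, exported however your tooling allows.

```python
# Sketch: flag namespaces that exist in the cluster but are missing from the
# backup tool's configured scope. PROTECTED_NAMESPACES is an illustrative
# stand-in for the list your backup configuration actually contains.
from kubernetes import client, config

PROTECTED_NAMESPACES = {"prod-recsys", "inference-fraud", "mlflow"}  # illustrative

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
all_namespaces = {ns.metadata.name for ns in client.CoreV1Api().list_namespace().items}

for name in sorted(all_namespaces - PROTECTED_NAMESPACES):
    print(f"WARNING: namespace '{name}' is running with no backup coverage")
```

Run it on a schedule and alert on non-empty output: a green backup dashboard plus an empty drift report is a much stronger statement than a green dashboard alone.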
The shared-secret cross-contamination
Kubernetes Secrets and ConfigMaps in an AI platform often contain credentials that span both infrastructure layers: OpenStack application credentials for reading from Swift, Ceph cluster keys for direct RBD access, MLflow tracking server connection strings, model registry tokens. These cross-layer dependencies mean that restoring the Kubernetes namespace without the correct Secret state breaks connectivity to the OpenStack layer — even if the OpenStack layer is fully intact and correctly restored.
The recovery failure mode: you restore both layers from backup. Everything appears healthy. Then the model serving pod tries to write inference results to the Ceph-backed volume, and the Ceph credentials in the restored Secret turn out to predate a key rotation that happened between the backup of the Secret and the backup of the Ceph cluster. The write fails with an authentication error. You now have a correctly restored system that cannot perform its primary function, because two separately managed backup systems captured their respective pieces at slightly different moments in time.
This is why the unit of recovery for a hybrid AI platform is not “restore the VM” or “restore the namespace.” It is “restore a consistent, coordinated state across all layers simultaneously.” Anything less is restoring components, not restoring the system.
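One concrete post-restore check that catches this particular mismatch is comparing the Ceph key inside the restored Secret with the key the Ceph cluster currently expects. A rough sketch follows; the Secret name, namespace, data key, and Ceph user are illustrative, and it assumes both kubectl and the ceph CLI are reachable from wherever the check runs.

```python
# Sketch: verify that the Ceph key held in a restored Kubernetes Secret matches
# the key the Ceph cluster currently expects for that client. All names below
# are illustrative; adapt them to your own Secret layout and Ceph entities.
import base64
import subprocess

def secret_value(namespace: str, secret: str, key: str) -> str:
    raw = subprocess.check_output(
        ["kubectl", "get", "secret", secret, "-n", namespace,
         "-o", f"jsonpath={{.data.{key}}}"],
        text=True,
    )
    return base64.b64decode(raw).decode()

def ceph_auth_key(entity: str) -> str:
    return subprocess.check_output(["ceph", "auth", "get-key", entity], text=True).strip()

restored_key = secret_value("inference-prod", "ceph-rbd-secret", "userKey")  # illustrative
live_key = ceph_auth_key("client.kubernetes")                               # illustrative

if restored_key != live_key:
    raise SystemExit("Restored Secret holds a stale Ceph key; fix it before bringing workloads up")
print("Restored Secret matches the live Ceph key")
```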
Two Tools Is Not a Strategy
Here is the argument I hear against unified VM-and-container backup: “They have different mechanics. VM backup is block-level. Kubernetes backup is manifest and PVC state. You need different tools for different problems.”
This is technically accurate and operationally irrelevant.
The mechanics of how data is captured are an implementation detail. The operational reality is what matters, and the operational reality of running two separate backup tools for two layers of the same platform is this: when something goes wrong and you are recovering a system you have never actually tested recovering, you are now managing two separate credential sets, two separate alert channels sending you conflicting status messages, two separate retention policies that have probably diverged because someone updated one and forgot the other, two separate runbooks written by different people who made different assumptions about what “recovery complete” means, and two separate compliance reports that the auditor will notice are not aligned.
That is not double the protection. It is double the failure surface and half the coherence.
There is also an incentive problem that nobody talks about: when backup is owned by two different tools, it is often not clearly owned by anyone. The OpenStack backup is the infrastructure team’s problem. The Kubernetes backup is the platform team’s problem. The gap between them — the boundary where the three failure modes described above live — is nobody’s problem until it becomes everyone’s problem at 2 AM.
Storware covers OpenStack instances and Kubernetes/OpenShift workloads under a single policy framework. The technical operations are different per layer, as they should be. But policy definition, scheduling, retention, alerting, and recovery planning live in one place, with one data model, owned by one team with one runbook. The Ceph RBD integration operates at the storage layer directly, which means both the OpenStack Cinder volumes and the Kubernetes PVCs backed by the same Ceph cluster can be protected with coordinated timing — the consistency that two independent tools struggle to guarantee.
One tool. One runbook. One 2 AM recovery sequence. This is not elegance for its own sake. It is the operational difference between a recovery that works and one that almost works.
Four Patterns Worth Implementing
The patterns below are specific enough to be actionable. Each one came from observing what fails in production hybrid AI platforms — not from first principles.
Pattern 1: Event-driven backup triggers instead of time-based schedules
The most common backup failure in AI platforms is not storage corruption or infrastructure failure. It is schedule misalignment. A training job that was supposed to finish at midnight runs until 4 AM because the dataset was larger than expected. The backup job that was scheduled for 2 AM runs against an active training job and produces a crash-consistent snapshot of a GPU instance mid-epoch — which may look fine but may not restore to a fully usable training state.
The fix is event-driven backup triggers. When a training job completes — detected via MLflow run status, Kubeflow pipeline completion, or a post-job webhook — trigger the backup of the associated Cinder volumes and checkpoint locations. The backup runs against a quiesced state, not mid-computation. This requires a few lines of integration between your MLOps tooling and Storware’s REST API, and it produces recovery points you can actually use. A time-based schedule produces recovery points that look correct in the backup dashboard and fail during actual recovery.
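Here is a rough sketch of such a hook. The MLflow status check uses the real MLflow client API; the Storware endpoint path, payload, and token handling are hypothetical placeholders, so substitute the calls your actual deployment exposes.

```python
# Sketch of a post-job hook: back up only after the training run has actually
# finished. The MLflow call is real; the backup endpoint below is a
# hypothetical placeholder for a pre-defined Storware backup task.
import os
import requests
import mlflow

STORWARE_URL = os.environ["STORWARE_URL"]      # e.g. https://backup.example.internal (illustrative)
STORWARE_TOKEN = os.environ["STORWARE_TOKEN"]  # issued however your deployment issues API tokens

def on_training_complete(run_id: str, backup_task_id: str) -> None:
    run = mlflow.tracking.MlflowClient().get_run(run_id)
    if run.info.status != "FINISHED":
        raise RuntimeError(f"Run {run_id} ended as {run.info.status}; not triggering backup")

    # Hypothetical endpoint: start the backup task covering the instance's
    # Cinder volumes and checkpoint location, now that the state is quiesced.
    resp = requests.post(
        f"{STORWARE_URL}/api/backup-tasks/{backup_task_id}/run",
        headers={"Authorization": f"Bearer {STORWARE_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
```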
A few lines of code in your post-job hook. The difference between a backup that exists and a backup that works.
Pattern 2: Policy-scoped auto-discovery instead of enumerated namespace lists
The namespace drift problem has one solution: stop enumerating. A backup policy that protects all namespaces matching prod-* and inference-* automatically protects new production namespaces when they are created. There is no manual update step. There is no “I meant to do that.” New targets matching the policy are protected from the moment they exist.
This is how infrastructure configuration works in a mature GitOps environment — you declare intent, the system discovers targets. Backup policy should work the same way. Enumerating specific namespaces in a backup configuration is the backup equivalent of hardcoding IP addresses. It works until it doesn’t, and when it stops working nobody notices immediately because the backup dashboard stays green.
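A sketch of the discovery half of that idea, assuming the official Kubernetes Python client: declare the scope as patterns, resolve them against the live cluster, and hand the result to the backup tool. The sync call at the end is a hypothetical placeholder for however your tooling accepts its target list.

```python
# Sketch: declare backup scope as patterns and discover matching namespaces at
# sync time instead of enumerating them by hand.
from fnmatch import fnmatch
from kubernetes import client, config

SCOPE_PATTERNS = ["prod-*", "inference-*"]

config.load_kube_config()
namespaces = [ns.metadata.name for ns in client.CoreV1Api().list_namespace().items]

in_scope = sorted(
    ns for ns in namespaces if any(fnmatch(ns, pattern) for pattern in SCOPE_PATTERNS)
)
print(f"{len(in_scope)} namespaces currently match the policy: {in_scope}")
# sync_policy_targets(policy_id="k8s-prod", namespaces=in_scope)  # hypothetical apply step
```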
Pattern 3: Coordinated Ceph snapshot timing for cross-layer consistency
When the OpenStack layer and the Kubernetes layer are both reading from and writing to shared Ceph RBD pools, backup consistency requires coordinated snapshot timing. If your OpenStack backup job runs at 02:00 and your Kubernetes backup job runs at 02:45, the two backups capture different moments in the life of the same underlying storage. For workloads where a Kubernetes model registry is writing references to Ceph-backed artifacts that a Nova training instance is simultaneously reading, a 45-minute gap between backups produces the kind of incoherence described in the failure modes above.
A unified backup platform that understands both layers can coordinate snapshot timing — or better, use a single Ceph RBD snapshot that both layers recover from, significantly reducing timing inconsistencies. This is architecturally important for AI platforms, not just convenient. The consistency guarantee that two independent tools cannot provide is the thing that makes a recovery actually work.
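To make the coordination concrete, here is a rough sketch using the python-rados and python-rbd bindings: the same timestamped snapshot name is created on the Cinder-backed image and on the Kubernetes PVC image back to back, so both layers share one recovery point. Pool and image names are illustrative, and in practice the backup platform handles this for you; the sketch only shows the shape of the guarantee.

```python
# Sketch: take near-simultaneous RBD snapshots of a Cinder-backed image and a
# Kubernetes PVC image under one snapshot name, so both layers recover from
# the same moment. Pool and image names are illustrative.
import time
from contextlib import closing
import rados
import rbd

TARGETS = [
    ("volumes", "volume-3f2a17d0"),    # Cinder training volume (illustrative name)
    ("kube-rbd", "csi-vol-9c1b44ae"),  # Kubernetes PVC image (illustrative name)
]

snap_name = f"coordinated-{int(time.time())}"

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for pool, image_name in TARGETS:
        with closing(cluster.open_ioctx(pool)) as ioctx:
            with closing(rbd.Image(ioctx, image_name)) as image:
                image.create_snap(snap_name)  # same snapshot name across both layers
finally:
    cluster.shutdown()
```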
Pattern 4: Backup verification as a model promotion gate
This is the unconventional one. I am recommending it because I have seen the failure mode it prevents too many times.
In a mature MLOps workflow, model promotion follows a pipeline: experimental → staging → production. Each gate includes validation: accuracy metrics, bias checks, latency benchmarks. What almost no team includes as a gate is backup verification: a check that both the model artifacts and the inference deployment configuration are in backup scope and have a recent successful backup before the model goes live.
The result is a recurring pattern: a model gets promoted to production in a hurry — because the business wants it live, because it passed its quality gates, because deployment is smooth. Three months later, something goes wrong. The model artifacts are not in backup scope. They were never in backup scope. Nobody noticed because the backup dashboard showed green — it was protecting everything it knew about, and this model was deployed before anyone added it to the policy.
Adding backup verification to the promotion gate takes one API call to Storware to check backup status for the relevant resources. It adds seconds to the deployment pipeline. It prevents the discovery, three months into production, that your most-used model has zero recovery options.
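A sketch of that gate, suitable for a CI step: it fails the pipeline unless the resources behind the model report a recent successful backup. The endpoint path, response fields, and resource identifiers are hypothetical placeholders, and the 24-hour freshness window is a policy choice, not a given.

```python
# Sketch of a promotion-gate check: block deployment if the model's resources
# have no sufficiently recent successful backup. Endpoint, fields, and resource
# identifiers are hypothetical placeholders for your actual backup API.
import os
import sys
import requests
from datetime import datetime, timedelta, timezone

STORWARE_URL = os.environ["STORWARE_URL"]
STORWARE_TOKEN = os.environ["STORWARE_TOKEN"]
MAX_AGE = timedelta(hours=24)  # freshness threshold: a policy choice

def assert_backed_up(resource_id: str) -> None:
    resp = requests.get(
        f"{STORWARE_URL}/api/protected-resources/{resource_id}/last-backup",  # hypothetical
        headers={"Authorization": f"Bearer {STORWARE_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    info = resp.json()
    finished = datetime.fromisoformat(info["finishedAt"])  # assumed ISO-8601 with offset
    if info["status"] != "SUCCESS" or datetime.now(timezone.utc) - finished > MAX_AGE:
        sys.exit(f"Promotion blocked: no recent successful backup for {resource_id}")

assert_backed_up("namespace:inference-fraud")    # illustrative resource identifiers
assert_backed_up("volume:model-artifacts-prod")
```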
It is also — this is not the primary reason to do it, but worth noting — a useful signal for EU AI Act compliance: the model’s backup status at deployment time becomes part of its deployment metadata and audit trail. Compliance evidence that generates itself as a side effect of good engineering is the best kind.
What “Unified” Actually Means in Practice
A single table is worth more than three paragraphs of description here. For a production AI platform on OpenStack with a Kubernetes inference and MLOps layer, Storware covers the following under one management interface:
| Component | Layer | Protection Mechanism | Policy Scope |
|---|---|---|---|
| GPU Nova instances | OpenStack | Agentless, incremental backup mechanisms | VM policy |
| Cinder training volumes | OpenStack | Ceph RBD snapshot differencing — storage layer, no VM overhead | VM policy (attached volumes) |
| Inference service Deployments | Kubernetes | Kubernetes resource manifest capture | Container policy |
| Inference PVCs (Ceph RBD CSI) | Kubernetes | CSI snapshot + coordinated Ceph RBD backup | Container policy |
| MLflow tracking server | Kubernetes | Pod + PVC, application-consistent | Container policy |
| Kubeflow Pipelines | Kubernetes | Namespace-scoped resource backup | Container policy |
| Secrets and ConfigMaps | Kubernetes | Encrypted namespace backup — cross-layer credentials included | Container policy |
Notice the last row. Secrets and ConfigMaps are often the last thing teams think to include in backup scope and the first thing that breaks a recovery. Backing them up is not dramatic. Not having them backed up is.
The OpenStack Horizon plugin means backup status for Nova instances and volumes is visible directly from the OpenStack dashboard — the one the infrastructure team already has open. Not a separate tab. Not a separate tool. The same place where they manage everything else.
Write the Runbook Before the Incident. Actually Write It.
Here is an observation about incident response that most engineers know but rarely act on: the runbook you improvise during an incident is never as good as the one you wrote before it. The cognitive load of a real incident — the pressure, the ambiguous error messages, the Slack threads with eight people asking different questions simultaneously — makes systematic thinking genuinely harder. The runbook you write in a calm afternoon is work done by a clearer version of you.
For hybrid OpenStack + Kubernetes AI platforms, the recovery sequence has a specific order that matters and is easy to get wrong under pressure:
- Restore Ceph storage health and availability first — both layers depend on it, nothing else works until this is done
- Restore Kubernetes infrastructure state — cluster configuration, namespace definitions, RBAC — before application workloads
- Restore Secrets and ConfigMaps — cross-layer credentials must be in place before any workload tries to authenticate
- Restore Nova instances for the training environment
- Restore Cinder volumes — checkpoint volumes for aborted training jobs specifically
- Restore Kubernetes application workloads — inference deployments, MLOps services, PVCs
- Verify cross-layer consistency — model registry entries must match artifact locations in Ceph
- Run a smoke test training job against restored checkpoints before declaring recovery complete
Step 8 is the one that gets skipped. “Services are running” and “the platform is recovered” are not the same statement. A training job that runs correctly against restored checkpoints and produces results consistent with pre-incident state is evidence of recovery. Services being up is evidence that services are up, which is necessary but not sufficient.
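Even the first half of that smoke test, checking that the restored checkpoint loads and contains what the training code expects, catches a surprising number of bad restores. Below is a minimal sketch assuming PyTorch-style checkpoint files; the path and the expected keys are illustrative, and the full test is a short training run whose results are compared against pre-incident metrics.

```python
# Sketch: verify a restored checkpoint loads and has the expected structure
# before anyone tries to resume training from it. Path and key layout are
# illustrative; adapt to how your training code writes checkpoints.
import torch

CHECKPOINT = "/restored/checkpoints/last-clean.pt"       # illustrative path
EXPECTED_KEYS = {"model", "optimizer", "epoch", "loss"}  # illustrative layout

state = torch.load(CHECKPOINT, map_location="cpu")
missing = EXPECTED_KEYS - set(state)
if missing:
    raise SystemExit(f"Restored checkpoint is missing keys: {sorted(missing)}")
print(f"Checkpoint loads cleanly; last recorded epoch {state['epoch']}, loss {state['loss']}")
```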
Storware’s Recovery Plans let you define, document, and schedule automated testing of this sequence. The plan runs, the results are logged, the audit record is produced. This is how you verify the runbook works at 2 PM on a Tuesday rather than discovering it does not work at 2 AM on a Sunday.
The Next Level: Backup Policy as Infrastructure Code
For teams already operating in a GitOps model — infrastructure state declared in version-controlled manifests, applied through reviewed PRs, audited through commit history — treating backup policy as code is the natural extension.
Define backup policies as configuration that lives in the same repository as your Kubernetes manifests. Apply changes through the same PR review process. Audit backup policy changes through the same git log that tracks every other infrastructure change. When the question is “who changed the retention policy for the inference namespace and when,” the answer is in the commit history, not in a GUI audit log that nobody exports.
Backup policies that change through code review are harder to accidentally break, easier to audit, and automatically documented as a side effect of the review process. A policy change that went through a PR has a reviewer, a timestamp, a rationale, and a diff. A policy change made through a GUI at 11 PM during an incident has a timestamp, if you are lucky.
Storware’s REST API supports full programmatic management of backup policies, schedules, and retention configurations. The integration is straightforward if your team is already working this way. If you are not working this way yet — this is, incidentally, a good reason to start. Backup policy is infrastructure. Treat it like infrastructure.
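As a sketch of the apply step, assume the policy lives as a YAML file next to your manifests and reaches the backup platform through a CI job. The repository path and the REST endpoint below are hypothetical placeholders; the point is that the only route to production is a reviewed commit.

```python
# Sketch of a CI apply step: read a version-controlled backup policy and push
# it through the backup platform's REST API. File path and endpoint are
# hypothetical placeholders; the workflow, not the URL, is the point.
import os
import requests
import yaml

STORWARE_URL = os.environ["STORWARE_URL"]
STORWARE_TOKEN = os.environ["STORWARE_TOKEN"]

with open("backup/policies/inference-prod.yaml") as f:  # illustrative repo path
    policy = yaml.safe_load(f)

resp = requests.put(
    f"{STORWARE_URL}/api/policies/{policy['name']}",     # hypothetical endpoint
    json=policy,
    headers={"Authorization": f"Bearer {STORWARE_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(f"Applied policy '{policy['name']}' from commit {os.environ.get('CI_COMMIT_SHA', 'local')}")
```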
→ For the financial case for this architecture investment, see: The Hidden Cost of Unprotected AI Infrastructure.
→ For the GPU-specific backup architecture including CBT and Ceph RBD detail, see: How to Protect GPU Workloads on OpenStack.
→ For the complete reference on OpenStack AI data protection, see the pillar page: Backup and Data Protection for AI Workloads on OpenStack: The Complete Guide.
If you want to talk through what this architecture looks like against your specific platform topology, book a 30-minute consultation. Bring the architecture diagram — even the messy real one, not the Confluence version.
