Complete Guide to Kubernetes Backup on OpenStack: StatefulSets, Persistent Volumes, and Production-Ready DR Strategies
Table of contents
- Part 1: Understanding Kubernetes Storage on OpenStack
- Part 2: The Multi-Layer Backup Challenge
- Part 3: etcd Backup Best Practices
- Part 4: Persistent Volume Backup Strategies
- Part 5: Application-Aware Backup for Databases
- Part 6: Open-Source Solutions – Velero Integration
- Part 7: Enterprise Backup with Storware
- Part 8: Production DR Planning
- Part 9: Troubleshooting Common Backup Failures
- Conclusion: Building a Production-Ready Backup Strategy

Part 1: Understanding Kubernetes Storage on OpenStack
StatefulSets vs Stateless Workloads
Before diving into backup strategies, you need to understand what you’re protecting. Kubernetes workloads fall into two categories:
Stateless applications can be destroyed and recreated without data loss. A web frontend serving static content, an API gateway, or a load balancer are stateless—if they disappear, you redeploy them from images, and they’re back online. For stateless workloads, backing up your deployment YAML files and container images is sufficient.
Stateful applications require persistent data that survives pod restarts and rescheduling. Databases, message queues, caching layers, and any application maintaining local state need careful backup strategies. StatefulSets manage these workloads, providing:
- Stable network identities (predictable DNS names)
- Ordered deployment and scaling
- Persistent storage that follows pods during rescheduling
- Stable pod identifiers that persist across updates
Consider a PostgreSQL database deployed as a StatefulSet with three replicas. Each pod (postgres-0, postgres-1, postgres-2) maintains its own persistent volume. If postgres-1 fails and Kubernetes reschedules it to a different node, the same persistent volume must reattach, preserving data integrity.
Persistent Volumes, Claims, and Storage Classes
Kubernetes abstracts storage through three components:
PersistentVolume (PV) represents actual storage—a Cinder volume in OpenStack, an NFS share, or local disk. PVs exist independently of pods and persist after pod deletion.
PersistentVolumeClaim (PVC) is a request for storage. Applications specify their storage needs through PVCs (size, access modes, performance characteristics). Kubernetes binds PVCs to suitable PVs.
StorageClass defines storage types available in your cluster. On OpenStack, you might have:
- cinder-ssd for high-performance database workloads
- cinder-hdd for log archival and backups
- ceph-rbd for shared storage scenarios
Critical backup implication: backing up PVs alone isn’t enough. You need to preserve:
- The PV definition (which Cinder volume ID)
- The PVC binding (which pod claimed which volume)
- StorageClass configurations
- Application state and consistency
A working backup example:
```yaml
# StatefulSet with persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: cinder-ssd
        resources:
          requests:
            storage: 100Gi
```
This creates three PostgreSQL pods, each with a dedicated 100Gi Cinder SSD volume. Your backup strategy must account for all three volumes plus database consistency across the replicas.
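Once deployed, you can list the per-pod claims directly; PVC names follow the template-name, StatefulSet name, ordinal pattern (the production namespace here is an assumption):
```bash
# List the claims generated by the volumeClaimTemplates:
# data-postgresql-0, data-postgresql-1, data-postgresql-2
kubectl get pvc -n production | grep '^data-postgresql-'
```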
OpenStack Integration: Cinder CSI Driver
The Container Storage Interface (CSI) driver connects Kubernetes to OpenStack Cinder. When a pod requests storage through a PVC, the Cinder CSI driver:
- Calls OpenStack API to create a Cinder volume
- Attaches the volume to the compute node running the pod
- Mounts the volume into the container at the specified path
For backups, understanding CSI is crucial because it enables:
- Volume snapshots through CSI VolumeSnapshot API
- Volume cloning for testing and development
- Dynamic provisioning during restore operations
The Cinder CSI driver configuration affects your backup options:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-csi
provisioner: cinder.csi.openstack.org
parameters:
  availability: nova
  type: ssd
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
Part 2: The Multi-Layer Backup Challenge
What Traditional Backups Miss
Traditional VM backup approaches fail in Kubernetes environments for several reasons:
1. Cluster State Lives in etcd
All Kubernetes objects—deployments, services, ConfigMaps, Secrets, RBAC policies—exist as entries in etcd’s key-value store. Backing up worker node disks won’t capture this state. When disaster strikes, you might recover volumes but lose:
- Which deployment managed those volumes
- Network policies connecting services
- Ingress rules exposing applications
- Custom Resource Definitions (CRDs) for operators
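As a stopgap while proper etcd backups are being set up, you can export the resource manifests themselves with kubectl; a minimal sketch (a coarse safety net, not a substitute for etcd snapshots):
```bash
# Dump the main namespaced resources across all namespaces
# (treat the output as sensitive: it includes Secret data)
kubectl get deployments,statefulsets,services,configmaps,secrets,ingresses \
  --all-namespaces -o yaml > cluster-resources.yaml

# Cluster-scoped objects need a separate export
kubectl get crds,clusterroles,clusterrolebindings -o yaml > cluster-scoped.yaml
```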
2. Distributed Application State
Modern applications spread across multiple pods with complex dependencies. A typical microservices stack might include:
- Frontend pods (stateless, 3 replicas)
- API gateway (stateless, 2 replicas)
- Application pods (stateless, 5 replicas)
- PostgreSQL (StatefulSet, 3 replicas)
- Redis cache (StatefulSet, 3 replicas)
- Message queue (StatefulSet, 3 replicas)
Backing up just the database volumes ignores the application topology. During recovery, you need to recreate the entire service mesh, including ConfigMaps with database connection strings, Secrets with credentials, and Services providing stable endpoints.
3. Application Consistency Requirements
File-level backups of database volumes often produce corrupted, unrecoverable data. Databases maintain in-memory buffers, write-ahead logs, and complex consistency mechanisms. A snapshot taken while PostgreSQL writes to multiple files simultaneously may capture half-committed transactions.
Application-aware backups require:
- Quiescing writes before snapshot
- Flushing buffers to disk
- Coordinating multi-volume consistency (for sharded databases)
- Verifying backup integrity
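As an illustration, a pre-snapshot quiesce for a single PostgreSQL pod might look like the following sketch; fsfreeze needs privileged access, and the pod and namespace names are assumptions:
```bash
# Force PostgreSQL to flush dirty buffers and WAL to disk
kubectl exec -n production postgresql-0 -- psql -U postgres -c "CHECKPOINT;"

# Optionally freeze the filesystem so the block snapshot is crash-consistent;
# unfreeze immediately after the snapshot is cut
kubectl exec -n production postgresql-0 -- fsfreeze --freeze /var/lib/postgresql/data
# ... create the VolumeSnapshot here ...
kubectl exec -n production postgresql-0 -- fsfreeze --unfreeze /var/lib/postgresql/data
```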
4. Network Configuration Preservation
Kubernetes networking is complex. Beyond basic pod-to-pod communication, production environments include:
- LoadBalancer services with floating IPs from OpenStack Neutron
- Ingress controllers with TLS certificates
- NetworkPolicies restricting traffic between namespaces
- Service mesh configurations (Istio, Linkerd)
- DNS records pointing to services
Disaster recovery isn’t complete until applications are reachable. Losing network configurations means hours of manual reconfiguration.
The Three-Tier Backup Approach
Comprehensive Kubernetes backup requires protecting three distinct layers:
| Tier | What It Protects | Frequency | Retention | Recovery Time |
|---|---|---|---|---|
| Tier 1: Control Plane | etcd cluster state | Every 6 hours minimum | 30 days recommended | 10-20 minutes |
| Tier 2: Persistent Volumes | Application data | Application-dependent | Based on compliance | Minutes to hours |
| Tier 3: Application-Aware | Consistent database dumps | With maintenance windows | Long-term archival | Application-specific |
Warning: Most organizations discover gaps when they attempt their first recovery drill. The database restores successfully, but:
- The StatefulSet manifest is lost (no way to recreate pods)
- Service definitions are missing (database isn’t exposed)
- ConfigMaps with connection settings are gone
- Network policies block restored application traffic
Part 3: etcd Backup Best Practices
Understanding etcd’s Role
etcd stores every Kubernetes object as a key-value pair. When you run kubectl get pods, Kubernetes queries etcd. When you create a deployment, Kubernetes writes to etcd. The entire cluster state lives in etcd’s database.
Without etcd backup, you cannot recover:
- Deployment definitions
- Service configurations
- Namespace organization
- RBAC policies and service accounts
- Ingress rules
- Custom resources
- ConfigMaps and Secrets
Manual etcd Backup
For clusters deployed with kubeadm or Kubespray, etcd typically runs as static pods on control plane nodes. SSH into a control plane node and create a snapshot:
```bash
# Set etcd endpoints and certificates
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db
```
This creates a point-in-time snapshot. The snapshot file contains everything needed to rebuild your cluster state.
Automated etcd Backup
Production environments need automated, scheduled backups. Deploy a CronJob that creates etcd snapshots and uploads them to object storage:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
            - name: backup
              # Note: the stock etcd image does not ship the Swift client;
              # use a custom image bundling etcdctl and python-swiftclient
              image: registry.k8s.io/etcd:3.5.9-0
              command:
                - /bin/sh
                - -c
                - |
                  BACKUP_FILE="etcd-backup-$(date +%Y%m%d-%H%M%S).db"
                  etcdctl \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key \
                    snapshot save /backup/$BACKUP_FILE
                  # Upload to Swift object storage
                  swift upload etcd-backups /backup/$BACKUP_FILE
                  # Clean up local copy
                  rm /backup/$BACKUP_FILE
                  # Retention: delete backups older than 30 days
                  swift list etcd-backups | \
                    awk -v d=$(date -d '30 days ago' +%Y%m%d) \
                      '$0 < "etcd-backup-"d' | \
                    xargs -I {} swift delete etcd-backups {}
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - effect: NoSchedule
              key: node-role.kubernetes.io/control-plane
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory
            - name: backup
              emptyDir: {}
```
This CronJob:
- Runs every 6 hours on control plane nodes
- Creates timestamped snapshots
- Uploads to OpenStack Swift
- Maintains 30-day retention
- Cleans up local storage to prevent disk exhaustion
etcd Recovery Procedure
When disaster strikes, restore etcd before recreating any workloads:
```bash
# Stop all control plane components
systemctl stop kubelet

# Download latest backup from Swift
swift download etcd-backups etcd-backup-YYYYMMDD-HHMMSS.db

# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore etcd-backup-YYYYMMDD-HHMMSS.db \
  --name=node1 \
  --initial-cluster=node1=https://10.20.20.11:2380,node2=https://10.20.20.12:2380,node3=https://10.20.20.13:2380 \
  --initial-advertise-peer-urls=https://10.20.20.11:2380 \
  --data-dir=/var/lib/etcd-restore

# Replace existing etcd data
rm -rf /var/lib/etcd
mv /var/lib/etcd-restore /var/lib/etcd

# Restart kubelet and verify
systemctl start kubelet
kubectl get nodes
```
Recovery time: 10-20 minutes for etcd restore, depending on database size.
Critical limitation: etcd backup only restores cluster configuration. Persistent volume data is NOT included in etcd snapshots. Your application data remains on the Cinder volumes themselves; the etcd snapshot restores only the PV/PVC bindings that connect pods to those volumes.
Part 4: Persistent Volume Backup Strategies
Understanding Volume Snapshot Lifecycle
Kubernetes 1.17+ includes built-in volume snapshot support through the CSI driver. Snapshots create point-in-time copies of persistent volumes without disrupting running applications.
The snapshot process:
- VolumeSnapshotClass defines how to create snapshots (similar to StorageClass for volumes)
- VolumeSnapshot is a request to snapshot a specific PVC
- VolumeSnapshotContent represents the actual snapshot in underlying storage (Cinder snapshot)
Create a VolumeSnapshotClass for Cinder:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cinder-snapclass
driver: cinder.csi.openstack.org
deletionPolicy: Delete
parameters:
  force-create: "true"
```
Snapshot a PostgreSQL database volume:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot-20250119
  namespace: production
spec:
  volumeSnapshotClassName: cinder-snapclass
  source:
    persistentVolumeClaimName: data-postgresql-0
```
This triggers the Cinder CSI driver to call OpenStack’s snapshot API. The snapshot captures the Cinder volume state at the moment of creation.
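Before treating the snapshot as usable, confirm the CSI driver has finished it; for example:
```bash
# readyToUse flips to true once the Cinder snapshot completes
kubectl get volumesnapshot postgres-snapshot-20250119 -n production \
  -o jsonpath='{.status.readyToUse}'

# The bound VolumeSnapshotContent records the underlying snapshot handle
kubectl get volumesnapshotcontent
```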
Automated Volume Snapshots with CronJob
Manual snapshots don’t scale. Automate snapshot creation for all StatefulSet PVCs:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pvc-snapshot
  namespace: backup-system
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-controller
          containers:
            - name: create-snapshots
              # Note: the image must also include jq for the retention step
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  # Find all PVCs used by StatefulSets (assumes the PVCs carry
                  # a StatefulSet ownerReference; adjust the filter if yours
                  # rely on labels instead)
                  for pvc in $(kubectl get pvc -A \
                    -o jsonpath='{range .items[?(@.metadata.ownerReferences[0].kind=="StatefulSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
                    NS=$(echo $pvc | cut -d/ -f1)
                    PVC_NAME=$(echo $pvc | cut -d/ -f2)
                    SNAPSHOT_NAME="$PVC_NAME-$(date +%Y%m%d-%H%M)"
                    cat <<EOF | kubectl apply -f -
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: $SNAPSHOT_NAME
                    namespace: $NS
                    labels:
                      backup-type: automated
                      source-pvc: $PVC_NAME
                  spec:
                    volumeSnapshotClassName: cinder-snapclass
                    source:
                      persistentVolumeClaimName: $PVC_NAME
                  EOF
                    echo "Created snapshot: $NS/$SNAPSHOT_NAME"
                  done

                  # Retention: delete snapshots older than 7 days
                  kubectl get volumesnapshot -A -o json | jq -r '.items[] |
                    select(.metadata.labels."backup-type" == "automated") |
                    select(.metadata.creationTimestamp < (now - 604800 | todate)) |
                    "\(.metadata.namespace) \(.metadata.name)"' | \
                  while read ns name; do
                    kubectl delete volumesnapshot -n $ns $name
                    echo "Deleted old snapshot: $ns/$name"
                  done
          restartPolicy: OnFailure
```
This automation:
- Discovers all StatefulSet PVCs across all namespaces
- Creates daily snapshots with timestamp naming
- Implements 7-day retention (configurable)
- Labels snapshots for easy identification
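To audit what the job produces, query by the label it applies; for example:
```bash
# All automated snapshots across namespaces, oldest first
kubectl get volumesnapshot -A -l backup-type=automated \
  --sort-by=.metadata.creationTimestamp
```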
Snapshot-Based Recovery
Restore a volume from snapshot by creating a new PVC:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-0-restored
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: cinder-ssd
  dataSource:
    name: postgres-snapshot-20250119
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi
```
Kubernetes creates a new Cinder volume from the snapshot and binds it to this PVC. Update your StatefulSet to use the restored PVC, or create a temporary pod for data verification.
Recovery time: 5-15 minutes for typical database volumes (100GB). Larger volumes take proportionally longer.
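A throwaway pod mounting the restored claim is the quickest way to inspect the data before touching the StatefulSet; a minimal sketch (the image and mount path are illustrative):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restore-verify
  namespace: production
spec:
  containers:
    - name: inspect
      image: busybox:1.36
      command: ["sleep", "3600"]  # keep the pod alive for manual inspection
      volumeMounts:
        - name: restored-data
          mountPath: /mnt/restore
  volumes:
    - name: restored-data
      persistentVolumeClaim:
        claimName: data-postgresql-0-restored
```
Then kubectl exec -it restore-verify -n production -- ls /mnt/restore confirms the files are present.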
Snapshot limitations:
- Snapshots live in the same Cinder backend as source volumes (vulnerable to storage backend failure)
- Cannot snapshot across availability zones
- No built-in compression or deduplication
- Snapshot chains can impact performance
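The first limitation can be mitigated by offloading snapshots to the Cinder backup service, which writes to a separate backend (typically Swift or Ceph); a sketch, assuming cinder-backup is enabled and using illustrative names:
```bash
# Locate the Cinder snapshot behind the VolumeSnapshotContent
openstack volume snapshot list

# Create a backup from the snapshot; it lands in the configured
# backup backend, independent of the source volume's storage
openstack volume backup create --snapshot <cinder-snapshot-id> \
  --name postgres-snap-offload <source-volume-id>
```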
Part 5: Application-Aware Backup for Databases
Why Application Awareness Matters
Volume snapshots capture raw block data. For databases, this creates problems:
Dirty buffers: PostgreSQL, MySQL, and MongoDB maintain in-memory write buffers. A snapshot might capture half-written transactions, producing a corrupted backup.
Consistency across volumes: Sharded MongoDB clusters store data across multiple volumes. A snapshot that captures shard-1 at 02:00:05 and shard-2 at 02:00:12 creates inconsistent data.
Transaction logs: Databases use write-ahead logs (WAL). Snapshots might capture database files without corresponding WAL files, preventing recovery.
Application-aware backups solve these problems by integrating with database backup tools that ensure consistency.
PostgreSQL Backup Integration
Deploy a sidecar container alongside PostgreSQL that performs logical backups using pg_dump:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        # Main PostgreSQL container
        - name: postgres
          image: postgres:14
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: backup
              mountPath: /backup
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
        # Backup sidecar container
        # (image must also include the Swift client for the upload step)
        - name: backup
          image: postgres:14
          command:
            - /bin/bash
            - -c
            - |
              while true; do
                # Wait until 2 AM
                current_hour=$(date +%H)
                if [ "$current_hour" -eq "2" ]; then
                  BACKUP_FILE="postgres-$(hostname)-$(date +%Y%m%d-%H%M%S).dump"
                  # Perform logical backup (custom format, max compression)
                  PGPASSWORD=$POSTGRES_PASSWORD pg_dump \
                    -h localhost \
                    -U postgres \
                    -F c \
                    -Z 9 \
                    -f /backup/$BACKUP_FILE \
                    postgres
                  # Upload to Swift
                  swift upload postgres-backups /backup/$BACKUP_FILE
                  # Clean up local copy
                  rm /backup/$BACKUP_FILE
                  echo "Backup completed: $BACKUP_FILE"
                  # Sleep until next day
                  sleep 82800  # 23 hours
                else
                  sleep 3600  # Check every hour
                fi
              done
          volumeMounts:
            - name: backup
              mountPath: /backup
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
      volumes:
        - name: backup
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: cinder-ssd
        resources:
          requests:
            storage: 100Gi
```
This approach:
- Runs pg_dump inside the pod (no network overhead)
- Compresses backups before upload (saves storage costs)
- Uploads to object storage (survives OpenStack failures)
- Maintains consistency (pg_dump ensures transactional integrity)
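Restoring from such a dump is a separate, manual step; a sketch with illustrative file and pod names:
```bash
# Fetch the dump and copy it into the target pod
swift download postgres-backups postgres-postgresql-0-20250119-020000.dump
kubectl cp postgres-postgresql-0-20250119-020000.dump \
  production/postgresql-0:/tmp/restore.dump

# pg_restore understands the custom (-F c) format;
# --clean drops existing objects before recreating them
kubectl exec -n production postgresql-0 -- \
  pg_restore -U postgres -d postgres --clean /tmp/restore.dump
```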
MySQL Backup with Percona XtraBackup
For MySQL/MariaDB, Percona XtraBackup provides hot backups without locking tables:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: production
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: xtrabackup
              image: percona/percona-xtrabackup:8.0
              command:
                - /bin/bash
                - -c
                - |
                  BACKUP_DIR="/backup/mysql-$(date +%Y%m%d-%H%M%S)"
                  # Perform hot backup
                  # Note: xtrabackup copies physical files, so the MySQL data
                  # directory must be reachable from this pod (e.g. mount the
                  # MySQL PVC); --host only handles the SQL connection
                  xtrabackup \
                    --backup \
                    --host=mysql-0.mysql \
                    --user=backup \
                    --password=$MYSQL_PASSWORD \
                    --target-dir=$BACKUP_DIR
                  # Prepare backup (apply logs)
                  xtrabackup --prepare --target-dir=$BACKUP_DIR
                  # Compress and upload (image must include the Swift client)
                  tar czf $BACKUP_DIR.tar.gz $BACKUP_DIR
                  swift upload mysql-backups $BACKUP_DIR.tar.gz
                  # Cleanup
                  rm -rf $BACKUP_DIR $BACKUP_DIR.tar.gz
              env:
                - name: MYSQL_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-secret
                      key: backup-password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup
              emptyDir: {}
```
XtraBackup advantages:
- Non-blocking backups (no table locks)
- Point-in-time recovery with binary logs
- Faster than mysqldump for large databases
- Incremental backup support
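The matching restore copies the prepared backup into an empty data directory; a sketch with illustrative paths:
```bash
# Fetch and unpack the prepared backup
swift download mysql-backups mysql-20250119-030000.tar.gz
tar xzf mysql-20250119-030000.tar.gz

# --copy-back requires an empty datadir; fix ownership afterwards
xtrabackup --copy-back \
  --target-dir=backup/mysql-20250119-030000 \
  --datadir=/var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
```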
MongoDB Replica Set Backup
For MongoDB, backup from a secondary replica to avoid impacting primary performance:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: production
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: mongodump
              image: mongo:6.0
              command:
                - /bin/bash
                - -c
                - |
                  BACKUP_FILE="mongodb-$(date +%Y%m%d-%H%M%S)"
                  # Backup from secondary replica
                  mongodump \
                    --host=mongodb-1.mongodb:27017 \
                    --username=backup \
                    --password=$MONGO_PASSWORD \
                    --authenticationDatabase=admin \
                    --gzip \
                    --archive=/backup/$BACKUP_FILE.archive
                  # Upload to Swift (image must include the Swift client)
                  swift upload mongodb-backups /backup/$BACKUP_FILE.archive
                  # Cleanup
                  rm /backup/$BACKUP_FILE.archive
              env:
                - name: MONGO_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mongodb-secret
                      key: backup-password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup
              emptyDir: {}
```
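The corresponding restore replays the archive against the primary; a sketch with illustrative names (--drop replaces existing collections with the backed-up versions):
```bash
swift download mongodb-backups mongodb-20250119-040000.archive
mongorestore \
  --host=mongodb-0.mongodb:27017 \
  --username=backup --password=$MONGO_PASSWORD \
  --authenticationDatabase=admin \
  --gzip --archive=mongodb-20250119-040000.archive \
  --drop
```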
Part 6: Open-Source Solutions – Velero Integration
Why Velero?
Velero (formerly Heptio Ark) is the de facto standard for Kubernetes backup. It provides:
- Cluster resource backup (deployments, services, ConfigMaps)
- Persistent volume snapshots through CSI
- Scheduled backups with retention policies
- Disaster recovery to different clusters
- Namespace-level backup and restore
Velero complements etcd backups by capturing resource definitions plus PV data in a coordinated manner.
Installing Velero on OpenStack
Deploy Velero with OpenStack Swift as backup storage:
```bash
# Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Create credentials file for the S3-compatible endpoint
# (depending on your cloud, Swift's S3 API may expect EC2-style
# credentials created via "openstack ec2 credentials create"
# rather than your Keystone username and password)
cat > credentials-velero <<EOF
[default]
aws_access_key_id=$OS_USERNAME
aws_secret_access_key=$OS_PASSWORD
EOF

# Install Velero with Swift configuration
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --backup-location-config \
    region=default,s3ForcePathStyle="true",s3Url=https://swift.openstack.example.com \
  --snapshot-location-config region=default \
  --use-node-agent
```
Configure VolumeSnapshotClass for Velero:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapshot-class
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: cinder.csi.openstack.org
deletionPolicy: Retain
```
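Before wiring up schedules, run a one-off backup to verify the installation end to end; for example:
```bash
# Ad-hoc backup of the production namespace, including CSI snapshots
velero backup create production-adhoc \
  --include-namespaces production \
  --snapshot-volumes \
  --wait

# Confirm completion and check for partial failures
velero backup describe production-adhoc --details
```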
Creating Scheduled Backups
Backup all resources in the production namespace daily:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 1 * * *"  # 1 AM daily
  template:
    includedNamespaces:
      - production
    includeClusterResources: true
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    storageLocation: default
    volumeSnapshotLocations:
      - default
    hooks:
      resources:
        - name: postgres-backup-hook
          includedNamespaces:
            - production
          labelSelector:
            matchLabels:
              app: postgresql
          pre:
            - exec:
                command:
                  - /bin/bash
                  - -c
                  - PGPASSWORD=$POSTGRES_PASSWORD pg_dump postgres > /backup/pre-velero.sql
                onError: Fail
                timeout: 10m
```
This schedule:
- Backs up production namespace at 1 AM
- Includes all cluster-scoped resources
- Creates volume snapshots
- Retains backups for 30 days
- Runs pre-backup hooks on PostgreSQL pods
Disaster Recovery with Velero
Restore the entire production namespace to a different cluster:
```bash
# On the DR cluster, install Velero pointing to the same Swift bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config \
    region=default,s3ForcePathStyle="true",s3Url=https://swift.openstack.example.com

# List available backups
velero backup get

# Restore latest production backup
velero restore create \
  --from-backup production-daily-20250119010000 \
  --wait

# Monitor restore progress (restore names derive from the backup name;
# "velero restore get" shows the generated name)
velero restore get
velero restore describe <restore-name>
```
Recovery time: 15-30 minutes for typical namespace with multiple StatefulSets.
Velero handles:
- Recreating all Kubernetes resources in correct order
- Restoring PVs from snapshots
- Remapping PVC bindings
- Preserving namespace configurations
Part 7: Enterprise Backup with Storware
Beyond Open-Source Solutions
While etcd backups, volume snapshots, and Velero provide foundational protection, production environments need:
- Cross-platform recovery: Restore Kubernetes workloads to different OpenStack clusters or public clouds
- Application discovery: Automatically identify and protect stateful applications
- Compliance reporting: Demonstrate backup success for audits
- Performance at scale: Handle hundreds of namespaces with thousands of PVs
- Granular recovery: Restore individual files from database volumes
- Change tracking: Understand what changed between backups
This is where enterprise solutions like Storware Backup and Recovery differentiate themselves.
Storware’s OpenStack-Native Approach
Storware integrates directly with OpenStack APIs, providing:
Complete infrastructure backup: Beyond Kubernetes resources, Storware captures:
- Nova instance metadata (flavor, key pairs, security groups)
- Neutron network topology (subnets, routers, floating IPs)
- Cinder volume configurations
- Glance image dependencies
- Keystone project settings
Disaster Recovery Orchestration
Storware’s DR capabilities extend beyond simple restore:
Automated failover: When primary OpenStack cluster fails:
- Storware detects failure (health check timeout)
- Spins up replacement VMs on DR cluster
- Attaches restored Cinder volumes
- Reconfigures Neutron networking (preserves IPs where possible)
- Updates DNS records
- Validates application health
- Sends notification
Non-disruptive DR testing: Test disaster recovery without impacting production:
- Restore to isolated network segment
- Validate data integrity
- Run application smoke tests
- Measure actual RTO/RPO
- Tear down test environment
- Generate compliance report
Granular recovery options:
- Full cluster restore
- Namespace-level restore
- Individual StatefulSet restore
- Single PV restore
- File-level restore from database volume
Cost Optimization
Enterprise backup generates massive storage costs. Storware reduces expenses through:
Global deduplication: Across all backups, Storware identifies duplicate data blocks. PostgreSQL base tables that rarely change aren’t backed up repeatedly—only changed blocks transfer.
Intelligent compression: Application-aware compression adapts to data type:
- Database backups: SQL-optimized compression
- Log files: Text compression
- Binary data: Skip compression (already compressed)
Tiered storage: Automatically move old backups to cheaper storage:
- Recent backups (< 7 days): High-performance Swift
- Medium-age backups (7-90 days): Standard Swift
- Long-term retention (> 90 days): Glacier-equivalent cold storage
Bandwidth optimization: Only incremental changes transfer after initial full backup. For a 10TB database with 1% daily change rate, daily backups transfer 100GB instead of 10TB.
When Storware Makes Sense
Consider enterprise backup solutions when:
- Managing 10+ Kubernetes clusters across multiple OpenStack regions
- Compliance requirements mandate immutable backups with audit trails
- RTO targets below 30 minutes require automated orchestration
- Multi-cloud DR strategy includes AWS, Azure, or GCP as failover targets
- Team lacks deep Kubernetes expertise to build custom backup automation
- Cost of downtime exceeds cost of licensing
Part 8: Production DR Planning
Defining RTO and RPO
Before implementing any backup strategy, establish your recovery objectives:
Recovery Time Objective (RTO): Maximum tolerable downtime
- Tier 1 applications (customer-facing): 15-30 minutes
- Tier 2 applications (internal tools): 2-4 hours
- Tier 3 applications (reporting, analytics): 24 hours
Recovery Point Objective (RPO): Maximum acceptable data loss
- Financial transactions: Seconds (requires transaction log shipping)
- Customer data: Minutes (requires frequent incremental backups)
- Analytics data: Hours to days (daily backups acceptable)
Map applications to tiers:
| Application | Tier | RTO | RPO | Backup Strategy |
|---|---|---|---|---|
| Payment API | 1 | 15 min | 1 min | Velero hourly + transaction logs |
| Web Frontend | 2 | 1 hour | 1 hour | Velero + Git deployment tracking |
| PostgreSQL | 1 | 15 min | 5 min | pg_dump + WAL archiving + Velero |
| Redis Cache | 2 | 30 min | 15 min | RDB snapshots + Velero |
| Elasticsearch | 2 | 2 hours | 1 hour | Snapshot API + Velero |
| Batch Jobs | 3 | 24 hours | 24 hours | etcd backup (stateless) |
Multi-Region DR Architecture
Production-grade DR requires geographic separation:
```
Primary Region (Region-A):
├─ Kubernetes Cluster A1 (production)
│  ├─ Control Plane: 3 nodes
│  ├─ Workers: 10 nodes
│  └─ Storage: Ceph cluster (3 OSDs)
├─ Backup Infrastructure
│  ├─ etcd backups → Swift (Region-A)
│  ├─ Volume snapshots → Cinder (Region-A)
│  └─ Velero backups → Swift (Region-A)
└─ Replication Jobs
   └─ Copy to Region-B (hourly)

DR Region (Region-B):
├─ Kubernetes Cluster B1 (standby)
│  ├─ Control Plane: 3 nodes (minimal)
│  ├─ Workers: 3 nodes (scaled up during DR)
│  └─ Storage: Ceph cluster (1 OSD, expandable)
└─ Backup Storage
   ├─ Swift (Region-B) - replica of Region-A backups
   └─ S3 (AWS) - offsite copy for catastrophic failures
```
Implement automated replication:
```bash
#!/bin/bash
# Replicate Swift backups to the DR region

# Source region Swift
export OS_AUTH_URL_SRC=https://region-a.openstack.example.com:5000/v3
export OS_USERNAME_SRC=backup-user
export OS_PASSWORD_SRC=secret

# DR region Swift
export OS_AUTH_URL_DST=https://region-b.openstack.example.com:5000/v3
export OS_USERNAME_DST=backup-user
export OS_PASSWORD_DST=secret

# List all etcd backups in the source container
swift list --prefix etcd-backup- etcd-backups | \
while read backup; do
  # Check if already replicated
  OS_AUTH_URL=$OS_AUTH_URL_DST \
  OS_USERNAME=$OS_USERNAME_DST \
  OS_PASSWORD=$OS_PASSWORD_DST \
  swift stat etcd-backups "$backup" &>/dev/null
  if [ $? -ne 0 ]; then
    # Download from source
    OS_AUTH_URL=$OS_AUTH_URL_SRC \
    OS_USERNAME=$OS_USERNAME_SRC \
    OS_PASSWORD=$OS_PASSWORD_SRC \
    swift download etcd-backups "$backup"
    # Upload to destination
    OS_AUTH_URL=$OS_AUTH_URL_DST \
    OS_USERNAME=$OS_USERNAME_DST \
    OS_PASSWORD=$OS_PASSWORD_DST \
    swift upload etcd-backups "$backup"
    # Clean up
    rm "$backup"
    echo "Replicated: $backup"
  fi
done
```
DR Testing Runbook
Test disaster recovery quarterly to validate procedures and measure actual RTO:
Phase 1: Environment Validation (T+0 to T+10min)
```bash
# Verify DR cluster accessibility
kubectl --context=dr-cluster get nodes

# Check backup availability
velero backup get
swift list etcd-backups | tail -n 5

# Validate storage capacity
kubectl --context=dr-cluster get pv
```
Phase 2: Control Plane Recovery (T+10min to T+20min)
```bash
# Restore latest etcd backup
latest_backup=$(swift list etcd-backups | tail -n 1)
swift download etcd-backups $latest_backup

# Perform etcd restore on DR cluster
ETCDCTL_API=3 etcdctl snapshot restore $latest_backup \
  --data-dir=/var/lib/etcd-dr

# Verify cluster state
kubectl get namespaces
kubectl get pods -A
```
Phase 3: Workload Recovery (T+20min to T+40min)
```bash
# Restore production namespace with Velero
velero restore create dr-test-$(date +%Y%m%d) \
  --from-backup production-daily-latest \
  --namespace-mappings production:production-dr \
  --wait

# Monitor restore progress
watch kubectl get pods -n production-dr

# Verify StatefulSets
kubectl get statefulsets -n production-dr
kubectl get pvc -n production-dr
```
Phase 4: Application Validation (T+40min to T+50min)
```bash
# Check database connectivity
kubectl exec -n production-dr postgresql-0 -- psql -c "\l"

# Verify application health
kubectl exec -n production-dr api-0 -- curl -f http://localhost:8080/health

# Test external connectivity
curl -f https://dr-api.example.com/health
```
Phase 5: Documentation (T+50min to T+60min)
- Record actual RTO achieved
- Note any issues encountered
- Update runbook with lessons learned
- Generate compliance report
Backup Validation Strategy
Backups are worthless if they can’t be restored. Implement automated validation:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-validator
  namespace: backup-system
spec:
  schedule: "0 6 * * 0"  # Weekly Sunday 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: validator
              image: custom/backup-validator:latest
              command:
                - /bin/bash
                - -c
                - |
                  # Get latest Velero backup
                  BACKUP=$(velero backup get -o json | \
                    jq -r '.items | sort_by(.status.completionTimestamp) | .[-1].metadata.name')

                  # Create test namespace (capture the name once so every
                  # later step refers to the same namespace)
                  TEST_NS="backup-test-$(date +%s)"
                  kubectl create namespace $TEST_NS

                  # Restore to test namespace
                  velero restore create test-restore-$(date +%s) \
                    --from-backup $BACKUP \
                    --namespace-mappings production:$TEST_NS \
                    --wait

                  # Validate database restore (-tA strips headers and padding
                  # so the count is the only output)
                  kubectl exec -n $TEST_NS postgresql-0 -- \
                    psql -U postgres -tA -c "SELECT COUNT(*) FROM users;" | \
                    grep -q "^[0-9]\+$"
                  if [ $? -eq 0 ]; then
                    echo "SUCCESS: Backup validation passed"
                    # Send success notification
                    curl -X POST https://monitoring.example.com/alert \
                      -d "status=success&backup=$BACKUP"
                  else
                    echo "FAILURE: Backup validation failed"
                    # Send failure alert
                    curl -X POST https://monitoring.example.com/alert \
                      -d "status=failure&backup=$BACKUP"
                  fi

                  # Cleanup test namespace
                  kubectl delete namespace $TEST_NS
          restartPolicy: OnFailure
```
This validator:
- Restores latest backup to isolated namespace
- Verifies database integrity
- Tests application functionality
- Alerts on validation failures
- Cleans up test resources
Part 9: Troubleshooting Common Backup Failures
etcd Snapshot Failures
Symptom: CronJob fails with “connection refused” error
Diagnosis:
```bash
# Check etcd pod status
kubectl get pods -n kube-system | grep etcd

# Verify certificates
ls -la /etc/kubernetes/pki/etcd/

# Test etcdctl connectivity
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
```
Solution: Update the CronJob with the correct certificate paths and ensure it runs with hostNetwork access on a control plane node.
Volume Snapshot Stuck in Pending
Symptom: VolumeSnapshot remains in “Pending” state indefinitely
Diagnosis:
```bash
# Check snapshot status
kubectl describe volumesnapshot postgres-snapshot-20250119

# Check CSI driver logs
kubectl logs -n kube-system csi-cinder-controllerplugin-*

# Verify VolumeSnapshotClass
kubectl get volumesnapshotclass
```
Common causes:
- VolumeSnapshotClass not defined
- CSI driver doesn’t support snapshots
- Insufficient Cinder quota
- OpenStack credentials expired
Solution:
```bash
# Create VolumeSnapshotClass
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cinder-snapclass
driver: cinder.csi.openstack.org
deletionPolicy: Delete
EOF
```
Velero Backup Incomplete
Symptom: Velero backup shows “PartiallyFailed” status
Diagnosis:
```bash
# Get detailed backup information
velero backup describe production-daily-20250119010000

# Check backup logs
velero backup logs production-daily-20250119010000
```
Common issues:
- PVs without snapshot support (local volumes)
- Pods stuck in CrashLoopBackOff
- Pre-backup hooks timeout
- Insufficient Swift storage quota
Solution: Exclude problematic resources:
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
spec:
  excludedResources:
    - events
    - events.events.k8s.io
    - backups.velero.io
    - restores.velero.io
  labelSelector:
    matchExpressions:
      - key: velero.io/exclude-from-backup
        operator: NotIn
        values:
          - "true"
```
Out of Space During Restore
Symptom: PVC creation fails with “exceeded quota” during restore
Diagnosis:
```bash
# Check Cinder quota
openstack quota show

# Check PV usage
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase
```
Solution: Increase OpenStack quota or clean up unused PVs:
```bash
# Delete unused PVs
kubectl get pv | grep Released | awk '{print $1}' | xargs kubectl delete pv

# Request quota increase
openstack quota set --volumes 100 --gigabytes 10000 PROJECT_ID
```
Conclusion: Building a Production-Ready Backup Strategy
Comprehensive Kubernetes backup on OpenStack requires coordinated protection across three layers:
- Control plane backup: etcd snapshots every 6 hours, retained for 30 days, replicated to DR region
- Volume-level backup: CSI snapshots for fast recovery, application-aware dumps for long-term retention
- Full workload backup: Velero or enterprise solutions for complete disaster recovery
No single tool solves all backup challenges. Successful strategies combine:
- etcd backups for cluster state
- Volume snapshots for rapid recovery
- Application dumps for point-in-time consistency
- Velero for coordinated cluster-wide backups
- Enterprise solutions (like Storware) for complex multi-cluster environments
Your backup strategy should match your RTO/RPO requirements:
- Aggressive RTO (<30min): Automated failover with warm standby clusters
- Standard RTO (2-4hr): Velero + manual failover procedures
- Relaxed RTO (24hr): etcd + volume snapshots + documented recovery
Most importantly: test your backups. Quarterly DR drills expose gaps in procedures, validate RTOs, and build team muscle memory. The backup strategy that sits untested is the backup strategy that fails when you need it most.
For production OpenStack environments running business-critical Kubernetes workloads, investing in robust backup automation pays dividends. Whether you build with open-source tools or adopt enterprise solutions, the key is coverage, automation, and validation.
