
Complete Guide to Kubernetes Backup on OpenStack: StatefulSets, Persistent Volumes, and Production-Ready DR Strategies

Running Kubernetes on OpenStack combines the flexibility of open-source cloud infrastructure with the power of container orchestration. This combination offers significant cost advantages over managed cloud services while maintaining full control over your infrastructure. However, this control comes with responsibility—specifically, ensuring your stateful applications and critical data survive any disaster scenario.


The challenge with Kubernetes disaster recovery on OpenStack is complexity. Unlike managed Kubernetes services that offer integrated backup solutions, self-hosted K8s clusters require you to protect multiple layers: the cluster state stored in etcd, persistent volumes attached to pods, application data within containers, network configurations, and the OpenStack infrastructure itself.


Most organizations discover their backup strategy’s inadequacy only during a recovery scenario. A common pattern: teams back up etcd religiously, assuming they’ve protected their cluster, only to discover during recovery that all application data stored in persistent volumes is gone. Or they successfully restore volumes but lose critical network configurations, leaving applications unreachable.

This guide provides a comprehensive, production-ready approach to Kubernetes backup and disaster recovery on OpenStack. We’ll cover everything from deployment patterns through complete DR strategies, with working examples and code you can implement immediately.

Part 1: Understanding Kubernetes Storage on OpenStack

StatefulSets vs Stateless Workloads

Before diving into backup strategies, you need to understand what you’re protecting. Kubernetes workloads fall into two categories:

Stateless applications can be destroyed and recreated without data loss. A web frontend serving static content, an API gateway, or a load balancer are stateless—if they disappear, you redeploy them from images, and they’re back online. For stateless workloads, backing up your deployment YAML files and container images is sufficient.

Stateful applications require persistent data that survives pod restarts and rescheduling. Databases, message queues, caching layers, and any application maintaining local state need careful backup strategies. StatefulSets manage these workloads, providing:

  • Stable network identities (predictable DNS names)
  • Ordered deployment and scaling
  • Persistent storage that follows pods during rescheduling
  • Stable pod identifiers that persist across updates

Consider a PostgreSQL database deployed as a StatefulSet with three replicas. Each pod (postgres-0, postgres-1, postgres-2) maintains its own persistent volume. If postgres-1 fails and Kubernetes reschedules it to a different node, the same persistent volume must reattach, preserving data integrity.

Persistent Volumes, Claims, and Storage Classes

Kubernetes abstracts storage through three components:

PersistentVolume (PV) represents actual storage—a Cinder volume in OpenStack, an NFS share, or local disk. PVs exist independently of pods and persist after pod deletion.

PersistentVolumeClaim (PVC) is a request for storage. Applications specify their storage needs through PVCs (size, access modes, performance characteristics). Kubernetes binds PVCs to suitable PVs.

StorageClass defines storage types available in your cluster. On OpenStack, you might have:

  • cinder-ssd for high-performance database workloads
  • cinder-hdd for log archival and backups
  • ceph-rbd for shared storage scenarios

Critical backup implication: backing up PVs alone isn’t enough. You need to preserve:

  • The PV definition (which Cinder volume ID)
  • The PVC binding (which pod claimed which volume)
  • StorageClass configurations
  • Application state and consistency
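These definitions can be exported with kubectl before each volume backup, so bindings can be recreated after a restore. A minimal sketch, assuming kubectl access; the destination path is illustrative:

```shell
#!/bin/sh
# Export the storage objects that raw volume backups miss.
BACKUP_DIR="/backup/storage-definitions-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# PV definitions (spec.csi.volumeHandle holds the Cinder volume ID)
kubectl get pv -o yaml > "$BACKUP_DIR/persistent-volumes.yaml"

# PVC bindings across all namespaces
kubectl get pvc -A -o yaml > "$BACKUP_DIR/persistent-volume-claims.yaml"

# StorageClass configurations
kubectl get storageclass -o yaml > "$BACKUP_DIR/storage-classes.yaml"
```

Ship the directory to object storage with the rest of your backups, so the definitions travel with the data they describe.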

A concrete example of what your backup must protect:

# StatefulSet with persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: cinder-ssd
      resources:
        requests:
          storage: 100Gi

This creates three PostgreSQL pods, each with a dedicated 100Gi Cinder SSD volume. Your backup strategy must account for all three volumes plus database consistency across replicas.

OpenStack Integration: Cinder CSI Driver

The Container Storage Interface (CSI) driver connects Kubernetes to OpenStack Cinder. When a pod requests storage through a PVC, the Cinder CSI driver:

  1. Calls OpenStack API to create a Cinder volume
  2. Attaches the volume to the compute node running the pod
  3. Mounts the volume into the container at the specified path

For backups, understanding CSI is crucial because it enables:

  • Volume snapshots through CSI VolumeSnapshot API
  • Volume cloning for testing and development
  • Dynamic provisioning during restore operations

The Cinder CSI driver configuration affects your backup options:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-csi
provisioner: cinder.csi.openstack.org
parameters:
  availability: nova
  type: ssd
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Part 2: The Multi-Layer Backup Challenge

What Traditional Backups Miss

Traditional VM backup approaches fail in Kubernetes environments for several reasons:

1. Cluster State Lives in etcd

All Kubernetes objects—deployments, services, ConfigMaps, Secrets, RBAC policies—exist as entries in etcd’s key-value store. Backing up worker node disks won’t capture this state. When disaster strikes, you might recover volumes but lose:

  • Which deployment managed those volumes
  • Network policies connecting services
  • Ingress rules exposing applications
  • Custom Resource Definitions (CRDs) for operators

2. Distributed Application State

Modern applications spread across multiple pods with complex dependencies. A typical microservices stack might include:

  • Frontend pods (stateless, 3 replicas)
  • API gateway (stateless, 2 replicas)
  • Application pods (stateless, 5 replicas)
  • PostgreSQL (StatefulSet, 3 replicas)
  • Redis cache (StatefulSet, 3 replicas)
  • Message queue (StatefulSet, 3 replicas)

Backing up just the database volumes ignores the application topology. During recovery, you need to recreate the entire service mesh, including ConfigMaps with database connection strings, Secrets with credentials, and Services providing stable endpoints.

3. Application Consistency Requirements

File-level backups of database volumes often produce corrupted, unrecoverable data. Databases maintain in-memory buffers, write-ahead logs, and complex consistency mechanisms. A snapshot taken while PostgreSQL writes to multiple files simultaneously may capture half-committed transactions.

Application-aware backups require:

  • Quiescing writes before snapshot
  • Flushing buffers to disk
  • Coordinating multi-volume consistency (for sharded databases)
  • Verifying backup integrity

4. Network Configuration Preservation

Kubernetes networking is complex. Beyond basic pod-to-pod communication, production environments include:

  • LoadBalancer services with floating IPs from OpenStack Neutron
  • Ingress controllers with TLS certificates
  • NetworkPolicies restricting traffic between namespaces
  • Service mesh configurations (Istio, Linkerd)
  • DNS records pointing to services

Disaster recovery isn’t complete until applications are reachable. Losing network configurations means hours of manual reconfiguration.
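These network-layer objects can be captured alongside regular backups. A hedged sketch; the resource list and paths are assumptions, so extend it with service mesh CRDs if you run one:

```shell
#!/bin/sh
# Export network configuration so restored applications stay reachable.
OUT="/backup/network-$(date +%Y%m%d)"
mkdir -p "$OUT"

for resource in services ingresses networkpolicies; do
  kubectl get "$resource" -A -o yaml > "$OUT/$resource.yaml"
done

# LoadBalancer services map to Neutron floating IPs; record those too
openstack floating ip list -f yaml > "$OUT/floating-ips.yaml"
```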

The Three-Tier Backup Approach

Comprehensive Kubernetes backup requires protecting three distinct layers:

Tier                          What It Protects           Frequency                  Retention             Recovery Time
Tier 1: Control Plane         etcd cluster state         Every 6 hours minimum      30 days recommended   10-20 minutes
Tier 2: Persistent Volumes    Application data           Application-dependent      Based on compliance   Minutes to hours
Tier 3: Application-Aware     Consistent database dumps  With maintenance windows   Long-term archival    Application-specific


Warning: Most organizations discover gaps when they attempt their first recovery drill. The database restores successfully, but:

  • The StatefulSet manifest is lost (no way to recreate pods)
  • Service definitions are missing (database isn’t exposed)
  • ConfigMaps with connection settings are gone
  • Network policies block restored application traffic

Part 3: etcd Backup Best Practices

Understanding etcd’s Role

etcd stores every Kubernetes object as a key-value pair. When you run kubectl get pods, Kubernetes queries etcd. When you create a deployment, Kubernetes writes to etcd. The entire current cluster state lives in etcd’s database.

Without etcd backup, you cannot recover:

  • Deployment definitions
  • Service configurations
  • Namespace organization
  • RBAC policies and service accounts
  • Ingress rules
  • Custom resources
  • ConfigMaps and Secrets

Manual etcd Backup

For clusters deployed with kubeadm or Kubespray, etcd typically runs as static pods on control plane nodes. SSH into a control plane node and create a snapshot:

# Set etcd endpoints and certificates
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db

This creates a point-in-time snapshot. The snapshot file contains everything needed to rebuild your cluster state.

Automated etcd Backup

Production environments need automated, scheduled backups. Deploy a CronJob that creates etcd snapshots and uploads them to object storage:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.9-0
            command:
            - /bin/sh
            - -c
            - |
              BACKUP_FILE="etcd-backup-$(date +%Y%m%d-%H%M%S).db"
              etcdctl \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save /backup/$BACKUP_FILE
              
              # Upload to Swift object storage (the stock etcd image ships no
              # Swift client; use a backup image with python-swiftclient added)
              swift upload etcd-backups /backup/$BACKUP_FILE
              
              # Clean up local copy
              rm /backup/$BACKUP_FILE
              
              # Retention: delete backups older than 30 days
              swift list etcd-backups | \
                awk -v d=$(date -d '30 days ago' +%Y%m%d) \
                '$0 < "etcd-backup-"d' | \
                xargs -I {} swift delete etcd-backups {}
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/control-plane
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: Directory
          - name: backup
            emptyDir: {}

This CronJob:

  • Runs every 6 hours on control plane nodes
  • Creates timestamped snapshots
  • Uploads to OpenStack Swift
  • Maintains 30-day retention
  • Cleans up local storage to prevent disk exhaustion
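A backup that has never been verified is a hope, not a backup. A small sketch that pulls the newest snapshot back down and checks it; the container name matches the CronJob above:

```shell
#!/bin/sh
# Verify the most recent uploaded etcd snapshot is readable.
LATEST=$(swift list etcd-backups | sort | tail -n 1)
swift download etcd-backups "$LATEST" -o /tmp/verify.db

# 'snapshot status' reports hash, revision, and total size; a truncated
# or corrupt file fails here instead of during a real recovery
ETCDCTL_API=3 etcdctl snapshot status /tmp/verify.db -w table
rm /tmp/verify.db
```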

etcd Recovery Procedure

When disaster strikes, restore etcd before recreating any workloads:

# Stop all control plane components
systemctl stop kubelet

# Download latest backup from Swift
swift download etcd-backups etcd-backup-YYYYMMDD-HHMMSS.db

# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore etcd-backup-YYYYMMDD-HHMMSS.db \
  --name=node1 \
  --initial-cluster=node1=https://10.20.20.11:2380,node2=https://10.20.20.12:2380,node3=https://10.20.20.13:2380 \
  --initial-advertise-peer-urls=https://10.20.20.11:2380 \
  --data-dir=/var/lib/etcd-restore

# Replace existing etcd data
rm -rf /var/lib/etcd
mv /var/lib/etcd-restore /var/lib/etcd

# Restart kubelet and verify
systemctl start kubelet
kubectl get nodes

Recovery time: 10-20 minutes for etcd restore, depending on database size.

Critical limitation: etcd backup restores only cluster configuration. Persistent volume data is NOT included in etcd snapshots. Your application data remains on the Cinder volumes themselves; the etcd restore recovers the PV/PVC bindings that reconnect pods to those volumes.

Part 4: Persistent Volume Backup Strategies

Understanding Volume Snapshot Lifecycle

Kubernetes 1.17+ includes built-in volume snapshot support through the CSI driver. Snapshots create point-in-time copies of persistent volumes without disrupting running applications.

The snapshot process:

  1. VolumeSnapshotClass defines how to create snapshots (similar to StorageClass for volumes)
  2. VolumeSnapshot is a request to snapshot a specific PVC
  3. VolumeSnapshotContent represents the actual snapshot in underlying storage (Cinder snapshot)

Create a VolumeSnapshotClass for Cinder:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cinder-snapclass
driver: cinder.csi.openstack.org
deletionPolicy: Delete
parameters:
  force-create: "true"

Snapshot a PostgreSQL database volume:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot-20250119
  namespace: production
spec:
  volumeSnapshotClassName: cinder-snapclass
  source:
    persistentVolumeClaimName: data-postgresql-0

This triggers the Cinder CSI driver to call OpenStack’s snapshot API. The snapshot captures the Cinder volume state at the moment of creation.
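Snapshot creation is asynchronous: the VolumeSnapshot object exists before the underlying Cinder snapshot is usable. A minimal readiness check, using the names from the example above:

```shell
# Block until the underlying Cinder snapshot is ready to use
kubectl wait volumesnapshot/postgres-snapshot-20250119 \
  -n production \
  --for=jsonpath='{.status.readyToUse}'=true \
  --timeout=5m

# Show which VolumeSnapshotContent (and thus Cinder snapshot) backs it
kubectl get volumesnapshot postgres-snapshot-20250119 -n production \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}'
```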

Automated Volume Snapshots with CronJob

Manual snapshots don’t scale. Automate snapshot creation for all StatefulSet PVCs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pvc-snapshot
  namespace: backup-system
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-controller
          containers:
          - name: create-snapshots
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Find all PVCs owned by StatefulSets (ownerReferences appear
              # only when persistentVolumeClaimRetentionPolicy is configured;
              # otherwise select PVCs by label instead)
              for pvc in $(kubectl get pvc -A \
                -o jsonpath='{range .items[?(@.metadata.ownerReferences[0].kind=="StatefulSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
                
                NS=$(echo $pvc | cut -d/ -f1)
                PVC_NAME=$(echo $pvc | cut -d/ -f2)
                SNAPSHOT_NAME="$PVC_NAME-$(date +%Y%m%d-%H%M)"
                
                cat <<EOF | kubectl apply -f -
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: $SNAPSHOT_NAME
                namespace: $NS
                labels:
                  backup-type: automated
                  source-pvc: $PVC_NAME
              spec:
                volumeSnapshotClassName: cinder-snapclass
                source:
                  persistentVolumeClaimName: $PVC_NAME
              EOF
                
                echo "Created snapshot: $NS/$SNAPSHOT_NAME"
              done
              
              # Retention: Delete snapshots older than 7 days
              kubectl get volumesnapshot -A \
                -o json | jq -r '.items[] | 
                  select(.metadata.labels."backup-type" == "automated") |
                  select(.metadata.creationTimestamp < (now - 604800 | todate)) |
                  "\(.metadata.namespace) \(.metadata.name)"' | \
                while read ns name; do
                  kubectl delete volumesnapshot -n $ns $name
                  echo "Deleted old snapshot: $ns/$name"
                done
          restartPolicy: OnFailure

This automation:

  • Discovers all StatefulSet PVCs across all namespaces
  • Creates daily snapshots with timestamp naming
  • Implements 7-day retention (configurable)
  • Labels snapshots for easy identification

Snapshot-Based Recovery

Restore a volume from snapshot by creating a new PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-0-restored
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: cinder-ssd
  dataSource:
    name: postgres-snapshot-20250119
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi

Kubernetes creates a new Cinder volume from the snapshot and binds it to this PVC. Update your StatefulSet to use the restored PVC, or create a temporary pod for data verification.
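For verification, a throwaway pod can mount the restored claim before any StatefulSet changes. A sketch assuming the names from the example above:

```shell
# Launch a temporary pod that mounts the restored claim
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: restore-verify
  namespace: production
spec:
  containers:
  - name: inspect
    image: postgres:14
    command: ["sleep", "3600"]
    volumeMounts:
    - name: restored
      mountPath: /restored
  volumes:
  - name: restored
    persistentVolumeClaim:
      claimName: data-postgresql-0-restored
EOF

# Confirm the volume contains a PostgreSQL data directory, then clean up
kubectl exec -n production restore-verify -- ls /restored
kubectl delete pod -n production restore-verify
```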

Recovery time: 5-15 minutes for typical database volumes (100GB). Larger volumes take proportionally longer.

Snapshot limitations:

  • Snapshots live in the same Cinder backend as source volumes (vulnerable to storage backend failure)
  • Cannot snapshot across availability zones
  • No built-in compression or deduplication
  • Snapshot chains can impact performance
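The first limitation deserves special attention: a failed Cinder backend takes its snapshots down with it. One mitigation, sketched here with the Cinder backup service (requires cinder-backup configured against Swift; volume and container names are illustrative):

```shell
# Find the newest snapshot of a volume and copy it off the backend
SNAP_ID=$(openstack volume snapshot list -f value -c ID \
  --volume pvc-1234abcd | head -n 1)

# cinder-backup writes the data into the Swift container 'volumebackups',
# independent of the source volume's storage backend
openstack volume backup create \
  --snapshot "$SNAP_ID" \
  --container volumebackups \
  --name postgres-offsite-$(date +%Y%m%d) \
  pvc-1234abcd
```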

Part 5: Application-Aware Backup for Databases

Why Application Awareness Matters

Volume snapshots capture raw block data. For databases, this creates problems:

Dirty buffers: PostgreSQL, MySQL, and MongoDB maintain in-memory write buffers. A snapshot might capture half-written transactions, producing a corrupted backup.

Consistency across volumes: Sharded MongoDB clusters store data across multiple volumes. A snapshot that captures shard-1 at 02:00:05 and shard-2 at 02:00:12 creates inconsistent data.

Transaction logs: Databases use write-ahead logs (WAL). Snapshots might capture database files without corresponding WAL files, preventing recovery.

Application-aware backups solve these problems by integrating with database backup tools that ensure consistency.

PostgreSQL Backup Integration

Deploy a sidecar container alongside PostgreSQL that performs logical backups using pg_dump:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgres
  replicas: 3
  template:
    spec:
      containers:
      # Main PostgreSQL container
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: backup
          mountPath: /backup
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
      
      # Backup sidecar container
      - name: backup
        image: postgres:14
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            # Wait until 2 AM
            current_hour=$(date +%H)
            if [ "$current_hour" -eq "2" ]; then
              BACKUP_FILE="postgres-$(hostname)-$(date +%Y%m%d-%H%M%S).dump"
              
              # Perform logical backup
              PGPASSWORD=$POSTGRES_PASSWORD pg_dump \
                -h localhost \
                -U postgres \
                -F c \
                -Z 9 \
                -f /backup/$BACKUP_FILE \
                postgres
              
              # Upload to Swift (image must include python-swiftclient)
              swift upload postgres-backups /backup/$BACKUP_FILE
              
              # Clean up local copy
              rm /backup/$BACKUP_FILE
              
              echo "Backup completed: $BACKUP_FILE"
              
              # Sleep until next day
              sleep 82800  # 23 hours
            else
              sleep 3600  # Check every hour
            fi
          done
        volumeMounts:
        - name: backup
          mountPath: /backup
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
      
      volumes:
      - name: backup
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: cinder-ssd
      resources:
        requests:
          storage: 100Gi

This approach:

  • Runs pg_dump inside the pod (no network overhead)
  • Compresses backups before upload (saves storage costs)
  • Uploads to object storage (survives OpenStack failures)
  • Maintains consistency (pg_dump ensures transactional integrity)
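Restoring such a backup uses pg_restore, since pg_dump -F c emits a custom-format archive rather than plain SQL. A sketch, assuming the pod DNS name from the StatefulSet above:

```shell
# Fetch the newest archive from Swift (filename pattern from the sidecar)
BACKUP_FILE=$(swift list postgres-backups | sort | tail -n 1)
swift download postgres-backups "$BACKUP_FILE"

# --clean --if-exists drops existing objects before recreating them;
# verify against a scratch database first in production
PGPASSWORD=$POSTGRES_PASSWORD pg_restore \
  -h postgresql-0.postgres \
  -U postgres \
  -d postgres \
  --clean --if-exists \
  "$BACKUP_FILE"
```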

MySQL Backup with Percona XtraBackup

For MySQL/MariaDB, Percona XtraBackup provides hot backups without locking tables:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: production
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: xtrabackup
            image: percona/percona-xtrabackup:8.0
            command:
            - /bin/bash
            - -c
            - |
              BACKUP_DIR="/backup/mysql-$(date +%Y%m%d-%H%M%S)"
              
              # Perform hot backup
              xtrabackup \
                --backup \
                --host=mysql-0.mysql \
                --user=backup \
                --password=$MYSQL_PASSWORD \
                --target-dir=$BACKUP_DIR
              
              # Prepare backup (apply logs)
              xtrabackup --prepare --target-dir=$BACKUP_DIR
              
              # Compress and upload
              tar czf $BACKUP_DIR.tar.gz $BACKUP_DIR
              swift upload mysql-backups $BACKUP_DIR.tar.gz
              
              # Cleanup
              rm -rf $BACKUP_DIR $BACKUP_DIR.tar.gz
            env:
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: backup-password
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            emptyDir: {}

XtraBackup advantages:

  • Non-blocking backups (no table locks)
  • Point-in-time recovery with binary logs
  • Faster than mysqldump for large databases
  • Incremental backup support
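Restore reverses the CronJob above: download, extract, and copy back into an empty data directory while MySQL is stopped. A hedged sketch with illustrative paths:

```shell
# Fetch and unpack the newest prepared backup
BACKUP_TGZ=$(swift list mysql-backups | sort | tail -n 1)
swift download mysql-backups "$BACKUP_TGZ"
mkdir -p /restore && tar xzf "$BACKUP_TGZ" -C /restore

# --copy-back moves prepared files into the configured datadir;
# the datadir must be empty and mysqld stopped
xtrabackup --copy-back --target-dir=/restore/backup/mysql-20250119-030000
chown -R mysql:mysql /var/lib/mysql
```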

MongoDB Replica Set Backup

For MongoDB, backup from a secondary replica to avoid impacting primary performance:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: production
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: mongodump
            image: mongo:6.0
            command:
            - /bin/bash
            - -c
            - |
              BACKUP_FILE="mongodb-$(date +%Y%m%d-%H%M%S)"
              
              # Backup from secondary replica
              mongodump \
                --host=mongodb-1.mongodb:27017 \
                --username=backup \
                --password=$MONGO_PASSWORD \
                --authenticationDatabase=admin \
                --gzip \
                --archive=/backup/$BACKUP_FILE.archive
              
              # Upload to Swift
              swift upload mongodb-backups /backup/$BACKUP_FILE.archive
              
              # Cleanup
              rm /backup/$BACKUP_FILE.archive
            env:
            - name: MONGO_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mongodb-secret
                  key: backup-password
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            emptyDir: {}
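
Restoring mirrors the dump: pull the archive from Swift and feed it to mongorestore through the primary so the write replicates to secondaries. A sketch using the names from the CronJob above:

```shell
# Fetch the newest archive
ARCHIVE=$(swift list mongodb-backups | sort | tail -n 1)
swift download mongodb-backups "$ARCHIVE"

# --drop replaces existing collections before restoring them
mongorestore \
  --host=mongodb-0.mongodb:27017 \
  --username=backup \
  --password=$MONGO_PASSWORD \
  --authenticationDatabase=admin \
  --gzip \
  --drop \
  --archive="$ARCHIVE"
```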

Part 6: Open-Source Solutions – Velero Integration

Why Velero?

Velero (formerly Heptio Ark) is the de facto standard for Kubernetes backup. It provides:

  • Cluster resource backup (deployments, services, ConfigMaps)
  • Persistent volume snapshots through CSI
  • Scheduled backups with retention policies
  • Disaster recovery to different clusters
  • Namespace-level backup and restore

Velero complements etcd backups by capturing resource definitions plus PV data in a coordinated manner.

Installing Velero on OpenStack

Deploy Velero with OpenStack Swift as backup storage:

# Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Create credentials file for the Swift S3 API
# (Swift's S3 middleware authenticates with EC2-style credentials,
# created via: openstack ec2 credentials create)
cat > credentials-velero <<EOF
[default]
aws_access_key_id=$EC2_ACCESS_KEY
aws_secret_access_key=$EC2_SECRET_KEY
EOF

# Install Velero with Swift configuration
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --backup-location-config \
    region=default,s3ForcePathStyle="true",s3Url=https://swift.openstack.example.com \
  --snapshot-location-config region=default \
  --use-node-agent

Configure VolumeSnapshotClass for Velero:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapshot-class
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: cinder.csi.openstack.org
deletionPolicy: Retain
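
Besides schedules, one-off backups are useful before risky changes such as upgrades. A minimal example; the backup name is illustrative:

```shell
# Ad-hoc backup of the production namespace, waiting for completion
velero backup create production-pre-upgrade \
  --include-namespaces production \
  --snapshot-volumes \
  --wait

# Inspect resource counts and volume snapshot status
velero backup describe production-pre-upgrade --details
```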

Creating Scheduled Backups

Backup all resources in the production namespace daily:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 1 * * *"  # 1 AM daily
  template:
    includedNamespaces:
    - production
    includeClusterResources: true
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    storageLocation: default
    volumeSnapshotLocations:
    - default
    hooks:
      resources:
      - name: postgres-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgresql
        pre:
        - exec:
            command:
            - /bin/bash
            - -c
            - PGPASSWORD=$POSTGRES_PASSWORD pg_dump -U postgres postgres > /backup/pre-velero.sql
            onError: Fail
            timeout: 10m

This schedule:

  • Backs up production namespace at 1 AM
  • Includes all cluster-scoped resources
  • Creates volume snapshots
  • Retains backups for 30 days
  • Runs pre-backup hooks on PostgreSQL pods

Disaster Recovery with Velero

Restore the entire production namespace to a different cluster:

# On the DR cluster, install Velero pointing to same Swift bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config \
    region=default,s3ForcePathStyle="true",s3Url=https://swift.openstack.example.com

# List available backups
velero backup get

# Restore latest production backup
velero restore create \
  --from-backup production-daily-20250119010000 \
  --wait

# Monitor restore progress
velero restore describe production-daily-20250119010000

Recovery time: 15-30 minutes for typical namespace with multiple StatefulSets.

Velero handles:

  • Recreating all Kubernetes resources in correct order
  • Restoring PVs from snapshots
  • Remapping PVC bindings
  • Preserving namespace configurations

Part 7: Enterprise Backup with Storware

Beyond Open-Source Solutions

While etcd backups, volume snapshots, and Velero provide foundational protection, production environments need:

  • Cross-platform recovery: Restore Kubernetes workloads to different OpenStack clusters or public clouds
  • Application discovery: Automatically identify and protect stateful applications
  • Compliance reporting: Demonstrate backup success for audits
  • Performance at scale: Handle hundreds of namespaces with thousands of PVs
  • Granular recovery: Restore individual files from database volumes
  • Change tracking: Understand what changed between backups

This is where enterprise solutions like Storware Backup and Recovery differentiate themselves.

Storware’s OpenStack-Native Approach

Storware integrates directly with OpenStack APIs, providing:

Complete infrastructure backup: Beyond Kubernetes resources, Storware captures:

  • Nova instance metadata (flavor, key pairs, security groups)
  • Neutron network topology (subnets, routers, floating IPs)
  • Cinder volume configurations
  • Glance image dependencies
  • Keystone project settings

Disaster Recovery Orchestration

Storware’s DR capabilities extend beyond simple restore:

Automated failover: When primary OpenStack cluster fails:

  1. Storware detects failure (health check timeout)
  2. Spins up replacement VMs on DR cluster
  3. Attaches restored Cinder volumes
  4. Reconfigures Neutron networking (preserves IPs where possible)
  5. Updates DNS records
  6. Validates application health
  7. Sends notification

Non-disruptive DR testing: Test disaster recovery without impacting production:

  • Restore to isolated network segment
  • Validate data integrity
  • Run application smoke tests
  • Measure actual RTO/RPO
  • Tear down test environment
  • Generate compliance report

Granular recovery options:

  • Full cluster restore
  • Namespace-level restore
  • Individual StatefulSet restore
  • Single PV restore
  • File-level restore from database volume

Cost Optimization

Enterprise backup generates massive storage costs. Storware reduces expenses through:

Global deduplication: Across all backups, Storware identifies duplicate data blocks. PostgreSQL base tables that rarely change aren’t backed up repeatedly—only changed blocks transfer.

Intelligent compression: Application-aware compression adapts to data type:

  • Database backups: SQL-optimized compression
  • Log files: Text compression
  • Binary data: Skip compression (already compressed)

Tiered storage: Automatically move old backups to cheaper storage:

  • Recent backups (< 7 days): High-performance Swift
  • Medium-age backups (7-90 days): Standard Swift
  • Long-term retention (> 90 days): Glacier-equivalent cold storage

Bandwidth optimization: Only incremental changes transfer after initial full backup. For a 10TB database with 1% daily change rate, daily backups transfer 100GB instead of 10TB.

When Storware Makes Sense

Consider enterprise backup solutions when:

  • Managing 10+ Kubernetes clusters across multiple OpenStack regions
  • Compliance requirements mandate immutable backups with audit trails
  • RTO targets below 30 minutes require automated orchestration
  • Multi-cloud DR strategy includes AWS, Azure, or GCP as failover targets
  • Team lacks deep Kubernetes expertise to build custom backup automation
  • Cost of downtime exceeds cost of licensing

Part 8: Production DR Planning

Defining RTO and RPO

Before implementing any backup strategy, establish your recovery objectives:

Recovery Time Objective (RTO): Maximum tolerable downtime

  • Tier 1 applications (customer-facing): 15-30 minutes
  • Tier 2 applications (internal tools): 2-4 hours
  • Tier 3 applications (reporting, analytics): 24 hours

Recovery Point Objective (RPO): Maximum acceptable data loss

  • Financial transactions: Seconds (requires transaction log shipping)
  • Customer data: Minutes (requires frequent incremental backups)
  • Analytics data: Hours to days (daily backups acceptable)

Map applications to tiers:

Application   | Tier | RTO      | RPO      | Backup Strategy
--------------|------|----------|----------|----------------------------------
Payment API   | 1    | 15 min   | 1 min    | Velero hourly + transaction logs
Web Frontend  | 2    | 1 hour   | 1 hour   | Velero + Git deployment tracking
PostgreSQL    | 1    | 15 min   | 5 min    | pg_dump + WAL archiving + Velero
Redis Cache   | 2    | 30 min   | 15 min   | RDB snapshots + Velero
Elasticsearch | 2    | 2 hours  | 1 hour   | Snapshot API + Velero
Batch Jobs    | 3    | 24 hours | 24 hours | etcd backup (stateless)

Multi-Region DR Architecture

Production-grade DR requires geographic separation:

Primary Region (Region-A):
├─ Kubernetes Cluster A1 (production)
│  ├─ Control Plane: 3 nodes
│  ├─ Workers: 10 nodes
│  └─ Storage: Ceph cluster (3 OSDs)
├─ Backup Infrastructure
│  ├─ etcd backups → Swift (Region-A)
│  ├─ Volume snapshots → Cinder (Region-A)
│  └─ Velero backups → Swift (Region-A)
└─ Replication Jobs
   └─ Copy to Region-B (hourly)

DR Region (Region-B):
├─ Kubernetes Cluster B1 (standby)
│  ├─ Control Plane: 3 nodes (minimal)
│  ├─ Workers: 3 nodes (scaled up during DR)
│  └─ Storage: Ceph cluster (1 OSD, expandable)
└─ Backup Storage
   ├─ Swift (Region-B) - replica of Region-A backups
   └─ S3 (AWS) - offsite copy for catastrophic failures

Implement automated replication:

#!/bin/bash
# Replicate Swift backups to the DR region

# Source region Swift credentials
export OS_AUTH_URL_SRC=https://region-a.openstack.example.com:5000/v3
export OS_USERNAME_SRC=backup-user
export OS_PASSWORD_SRC=secret

# DR region Swift credentials
export OS_AUTH_URL_DST=https://region-b.openstack.example.com:5000/v3
export OS_USERNAME_DST=backup-user
export OS_PASSWORD_DST=secret

# List etcd backups in the source region
OS_AUTH_URL=$OS_AUTH_URL_SRC \
OS_USERNAME=$OS_USERNAME_SRC \
OS_PASSWORD=$OS_PASSWORD_SRC \
swift list --prefix etcd-backup- etcd-backups | \
  while read -r backup; do
    # Skip objects already present in the DR region
    if ! OS_AUTH_URL=$OS_AUTH_URL_DST \
         OS_USERNAME=$OS_USERNAME_DST \
         OS_PASSWORD=$OS_PASSWORD_DST \
         swift stat etcd-backups "$backup" &>/dev/null; then
      # Download from source
      OS_AUTH_URL=$OS_AUTH_URL_SRC \
      OS_USERNAME=$OS_USERNAME_SRC \
      OS_PASSWORD=$OS_PASSWORD_SRC \
      swift download etcd-backups "$backup"
      
      # Upload to destination
      OS_AUTH_URL=$OS_AUTH_URL_DST \
      OS_USERNAME=$OS_USERNAME_DST \
      OS_PASSWORD=$OS_PASSWORD_DST \
      swift upload etcd-backups "$backup"
      
      # Remove the local copy
      rm "$backup"
      
      echo "Replicated: $backup"
    fi
  done
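To get the hourly cadence shown in the architecture diagram, the script can run from cron; the script path and log location below are hypothetical.

```shell
# /etc/cron.d/swift-backup-replication (hypothetical path)
# Run the replication script at the top of every hour as root
0 * * * * root /usr/local/bin/replicate-swift-backups.sh >> /var/log/swift-replication.log 2>&1
```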

DR Testing Runbook

Test disaster recovery quarterly to validate procedures and measure actual RTO:

Phase 1: Environment Validation (T+0 to T+10min)

# Verify DR cluster accessibility
kubectl --context=dr-cluster get nodes

# Check backup availability
velero backup get
swift list etcd-backups | tail -n 5

# Validate storage capacity
kubectl --context=dr-cluster get pv

Phase 2: Control Plane Recovery (T+10min to T+20min)

# Restore latest etcd backup
latest_backup=$(swift list etcd-backups | tail -n 1)
swift download etcd-backups "$latest_backup"

# Perform etcd restore on DR cluster
ETCDCTL_API=3 etcdctl snapshot restore "$latest_backup" \
  --data-dir=/var/lib/etcd-dr
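The restored snapshot is not live until etcd actually serves from the new data directory. On a kubeadm-managed cluster that means retargeting the etcd static pod manifest; the sketch below demonstrates the edit on a scratch copy, while on a real control-plane node the file is /etc/kubernetes/manifests/etcd.yaml and kubelet restarts etcd automatically after the change.

```shell
# Demo on a scratch copy; use /etc/kubernetes/manifests/etcd.yaml on a real node
mkdir -p /tmp/dr-demo
cat > /tmp/dr-demo/etcd.yaml <<'EOF'
    - --data-dir=/var/lib/etcd
      path: /var/lib/etcd
EOF
# Point every data-dir reference at the restored directory
sed -i 's|/var/lib/etcd|/var/lib/etcd-dr|g' /tmp/dr-demo/etcd.yaml
grep -c '/var/lib/etcd-dr' /tmp/dr-demo/etcd.yaml   # → 2
```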

# Verify cluster state
kubectl get namespaces
kubectl get pods -A

Phase 3: Workload Recovery (T+20min to T+40min)

# Restore production namespace with Velero
velero restore create dr-test-$(date +%Y%m%d) \
  --from-backup production-daily-latest \
  --namespace-mappings production:production-dr \
  --wait

# Monitor restore progress
watch kubectl get pods -n production-dr

# Verify StatefulSets
kubectl get statefulsets -n production-dr
kubectl get pvc -n production-dr

Phase 4: Application Validation (T+40min to T+50min)

# Check database connectivity
kubectl exec -n production-dr postgresql-0 -- psql -c "\l"

# Verify application health
kubectl exec -n production-dr api-0 -- curl -f http://localhost:8080/health

# Test external connectivity
curl -f https://dr-api.example.com/health

Phase 5: Documentation (T+50min to T+60min)

  • Record actual RTO achieved
  • Note any issues encountered
  • Update runbook with lessons learned
  • Generate compliance report

Backup Validation Strategy

Backups are worthless if they can’t be restored. Implement automated validation:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-validator
  namespace: backup-system
spec:
  schedule: "0 6 * * 0"  # Weekly Sunday 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: validator
            image: custom/backup-validator:latest
            command:
            - /bin/bash
            - -c
            - |
              # Get the most recently completed Velero backup
              BACKUP=$(velero backup get -o json | \
                jq -r '.items | sort_by(.status.completionTimestamp) | .[-1].metadata.name')
              
              # Use a single timestamped name so the namespace we create,
              # restore into, and delete is the same one
              TEST_NS="backup-test-$(date +%s)"
              
              # Create test namespace
              kubectl create namespace "$TEST_NS"
              
              # Restore to test namespace
              velero restore create "test-restore-$(date +%s)" \
                --from-backup "$BACKUP" \
                --namespace-mappings "production:$TEST_NS" \
                --wait
              
              # Validate database restore (-tA prints the bare count)
              if kubectl exec -n "$TEST_NS" postgresql-0 -- \
                   psql -tAc "SELECT COUNT(*) FROM users;" | \
                   grep -q '^[0-9]\+$'; then
                echo "SUCCESS: Backup validation passed"
                # Send success notification
                curl -X POST https://monitoring.example.com/alert \
                  -d "status=success&backup=$BACKUP"
              else
                echo "FAILURE: Backup validation failed"
                # Send failure alert
                curl -X POST https://monitoring.example.com/alert \
                  -d "status=failure&backup=$BACKUP"
              fi
              
              # Clean up test namespace
              kubectl delete namespace "$TEST_NS"
          restartPolicy: OnFailure

This validator:

  • Restores latest backup to isolated namespace
  • Verifies database integrity
  • Tests application functionality
  • Alerts on validation failures
  • Cleans up test resources

Part 9: Troubleshooting Common Backup Failures

etcd Snapshot Failures

Symptom: CronJob fails with “connection refused” error

Diagnosis:

# Check etcd pod status
kubectl get pods -n kube-system | grep etcd

# Verify certificates
ls -la /etc/kubernetes/pki/etcd/

# Test etcdctl connectivity
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

Solution: Update the CronJob with the correct certificate paths and ensure it runs with hostNetwork: true so it can reach etcd on localhost.
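Expired certificates are another frequent cause of the “connection refused” symptom. openssl can check validity directly; the demo below generates a throwaway self-signed cert under /tmp, while on a control-plane node you would point openssl at /etc/kubernetes/pki/etcd/server.crt instead.

```shell
# Throwaway cert for the demo (real path: /etc/kubernetes/pki/etcd/server.crt)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=etcd-demo" \
  -keyout /tmp/etcd-demo.key -out /tmp/etcd-demo.crt 2>/dev/null
# -checkend 0 exits non-zero if the certificate has already expired
openssl x509 -in /tmp/etcd-demo.crt -noout -checkend 0 \
  && echo "certificate still valid"
```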

Volume Snapshot Stuck in Pending

Symptom: VolumeSnapshot remains in “Pending” state indefinitely

Diagnosis:

# Check snapshot status
kubectl describe volumesnapshot postgres-snapshot-20250119

# Check CSI driver logs (kubectl does not expand wildcards; select by label —
# the label name may differ depending on your manifests)
kubectl logs -n kube-system -l app=csi-cinder-controllerplugin

# Verify VolumeSnapshotClass
kubectl get volumesnapshotclass

Common causes:

  • VolumeSnapshotClass not defined
  • CSI driver doesn’t support snapshots
  • Insufficient Cinder quota
  • OpenStack credentials expired

Solution:

# Create VolumeSnapshotClass
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cinder-snapclass
driver: cinder.csi.openstack.org
deletionPolicy: Delete
EOF

Velero Backup Incomplete

Symptom: Velero backup shows “PartiallyFailed” status

Diagnosis:

# Get detailed backup information
velero backup describe production-daily-20250119010000

# Check backup logs
velero backup logs production-daily-20250119010000

Common issues:

  • PVs without snapshot support (local volumes)
  • Pods stuck in CrashLoopBackOff
  • Pre-backup hooks timeout
  • Insufficient Swift storage quota

Solution: Exclude problematic resources:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
spec:
  excludedResources:
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  labelSelector:
    matchExpressions:
    - key: velero.io/exclude-from-backup
      operator: NotIn
      values:
      - "true"

Out of Space During Restore

Symptom: PVC creation fails with “exceeded quota” during restore

Diagnosis:

# Check Cinder quota
openstack quota show

# Check PV usage
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase

Solution: Increase OpenStack quota or clean up unused PVs:

# Delete unused PVs
kubectl get pv | grep Released | awk '{print $1}' | xargs -r kubectl delete pv

# Request quota increase
openstack quota set --volumes 100 --gigabytes 10000 PROJECT_ID

Conclusion: Building a Production-Ready Backup Strategy

Comprehensive Kubernetes backup on OpenStack requires coordinated protection across three layers:

  1. Control plane backup: etcd snapshots every 6 hours, retained for 30 days, replicated to DR region
  2. Volume-level backup: CSI snapshots for fast recovery, application-aware dumps for long-term retention
  3. Full workload backup: Velero or enterprise solutions for complete disaster recovery

No single tool solves all backup challenges. Successful strategies combine:

  • etcd backups for cluster state
  • Volume snapshots for rapid recovery
  • Application dumps for point-in-time consistency
  • Velero for coordinated cluster-wide backups
  • Enterprise solutions (like Storware) for complex multi-cluster environments

Your backup strategy should match your RTO/RPO requirements:

  • Aggressive RTO (<30min): Automated failover with warm standby clusters
  • Standard RTO (2-4hr): Velero + manual failover procedures
  • Relaxed RTO (24hr): etcd + volume snapshots + documented recovery

Most importantly: test your backups. Quarterly DR drills expose gaps in procedures, validate RTOs, and build team muscle memory. The backup strategy that sits untested is the backup strategy that fails when you need it most.

For production OpenStack environments running business-critical Kubernetes workloads, investing in robust backup automation pays dividends. Whether you build with open-source tools or adopt enterprise solutions, the key is coverage, automation, and validation.

About Storware

Storware Backup and Recovery provides enterprise-grade data protection for OpenStack, Kubernetes, and hybrid cloud environments. Our OpenStack-native approach ensures comprehensive backup coverage with application-aware protection, automated DR orchestration, and multi-cloud recovery capabilities.

Learn more about Storware for OpenStack →

 

text written by:

Łukasz Błocki, Professional Services Architect