# Restore Procedure Runbook

## Overview

This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.

**Tested Applications:**

- **n8n** (monthly): Stateful app with PostgreSQL database. Has independent flow backups as a safety net.
- **mailhog** (monthly): Stateless app for SMTP testing.
- **gitea** (one-shot validation): Full production restore to validate the entire strategy.

## Prerequisites

- kubectl/oc access to the cluster
- Velero CLI installed (optional but helpful)
- ArgoCD access to pause reconciliation
- Recent backup available for the target namespace

## Set Variables

```bash
VELERO_NS=openshift-adp
SRC_NS=n8n                       # Namespace to restore (n8n, mailhog, etc.)
TS=$(date +%Y%m%d-%H%M%S)
RESTORE_NAME=${SRC_NS}-restore-${TS}
BACKUP_NAME=daily-stateful-*     # Placeholder; Step 3 resolves the exact backup name
```

## Step 1: Pause GitOps Reconciliation

Pause ArgoCD to prevent it from recreating resources while we're testing the restore:

```bash
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": []}}' --type merge

# Or via ArgoCD UI: edit the Application and set Auto-Sync to Manual
```

> **Why**: GitOps will try to recreate namespaces/resources as they're deleted, interfering with the restore test.

## Step 2: Delete Target Namespace

```bash
echo "Deleting namespace: $SRC_NS"
oc delete ns $SRC_NS --wait=true

# Verify it's gone
oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
```

> **Note**: PersistentVolumes and backups remain intact.
## Step 3: Get Latest Backup Name

```bash
# List recent backups and check which include $SRC_NS
velero backup get

# Or via kubectl: pick the most recent backup
BACKUP_NAME=$(oc -n $VELERO_NS get backup \
  --sort-by=.metadata.creationTimestamp \
  -o jsonpath='{.items[-1].metadata.name}')
echo "Using backup: $BACKUP_NAME"
```

## Step 4: Create Restore Resource

```bash
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ${RESTORE_NAME}
  namespace: ${VELERO_NS}
spec:
  backupName: ${BACKUP_NAME}
  includedNamespaces:
    - ${SRC_NS}
  restorePVs: true
EOF
```

## Step 5: Monitor the Restore

```bash
# Watch until the phase reaches Completed (not PartiallyFailed)
oc -n $VELERO_NS get restore $RESTORE_NAME -w
oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}'
```

## Step 6: Verify Pods and PVCs

```bash
oc -n $SRC_NS get pods     # All pods should reach Running
oc -n $SRC_NS get pvc      # All PVCs should be Bound
```

## Step 7: Validate Application Data

- **n8n**: Log in to the UI and confirm workflows and credentials are present; check PostgreSQL logs for errors.
- **mailhog**: Confirm the UI loads and SMTP accepts test mail.

## Step 8: Resume GitOps Reconciliation

```bash
# Reverse the Step 1 patch: re-apply the AppProject manifest from Git,
# or re-enable Auto-Sync in the ArgoCD UI. Then watch reconciliation settle:
oc -n $SRC_NS get pods -w
```

> If you see repeated reconciliation or conflicts, check:
> - Are there immutable fields that changed?
> - Did Velero inject labels that conflict with Helm?
> - Is GitOps trying to scale/restart pods?

## Step 9: Cleanup

```bash
# Delete the restore resource
oc -n $VELERO_NS delete restore $RESTORE_NAME

# (Namespace stays running - that's the point!)
echo "✓ Restore test complete. $SRC_NS is now running from backup."
```

## Troubleshooting

### Restore shows PartiallyFailed

```bash
oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
velero restore logs $RESTORE_NAME   # If the Velero CLI is installed
```

### Pods stuck in Pending

```bash
oc -n $SRC_NS describe pod
oc -n $SRC_NS get pvc      # Check if PVCs are bound
oc get pv | grep $SRC_NS
```

### Data looks wrong

- Check that you restored the correct backup
- For databases (n8n, postgres): check logs for corruption warnings
- If corrupted: re-delete the namespace and restore from an earlier backup

## Testing Schedule

- **Monthly**: n8n and mailhog (in-place, validated)
- **One-shot after major changes**: Full application restores to validate the strategy
- **After backup retention policy changes**: Restore the oldest available backup to verify

## Success Criteria

- ✅ Namespace deleted cleanly
- ✅ Restore completes without PartiallyFailed
- ✅ All pods reach Running state
- ✅ Application data is intact and queryable
- ✅ UI/APIs respond correctly
- ✅ GitOps reconciliation completes without conflicts
- ✅ velero.io/restore-name label visible on resources
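
## Appendix: Parallel Restore (Not Used)

For reference, the parallel-restore alternative rejected in the overview would use Velero's `namespaceMapping` field to restore into a copy of the namespace instead of replacing the original. A minimal sketch, assuming n8n as the source; the `n8n-restore-test` target namespace and the backup name are illustrative placeholders:

```yaml
# Sketch only - this runbook deliberately does NOT use this approach.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: n8n-restore-parallel       # illustrative name
  namespace: openshift-adp
spec:
  backupName: <backup-name>        # resolve as in Step 3
  includedNamespaces:
    - n8n
  namespaceMapping:
    n8n: n8n-restore-test          # restore into a parallel namespace
  restorePVs: true
```

This keeps the original namespace running during the test, but the cloned PVCs consume extra storage and cluster-scoped resources (e.g. Routes claiming the same hostname) can still collide, which is why the in-place approach above is preferred here.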