diff --git a/RESTORE.md b/RESTORE.md
index f424093..751fa01 100644
--- a/RESTORE.md
+++ b/RESTORE.md
@@ -1,22 +1,69 @@
 # Restore Procedure Runbook
+
+## Overview
+
+This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.
+
+**Tested Applications:**
+- **n8n** (monthly): Stateful app with a PostgreSQL database. Has independent flow backups as a safety net.
+- **mailhog** (monthly): Stateless app for SMTP testing.
+- **gitea** (one-shot validation): Full production restore to validate the entire strategy.
+
+## Prerequisites
+
+- kubectl/oc access to the cluster
+- Velero CLI installed (optional but helpful)
+- ArgoCD access to pause reconciliation
+- Recent backup available for the target namespace
+
 ## Set Variables
 ```bash
 VELERO_NS=openshift-adp
-SRC_NS=n8n
+SRC_NS=n8n  # Namespace to restore (n8n, mailhog, etc.)
 TS=$(date +%Y%m%d-%H%M%S)
-DST_NS=n8n-restore-test-$TS
-RESTORE_NAME=n8n-restore-test-$TS
-TEST_HOST=n8n-restore-$TS.apilab.us
+RESTORE_NAME=${SRC_NS}-restore-${TS}
+BACKUP_NAME="daily-stateful-*"  # Use the most recent backup, or specify an exact name (see Step 3)
 ```
-## Create Namespace
+
+## Step 1: Pause GitOps Reconciliation
+
+Pause ArgoCD to prevent it from recreating resources during the restore test:
+
 ```bash
-oc create ns $DST_NS
+oc patch appproject infrastructure -n openshift-gitops \
+  -p '{"spec": {"sourceNamespaces": []}}' --type merge
+
+# Or via ArgoCD UI: Edit the Application, set Auto-Sync to Manual
 ```
-## Apply Restore
+
+> **Why**: GitOps would otherwise try to recreate namespaces/resources as they are deleted, interfering with the restore test.
+
+## Step 2: Delete Target Namespace
+
+```bash
+echo "Deleting namespace: $SRC_NS"
+oc delete ns $SRC_NS --wait=true
+
+# Verify it's gone
+oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
+```
+
+> **Note**: The Velero backups (including backed-up volume data) remain intact; the restore recreates the namespace's PVCs from the backup.
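+
+Before restoring, it is worth a quick check that OADP/Velero itself is healthy (a minimal sanity check; exact resource names can vary slightly between installs):
+
+```bash
+# Velero (and node-agent, if enabled) pods should be Running
+oc -n $VELERO_NS get pods
+
+# The backup storage location should report Phase=Available
+oc -n $VELERO_NS get backupstoragelocation
+```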
+
+## Step 3: Get Latest Backup Name
+
+```bash
+# List recent backups and pick the latest one that includes the namespace
+velero backup get
+
+# Or via kubectl (sorted by creation time, newest last)
+BACKUP_NAME=$(oc -n $VELERO_NS get backup --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')
+echo "Using backup: $BACKUP_NAME"
+```
+
+## Step 4: Create Restore Resource
+
+```bash
+# Minimal Restore spec; adjust if your backups need extra restore options
+cat <<EOF | oc apply -f -
+apiVersion: velero.io/v1
+kind: Restore
+metadata:
+  name: ${RESTORE_NAME}
+  namespace: ${VELERO_NS}
+spec:
+  backupName: ${BACKUP_NAME}
+  includedNamespaces:
+    - ${SRC_NS}
+  restorePVs: true
+EOF
+```
-curl https://$TEST_HOST/ >/dev/null && echo "PASS: UI reachable"
+
+## Step 5: Wait for the Restore to Complete
+
+```bash
+# Watch until the phase reaches Completed (or PartiallyFailed)
+oc -n $VELERO_NS get restore $RESTORE_NAME -w
+
+# Or via Velero CLI
+velero restore describe $RESTORE_NAME
+```
+
+## Step 6: Validate the Application
+
+### For n8n:
+
+```bash
+oc -n $SRC_NS rollout status deployment/n8n --timeout=5m
+
+# Verify the UI is responding (n8n listens on 5678 by default)
+oc -n $SRC_NS port-forward svc/n8n 5678:5678 &
+sleep 2
+curl -s http://localhost:5678/ | head -20
+kill %1
+
+# Log into the UI and confirm workflows and credentials are present
+```
+
+### For mailhog:
+
+```bash
+oc -n $SRC_NS rollout status deployment/mailhog --timeout=5m
+
+# Verify service is responding
+oc -n $SRC_NS port-forward svc/mailhog 1025:1025 8025:8025 &
+sleep 2
+curl -s http://localhost:8025/ | head -20
+kill %1
 ```
-## Cleanup
+
+### For any application:
+
 ```bash
+# General validation
+oc -n $SRC_NS get all
+oc -n $SRC_NS get pvc
+oc -n $SRC_NS get secrets
+oc -n $SRC_NS get configmap
+
+# Check velero labels (proof of restore)
+oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}'
+```
+
+## Step 7: Resume GitOps Reconciliation
+
+```bash
+# Re-enable ArgoCD
+oc patch appproject infrastructure -n openshift-gitops \
+  -p '{"spec": {"sourceNamespaces": ["*"]}}' --type merge
+
+# Or via ArgoCD UI: Re-enable Auto-Sync
+
+# Monitor for reconciliation
+watch -n 5 "oc -n openshift-gitops get applications.argoproj.io -l argocd.argoproj.io/instance=infrastructure"
+```
+
+## Step 8: Monitor for Reconciliation Flapping
+
+```bash
+# Watch for any conflicts or drift
+oc -n $SRC_NS get events --sort-by='.lastTimestamp' | tail -20
+
+# Check that deployments are stable
+oc -n $SRC_NS rollout status deployment/$SRC_NS --timeout=5m
+
+# Verify no pending changes in ArgoCD
+argocd app get infrastructure-$SRC_NS  # Check sync status
+```
+
+> If you see repeated reconciliation or conflicts, check:
+> - Are there immutable fields that changed?
+> - Did Velero inject labels that conflict with Helm?
+> - Is GitOps trying to scale/restart pods?
+
+## Step 9: Cleanup
+
+```bash
+# Delete the restore resource
 oc -n $VELERO_NS delete restore $RESTORE_NAME
-oc delete ns $DST_NS
-```
\ No newline at end of file
+
+# (Namespace stays running - that's the point!)
+echo "✓ Restore test complete. $SRC_NS is now running from backup."
+```
+
+## Troubleshooting
+
+### Restore shows PartiallyFailed
+
+```bash
+oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
+velero restore logs $RESTORE_NAME  # If the Velero CLI is installed
+```
+
+### Pods stuck in Pending
+
+```bash
+oc -n $SRC_NS describe pod
+oc -n $SRC_NS get pvc  # Check if PVCs are bound
+oc get pv | grep $SRC_NS
+```
+
+### Data looks wrong
+
+- Check that you restored the correct backup
+- For databases (n8n, postgres): check the logs for corruption warnings
+- If corrupted: re-delete the namespace and restore from an earlier backup
+
+## Testing Schedule
+
+- **Monthly**: n8n and mailhog (in-place, validated)
+- **One-shot after major changes**: Full application restores to validate the strategy
+- **After backup retention policy changes**: Restore the oldest available backup to verify it is still usable
+
+## Success Criteria
+
+- ✅ Namespace deleted cleanly
+- ✅ Restore completes without PartiallyFailed
+- ✅ All pods reach Running state
+- ✅ Application data is intact and queryable
+- ✅ UI/APIs respond correctly
+- ✅ GitOps reconciliation completes without conflicts
+- ✅ `velero.io/restore-name` label visible on resources
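+
+If you want a scripted spot-check of these criteria, a minimal sketch (assuming $VELERO_NS, $SRC_NS, and $RESTORE_NAME are still set from the steps above):
+
+```bash
+# Restore phase should print "Completed"
+oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}'; echo
+
+# Should print no pods (nothing outside Running/Succeeded)
+oc -n $SRC_NS get pods --field-selector=status.phase!=Running,status.phase!=Succeeded
+
+# Restored workloads should carry the velero.io/restore-name label
+oc -n $SRC_NS get deployment --show-labels | grep velero.io/restore-name
+```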