Update RESTORE.md with tested in-place restore procedure
# Restore Procedure Runbook
## Overview
This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.

**Tested Applications:**

- **n8n** (monthly): Stateful app with PostgreSQL database. Has independent flow backups as safety net.
- **mailhog** (monthly): Stateless app for SMTP testing.
- **gitea** (one-shot validation): Full production restore to validate entire strategy.

## Prerequisites

- kubectl/oc access to cluster
- Velero CLI installed (optional but helpful)
- ArgoCD access to pause reconciliation
- Recent backup available for target namespace

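A quick preflight sketch that checks these prerequisites before starting (the `velero` CLI check is optional, matching the note above; the `openshift-adp` namespace matches `VELERO_NS` set below):

```bash
# Preflight: cluster access, optional velero CLI, and at least one existing backup
oc whoami || echo "WARN: not logged in to the cluster"
command -v velero >/dev/null && echo "velero CLI found" || echo "WARN: velero CLI not found (optional)"
oc -n openshift-adp get backup --no-headers | tail -5
```
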
## Set Variables

```bash
VELERO_NS=openshift-adp
SRC_NS=n8n # Namespace to restore (n8n, mailhog, etc)
TS=$(date +%Y%m%d-%H%M%S)
RESTORE_NAME=${SRC_NS}-restore-${TS}
BACKUP_NAME=daily-stateful-* # Use most recent backup, or specify exact name (see Step 3)
```

## Step 1: Pause GitOps Reconciliation

Pause ArgoCD to prevent it from recreating resources while we're testing restore:

```bash
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": []}}' --type merge

# Or via ArgoCD UI: Edit the Application, set Auto-Sync to Manual
```

> **Why**: GitOps will try to recreate namespaces/resources as they're deleted, interfering with the restore test.

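To confirm the pause took effect (assuming you used the `appproject` patch rather than the UI), read the field back; it should print an empty list:

```bash
# Empty output (or []) means no source namespaces are being reconciled
oc -n openshift-gitops get appproject infrastructure -o jsonpath='{.spec.sourceNamespaces}{"\n"}'
```
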
## Step 2: Delete Target Namespace

```bash
echo "Deleting namespace: $SRC_NS"
oc delete ns $SRC_NS --wait=true

# Verify it's gone
oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
```

> **Note**: PersistentVolumes and backups remain intact.

## Step 3: Get Latest Backup Name

```bash
# List recent backups for the namespace
velero backup get --filter=includedNamespaces=$SRC_NS

# Or via kubectl
BACKUP_NAME=$(oc -n $VELERO_NS get backup -o jsonpath='{.items[-1].metadata.name}')
echo "Using backup: $BACKUP_NAME"
```

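The `items[-1]` shortcut above returns whichever backup is listed last, which is not necessarily the newest one for `$SRC_NS`. A hedged alternative (assuming the `daily-stateful-` naming from Set Variables) sorts by creation time explicitly:

```bash
# Newest backup whose name matches the schedule prefix (assumption: daily-stateful-*)
BACKUP_NAME=$(oc -n $VELERO_NS get backup \
  --sort-by=.metadata.creationTimestamp -o name \
  | grep daily-stateful- | tail -1 | cut -d/ -f2)
echo "Using backup: $BACKUP_NAME"
```
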
## Step 4: Create Restore Resource

```bash
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: $RESTORE_NAME
  namespace: $VELERO_NS
spec:
  backupName: $BACKUP_NAME
  includeClusterResources: false
  includedNamespaces:
  - $SRC_NS
  restorePVs: true
  excludedResources:
  - routes.route.openshift.io # Routes are environment-specific
EOF
```

## Step 5: Monitor Restore Progress

```bash
# Watch restore status
watch -n 5 "oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}{\"\\n\"}'"

# When complete, check for errors
oc -n $VELERO_NS describe restore $RESTORE_NAME
```

**Expected phases:** New → InProgress → Completed (or PartiallyFailed)

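For unattended runs, a small polling loop can block until the restore reaches a terminal phase. This is only a sketch; the ~30-minute ceiling is an assumption, so adjust it to your backup size:

```bash
# Poll until the restore is Completed, PartiallyFailed, or Failed (assumed ceiling: ~30 min)
for i in $(seq 1 180); do
  PHASE=$(oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}')
  echo "Restore phase: ${PHASE:-<none>}"
  case "$PHASE" in
    Completed|PartiallyFailed|Failed) break ;;
  esac
  sleep 10
done
```
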
## Step 6: Validate Application Functionality

### For n8n:

```bash
# Wait for pods to be ready
oc -n $SRC_NS rollout status statefulset/postgres --timeout=5m
oc -n $SRC_NS rollout status deployment/n8n --timeout=5m

# Check if data is intact
oc -n $SRC_NS logs -l app.kubernetes.io/name=n8n -c n8n --tail=50 | grep -i "started\|error\|failed"

# Port-forward and test UI
oc -n $SRC_NS port-forward svc/n8n 5678:5678 &
sleep 2
curl -s http://localhost:5678/healthz | jq .
kill %1
```

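Log output only proves the service started; a database-level spot check confirms the workflow data itself came back. This is a sketch only: the `postgres` StatefulSet name comes from the rollout check above, but the `n8n` database/user and the `workflow_entity` table are assumptions about a default n8n-on-PostgreSQL install:

```bash
# Count workflows in the restored database (assumed: db/user "n8n", table "workflow_entity")
oc -n $SRC_NS exec statefulset/postgres -- \
  psql -U n8n -d n8n -t -c 'SELECT count(*) FROM workflow_entity;'
```
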
### For mailhog:

```bash
oc -n $SRC_NS rollout status deployment/mailhog --timeout=5m

# Verify service is responding
oc -n $SRC_NS port-forward svc/mailhog 1025:1025 8025:8025 &
sleep 2
curl -s http://localhost:8025/ | head -20
kill %1
```

### For any application:

```bash
# General validation
oc -n $SRC_NS get all
oc -n $SRC_NS get pvc
oc -n $SRC_NS get secrets
oc -n $SRC_NS get configmap

# Check velero labels (proof of restore)
oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}'
```

## Step 7: Resume GitOps Reconciliation

```bash
# Re-enable ArgoCD
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": ["*"]}}' --type merge

# Or via ArgoCD UI: Re-enable Auto-Sync

# Monitor for reconciliation
watch -n 5 "oc -n openshift-gitops get applications.argoproj.io -l argocd.argoproj.io/instance=infrastructure"
```

## Step 8: Monitor for Reconciliation Flapping

```bash
# Watch for any conflicts or drift
oc -n $SRC_NS get events --sort-by='.lastTimestamp' | tail -20

# Check if deployments are stable
oc -n $SRC_NS rollout status deployment/$SRC_NS --timeout=5m

# Verify no pending changes in ArgoCD
argocd app get infrastructure-$SRC_NS # Check sync status
```

> If you see repeated reconciliation or conflicts, check:
> - Are there immutable fields that changed?
> - Did Velero inject labels that conflict with Helm?
> - Is GitOps trying to scale/restart pods?

## Step 9: Cleanup

```bash
# Delete the restore resource
oc -n $VELERO_NS delete restore $RESTORE_NAME

# (Namespace stays running - that's the point!)
echo "✓ Restore test complete. $SRC_NS is now running from backup."
```

## Troubleshooting

### Restore shows PartiallyFailed

```bash
oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
velero restore logs $RESTORE_NAME # If Velero CLI is installed
```

### Pods stuck in Pending

```bash
oc -n $SRC_NS describe pod <pod-name>
oc -n $SRC_NS get pvc # Check if PVCs are bound
oc get pv | grep $SRC_NS
```

### Data looks wrong

- Check if you restored the correct backup (the command below shows which backup a restore actually used)
- For databases (n8n, postgres): Check logs for corruption warnings
- If corrupted: Re-delete namespace and restore from an earlier backup

## Testing Schedule

- **Monthly**: n8n and mailhog (in-place, validated)
- **One-shot after major changes**: Full application restores to validate strategy
- **After backup retention policy changes**: Restore oldest available backup to verify

## Success Criteria

- ✅ Namespace deleted cleanly
- ✅ Restore completes without PartiallyFailed
- ✅ All pods reach Running state
- ✅ Application data is intact and queryable
- ✅ UI/APIs respond correctly
- ✅ GitOps reconciliation completes without conflicts
- ✅ `velero.io/restore-name` label visible on resources

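A rough automated pass over the checkable criteria (a sketch; it assumes at least one Deployment exists in the namespace for the label check):

```bash
# Spot-check restore phase, pod status, and the velero restore label
echo "Restore phase: $(oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}')"
echo "Pods not in Running phase in $SRC_NS:"
oc -n $SRC_NS get pods --field-selector=status.phase!=Running --no-headers
echo "Restore label: $(oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}')"
```
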