Update RESTORE.md with tested in-place restore procedure
# Restore Procedure Runbook
## Overview
This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.

**Tested Applications:**

- **n8n** (monthly): Stateful app with PostgreSQL database. Has independent flow backups as safety net.
- **mailhog** (monthly): Stateless app for SMTP testing.
- **gitea** (one-shot validation): Full production restore to validate entire strategy.

## Prerequisites

- kubectl/oc access to cluster
- Velero CLI installed (optional but helpful)
- ArgoCD access to pause reconciliation
- Recent backup available for target namespace

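A quick preflight sketch that checks these prerequisites before starting (the `velero` CLI check is optional, matching the note above; the `openshift-adp` namespace matches `VELERO_NS` set below):

```bash
# Preflight: cluster access, optional velero CLI, and at least one existing backup
oc whoami || echo "WARN: not logged in to the cluster"
command -v velero >/dev/null && echo "velero CLI found" || echo "WARN: velero CLI not found (optional)"
oc -n openshift-adp get backup --no-headers | tail -5
```
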
## Set Variables

```bash
VELERO_NS=openshift-adp
SRC_NS=n8n # Namespace to restore (n8n, mailhog, etc)
TS=$(date +%Y%m%d-%H%M%S)
RESTORE_NAME=${SRC_NS}-restore-${TS}
BACKUP_NAME=daily-stateful-* # Use most recent backup, or specify exact name (see Step 3)
```

## Step 1: Pause GitOps Reconciliation

Pause ArgoCD to prevent it from recreating resources while we're testing restore:

```bash
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": []}}' --type merge

# Or via ArgoCD UI: Edit the Application, set Auto-Sync to Manual
```

> **Why**: GitOps will try to recreate namespaces/resources as they're deleted, interfering with the restore test.

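To confirm the pause took effect (assuming you used the `appproject` patch rather than the UI), read the field back; it should print an empty list:

```bash
# Empty output (or []) means no source namespaces are being reconciled
oc -n openshift-gitops get appproject infrastructure -o jsonpath='{.spec.sourceNamespaces}{"\n"}'
```
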
## Step 2: Delete Target Namespace

```bash
echo "Deleting namespace: $SRC_NS"
oc delete ns $SRC_NS --wait=true

# Verify it's gone
oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
```

> **Note**: PersistentVolumes and backups remain intact.

## Step 3: Get Latest Backup Name

```bash
# List recent backups for the namespace
velero backup get --filter=includedNamespaces=$SRC_NS

# Or via kubectl
BACKUP_NAME=$(oc -n $VELERO_NS get backup -o jsonpath='{.items[-1].metadata.name}')
echo "Using backup: $BACKUP_NAME"
```

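The `items[-1]` shortcut above returns whichever backup is listed last, which is not necessarily the newest one for `$SRC_NS`. A hedged alternative (assuming the `daily-stateful-` naming from Set Variables) sorts by creation time explicitly:

```bash
# Newest backup whose name matches the schedule prefix (assumption: daily-stateful-*)
BACKUP_NAME=$(oc -n $VELERO_NS get backup \
  --sort-by=.metadata.creationTimestamp -o name \
  | grep daily-stateful- | tail -1 | cut -d/ -f2)
echo "Using backup: $BACKUP_NAME"
```
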
## Step 4: Create Restore Resource

```bash
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: $RESTORE_NAME
  namespace: $VELERO_NS
spec:
  backupName: $BACKUP_NAME
  includeClusterResources: false
  includedNamespaces:
  - $SRC_NS
  restorePVs: true
  excludedResources:
  - routes.route.openshift.io # Routes are environment-specific
EOF
```

## Step 5: Monitor Restore Progress

```bash
# Watch restore status
watch -n 5 "oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}{\"\\n\"}'"

# When complete, check for errors
oc -n $VELERO_NS describe restore $RESTORE_NAME
```

**Expected phases:** New → InProgress → Completed (or PartiallyFailed)

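For unattended runs, a small polling loop can block until the restore reaches a terminal phase. This is only a sketch; the ~30-minute ceiling is an assumption, so adjust it to your backup size:

```bash
# Poll until the restore is Completed, PartiallyFailed, or Failed (assumed ceiling: ~30 min)
for i in $(seq 1 180); do
  PHASE=$(oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}')
  echo "Restore phase: ${PHASE:-<none>}"
  case "$PHASE" in
    Completed|PartiallyFailed|Failed) break ;;
  esac
  sleep 10
done
```
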
## Step 6: Validate Application Functionality

### For n8n:

```bash
# Wait for pods to be ready
oc -n $SRC_NS rollout status statefulset/postgres --timeout=5m
oc -n $SRC_NS rollout status deployment/n8n --timeout=5m

# Check if data is intact
oc -n $SRC_NS logs -l app.kubernetes.io/name=n8n -c n8n --tail=50 | grep -i "started\|error\|failed"

# Port-forward and test UI
oc -n $SRC_NS port-forward svc/n8n 5678:5678 &
sleep 2
curl -s http://localhost:5678/healthz | jq .
kill %1
```

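Log output only proves the service started; a database-level spot check confirms the workflow data itself came back. This is a sketch only: the `postgres` StatefulSet name comes from the rollout check above, but the `n8n` database/user and the `workflow_entity` table are assumptions about a default n8n-on-PostgreSQL install:

```bash
# Count workflows in the restored database (assumed: db/user "n8n", table "workflow_entity")
oc -n $SRC_NS exec statefulset/postgres -- \
  psql -U n8n -d n8n -t -c 'SELECT count(*) FROM workflow_entity;'
```
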
### For mailhog:

```bash
oc -n $SRC_NS rollout status deployment/mailhog --timeout=5m

# Verify service is responding
oc -n $SRC_NS port-forward svc/mailhog 1025:1025 8025:8025 &
sleep 2
curl -s http://localhost:8025/ | head -20
kill %1
```

### For any application:

```bash
# General validation
oc -n $SRC_NS get all
oc -n $SRC_NS get pvc
oc -n $SRC_NS get secrets
oc -n $SRC_NS get configmap

# Check velero labels (proof of restore)
oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}'
```

## Step 7: Resume GitOps Reconciliation

```bash
# Re-enable ArgoCD
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": ["*"]}}' --type merge

# Or via ArgoCD UI: Re-enable Auto-Sync

# Monitor for reconciliation
watch -n 5 "oc -n openshift-gitops get applications.argoproj.io -l argocd.argoproj.io/instance=infrastructure"
```

## Step 8: Monitor for Reconciliation Flapping

```bash
# Watch for any conflicts or drift
oc -n $SRC_NS get events --sort-by='.lastTimestamp' | tail -20

# Check if deployments are stable
oc -n $SRC_NS rollout status deployment/$SRC_NS --timeout=5m

# Verify no pending changes in ArgoCD
argocd app get infrastructure-$SRC_NS # Check sync status
```

> If you see repeated reconciliation or conflicts, check:
> - Are there immutable fields that changed?
> - Did Velero inject labels that conflict with Helm?
> - Is GitOps trying to scale/restart pods?

## Step 9: Cleanup

```bash
# Delete the restore resource
oc -n $VELERO_NS delete restore $RESTORE_NAME

# (Namespace stays running - that's the point!)
echo "✓ Restore test complete. $SRC_NS is now running from backup."
```

## Troubleshooting

### Restore shows PartiallyFailed

```bash
oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
velero restore logs $RESTORE_NAME # If Velero CLI is installed
```

### Pods stuck in Pending

```bash
oc -n $SRC_NS describe pod <pod-name>
oc -n $SRC_NS get pvc # Check if PVCs are bound
oc get pv | grep $SRC_NS
```

### Data looks wrong

- Check if you restored the correct backup (the command below shows which backup a restore actually used)
- For databases (n8n, postgres): Check logs for corruption warnings
- If corrupted: Re-delete namespace and restore from an earlier backup

## Testing Schedule

- **Monthly**: n8n and mailhog (in-place, validated)
- **One-shot after major changes**: Full application restores to validate strategy
- **After backup retention policy changes**: Restore oldest available backup to verify

## Success Criteria

- ✅ Namespace deleted cleanly
- ✅ Restore completes without PartiallyFailed
- ✅ All pods reach Running state
- ✅ Application data is intact and queryable
- ✅ UI/APIs respond correctly
- ✅ GitOps reconciliation completes without conflicts
- ✅ `velero.io/restore-name` label visible on resources

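A rough automated pass over the checkable criteria (a sketch; it assumes at least one Deployment exists in the namespace for the label check):

```bash
# Spot-check restore phase, pod status, and the velero restore label
echo "Restore phase: $(oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}')"
echo "Pods not in Running phase in $SRC_NS:"
oc -n $SRC_NS get pods --field-selector=status.phase!=Running --no-headers
echo "Restore label: $(oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}')"
```
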