# Restore Procedure Runbook

## Overview

This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.

**Tested Applications:**
- **n8n** (monthly): Stateful app with PostgreSQL database. Has independent flow backups as safety net.
- **mailhog** (monthly): Stateless app for SMTP testing.
- **gitea** (one-shot validation): Full production restore to validate entire strategy.

## Prerequisites

- kubectl/oc access to cluster
- Velero CLI installed (optional but helpful)
- ArgoCD access to pause reconciliation
- Recent backup available for target namespace

## Set Variables

```bash
VELERO_NS=openshift-adp
SRC_NS=n8n              # Namespace to restore (n8n, mailhog, etc)
TS=$(date +%Y%m%d-%H%M%S)
RESTORE_NAME=${SRC_NS}-restore-${TS}
BACKUP_NAME=daily-stateful-*  # Use most recent backup, or specify exact name
```

## Step 1: Pause GitOps Reconciliation

Pause ArgoCD to prevent it from recreating resources while we're testing restore:

```bash
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": []}}' --type merge

# Or via ArgoCD UI: Edit the Application, set Auto-Sync to Manual
```

> **Why**: GitOps will try to recreate namespaces/resources as they're deleted, interfering with the restore test.

## Step 2: Delete Target Namespace

```bash
echo "Deleting namespace: $SRC_NS"
oc delete ns $SRC_NS --wait=true

# Verify it's gone
oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
```

> **Note**: PersistentVolumes and backups remain intact.

## Step 3: Get Latest Backup Name

```bash
# List recent backups for the namespace
velero backup get --filter=includedNamespaces=$SRC_NS

# Or via kubectl
BACKUP_NAME=$(oc -n $VELERO_NS get backup -o jsonpath='{.items[-1].metadata.name}')
echo "Using backup: $BACKUP_NAME"
```

## Step 4: Create Restore Resource

```bash
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: $RESTORE_NAME
  namespace: $VELERO_NS
spec:
  backupName: $BACKUP_NAME
  includeClusterResources: false
  includedNamespaces:
    - $SRC_NS
  restorePVs: true
  excludedResources:
    - routes.route.openshift.io  # Routes are environment-specific
EOF
```

## Step 5: Monitor Restore Progress

```bash
# Watch restore status
watch -n 5 "oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}{\"\\n\"}'"

# When complete, check for errors
oc -n $VELERO_NS describe restore $RESTORE_NAME
```

**Expected phases:** New → InProgress → Completed (or PartiallyFailed)

## Step 6: Validate Application Functionality

### For n8n:

```bash
# Wait for pods to be ready
oc -n $SRC_NS rollout status statefulset/postgres --timeout=5m
oc -n $SRC_NS rollout status deployment/n8n --timeout=5m

# Check if data is intact
oc -n $SRC_NS logs -l app.kubernetes.io/name=n8n -c n8n --tail=50 | grep -i "started\|error\|failed"

# Port-forward and test UI
oc -n $SRC_NS port-forward svc/n8n 5678:5678 &
sleep 2
curl -s http://localhost:5678/healthz | jq .
kill %1
```

### For mailhog:

```bash
oc -n $SRC_NS rollout status deployment/mailhog --timeout=5m

# Verify service is responding
oc -n $SRC_NS port-forward svc/mailhog 1025:1025 8025:8025 &
sleep 2
curl -s http://localhost:8025/ | head -20
kill %1
```

### For any application:

```bash
# General validation
oc -n $SRC_NS get all
oc -n $SRC_NS get pvc
oc -n $SRC_NS get secrets
oc -n $SRC_NS get configmap

# Check velero labels (proof of restore)
oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}'
```

## Step 7: Resume GitOps Reconciliation

```bash
# Re-enable ArgoCD
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": ["*"]}}' --type merge

# Or via ArgoCD UI: Re-enable Auto-Sync

# Monitor for reconciliation
watch -n 5 "oc -n openshift-gitops get applications.argoproj.io -l argocd.argoproj.io/instance=infrastructure"
```

## Step 8: Monitor for Reconciliation Flapping

```bash
# Watch for any conflicts or drift
oc -n $SRC_NS get events --sort-by='.lastTimestamp' | tail -20

# Check if deployments are stable
oc -n $SRC_NS rollout status deployment/$SRC_NS --timeout=5m

# Verify no pending changes in ArgoCD
aoc app get infrastructure-$SRC_NS  # Check sync status
```

> If you see repeated reconciliation or conflicts, check:
> - Are there immutable fields that changed?
> - Did Velero inject labels that conflict with Helm?
> - Is GitOps trying to scale/restart pods?

## Step 9: Cleanup

```bash
# Delete the restore resource
oc -n $VELERO_NS delete restore $RESTORE_NAME

# (Namespace stays running - that's the point!)
echo "✓ Restore test complete. $SRC_NS is now running from backup."
```

## Troubleshooting

### Restore shows PartiallyFailed

```bash
oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
velero restore logs $RESTORE_NAME  # If Velero CLI is installed
```

### Pods stuck in Pending

```bash
oc -n $SRC_NS describe pod <pod-name>
oc -n $SRC_NS get pvc  # Check if PVCs are bound
oc get pv | grep $SRC_NS
```

### Data looks wrong

- Check if you restored the correct backup
- For databases (n8n, postgres): Check logs for corruption warnings
- If corrupted: Re-delete namespace and restore from earlier backup

## Testing Schedule

- **Monthly**: n8n and mailhog (in-place, validated)
- **One-shot after major changes**: Full application restores to validate strategy
- **After backup retention policy changes**: Restore oldest available backup to verify

## Success Criteria

✅ Namespace deleted cleanly  
✅ Restore completes without PartiallyFailed  
✅ All pods reach Running state  
✅ Application data is intact and queryable  
✅ UI/APIs respond correctly  
✅ GitOps reconciliation completes without conflicts  
✅ velero.io/restore-name label visible on resources