Compare commits: f150c2bbd2...main (15 commits)
RESTORE.md (new file, 219 lines)
@@ -0,0 +1,219 @@
# Restore Procedure Runbook

## Overview

This runbook documents the tested disaster recovery procedure for restoring namespaces from OADP backups. The strategy is to perform in-place restores (replacing the original namespace) rather than parallel restores, which avoids resource conflicts and is simpler to validate.

**Tested Applications:**

- **n8n** (monthly): Stateful app with PostgreSQL database. Has independent flow backups as safety net.
- **mailhog** (monthly): Stateless app for SMTP testing.
- **gitea** (one-shot validation): Full production restore to validate entire strategy.

## Prerequisites

- kubectl/oc access to cluster
- Velero CLI installed (optional but helpful)
- ArgoCD access to pause reconciliation
- Recent backup available for target namespace
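
A quick preflight for these prerequisites (a minimal sketch; the `infrastructure` AppProject and `openshift-adp` namespace match the commands used later in this runbook):

```bash
# Confirm cluster access, the optional Velero CLI, ArgoCD object access, and recent backups
oc whoami
velero version --client-only 2>/dev/null || echo "Velero CLI not installed (optional)"
oc -n openshift-gitops get appproject infrastructure -o name
oc -n openshift-adp get backup --sort-by=.metadata.creationTimestamp | tail -5
```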

## Set Variables

```bash
VELERO_NS=openshift-adp
SRC_NS=n8n                    # Namespace to restore (n8n, mailhog, etc.)
TS=$(date +%Y%m%d-%H%M%S)
RESTORE_NAME=${SRC_NS}-restore-${TS}
BACKUP_NAME=daily-stateful-*  # Placeholder - the Restore needs an exact backup name (set in Step 3)
```

## Step 1: Pause GitOps Reconciliation

Pause ArgoCD to prevent it from recreating resources while the restore test is in progress:

```bash
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": []}}' --type merge

# Or via the ArgoCD UI: edit the Application and set Auto-Sync to Manual
```

> **Why**: GitOps will try to recreate namespaces/resources as they're deleted, interfering with the restore test.

## Step 2: Delete Target Namespace

```bash
echo "Deleting namespace: $SRC_NS"
oc delete ns $SRC_NS --wait=true

# Verify it's gone
oc get ns $SRC_NS 2>&1 | grep -i "not found" && echo "✓ Namespace deleted"
```

> **Note**: PersistentVolumes and backups remain intact.

## Step 3: Get Latest Backup Name

```bash
# Full list of backups (the Velero CLI has no namespace filter flag)
velero backup get

# Filter to backups whose spec includes the namespace
oc -n $VELERO_NS get backup -o json \
  | jq -r --arg ns "$SRC_NS" \
      '.items[] | select(.spec.includedNamespaces // [] | index($ns)) | .metadata.name'

# Pick the newest backup (confirm it actually covers $SRC_NS)
BACKUP_NAME=$(oc -n $VELERO_NS get backup \
  --sort-by=.metadata.creationTimestamp \
  -o jsonpath='{.items[-1].metadata.name}')
echo "Using backup: $BACKUP_NAME"
```

## Step 4: Create Restore Resource

```bash
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: $RESTORE_NAME
  namespace: $VELERO_NS
spec:
  backupName: $BACKUP_NAME
  includeClusterResources: false
  includedNamespaces:
    - $SRC_NS
  restorePVs: true
  excludedResources:
    - routes.route.openshift.io  # Routes are environment-specific
EOF
```

## Step 5: Monitor Restore Progress

```bash
# Watch restore status
watch -n 5 "oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}{\"\\n\"}'"

# When complete, check for errors
oc -n $VELERO_NS describe restore $RESTORE_NAME
```

**Expected phases:** New → InProgress → Completed (or PartiallyFailed)
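
For unattended runs, a small polling loop can stand in for `watch` (a sketch; these are the standard Velero restore phases, though newer Velero releases add intermediate ones):

```bash
# Poll until the restore reaches a terminal phase
until phase=$(oc -n $VELERO_NS get restore $RESTORE_NAME -o jsonpath='{.status.phase}'); \
      [[ "$phase" =~ ^(Completed|PartiallyFailed|Failed)$ ]]; do
  echo "Restore phase: ${phase:-New}"
  sleep 10
done
echo "Final phase: $phase"
```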

## Step 6: Validate Application Functionality

### For n8n:

```bash
# Wait for pods to be ready
oc -n $SRC_NS rollout status statefulset/postgres --timeout=5m
oc -n $SRC_NS rollout status deployment/n8n --timeout=5m

# Check if data is intact
oc -n $SRC_NS logs -l app.kubernetes.io/name=n8n -c n8n --tail=50 | grep -i "started\|error\|failed"

# Port-forward and test UI
oc -n $SRC_NS port-forward svc/n8n 5678:5678 &
sleep 2
curl -s http://localhost:5678/healthz | jq .
kill %1
```
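
For a deeper check than grepping logs, querying the restored database directly also works. A sketch: the `postgres` statefulset, `n8n` database, and `root` user match the n8n checkpoint hook in this repo's backup schedule, while `workflow_entity` is an assumption about n8n's table names:

```bash
# Count restored workflows straight from PostgreSQL
# (workflow_entity is an assumed n8n table name - adjust if the schema differs)
oc -n $SRC_NS exec statefulset/postgres -- \
  psql -d n8n -U root -c 'SELECT count(*) FROM workflow_entity;'
```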

### For mailhog:

```bash
oc -n $SRC_NS rollout status deployment/mailhog --timeout=5m

# Verify the service is responding
oc -n $SRC_NS port-forward svc/mailhog 1025:1025 8025:8025 &
sleep 2
curl -s http://localhost:8025/ | head -20
kill %1
```

### For any application:

```bash
# General validation
oc -n $SRC_NS get all
oc -n $SRC_NS get pvc
oc -n $SRC_NS get secrets
oc -n $SRC_NS get configmap

# Check Velero labels (proof of restore)
oc -n $SRC_NS get deployment -o jsonpath='{.items[0].metadata.labels.velero\.io/restore-name}'
```

## Step 7: Resume GitOps Reconciliation

```bash
# Re-enable ArgoCD
oc patch appproject infrastructure -n openshift-gitops \
  -p '{"spec": {"sourceNamespaces": ["*"]}}' --type merge

# Or via the ArgoCD UI: re-enable Auto-Sync

# Monitor for reconciliation
watch -n 5 "oc -n openshift-gitops get applications.argoproj.io -l argocd.argoproj.io/instance=infrastructure"
```

## Step 8: Monitor for Reconciliation Flapping

```bash
# Watch for any conflicts or drift
oc -n $SRC_NS get events --sort-by='.lastTimestamp' | tail -20

# Check if deployments are stable
oc -n $SRC_NS rollout status deployment/$SRC_NS --timeout=5m

# Verify no pending changes in ArgoCD
argocd app get infrastructure-$SRC_NS  # Check sync status
```

> If you see repeated reconciliation or conflicts, check:
> - Are there immutable fields that changed?
> - Did Velero inject labels that conflict with Helm? (see the label dump below)
> - Is GitOps trying to scale/restart pods?
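
For the Helm label question, dumping the Velero-injected labels next to the Helm chart metadata makes conflicts easy to spot (a sketch using the standard `velero.io/` and `helm.sh/` label keys):

```bash
# Show Velero-injected labels alongside Helm chart metadata for each deployment
oc -n $SRC_NS get deployment -o json | jq '
  .items[] | {
    name: .metadata.name,
    velero: ((.metadata.labels // {}) | with_entries(select(.key | startswith("velero.io/")))),
    helm_chart: ((.metadata.labels // {})["helm.sh/chart"] // "n/a")
  }'
```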

## Step 9: Cleanup

```bash
# Delete the restore resource
oc -n $VELERO_NS delete restore $RESTORE_NAME

# (Namespace stays running - that's the point!)
echo "✓ Restore test complete. $SRC_NS is now running from backup."
```

## Troubleshooting

### Restore shows PartiallyFailed

```bash
oc -n $VELERO_NS describe restore $RESTORE_NAME | grep -A 50 "Status:"
velero restore logs $RESTORE_NAME  # If the Velero CLI is installed
```

### Pods stuck in Pending

```bash
oc -n $SRC_NS describe pod <pod-name>
oc -n $SRC_NS get pvc  # Check if PVCs are bound
oc get pv | grep $SRC_NS
```

### Data looks wrong

- Check that you restored the correct backup
- For databases (n8n, postgres): check the logs for corruption warnings
- If corrupted: re-delete the namespace and restore from an earlier backup (see the listing below)
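
To pick an earlier backup, list them oldest first with their phases (standard kubectl sorting; rerun Steps 2-5 with the chosen name):

```bash
# List backups oldest first so an earlier known-good one is easy to spot
oc -n $VELERO_NS get backup \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,COMPLETED:.status.completionTimestamp
```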

## Testing Schedule

- **Monthly**: n8n and mailhog (in-place, validated)
- **One-shot after major changes**: full application restores to validate the entire strategy
- **After backup retention policy changes**: restore the oldest available backup to verify

## Success Criteria

✅ Namespace deleted cleanly
✅ Restore completes without PartiallyFailed
✅ All pods reach Running state
✅ Application data is intact and queryable
✅ UI/APIs respond correctly
✅ GitOps reconciliation completes without conflicts
✅ velero.io/restore-name label visible on resources

@@ -6,7 +6,8 @@ metadata:
   name: monthly-restore-test
   namespace: openshift-adp
 spec:
+  timeZone: "Australia/Sydney"
   schedule: "0 06 15 * *"  # 15th of month, 6 AM
   concurrencyPolicy: Forbid
   successfulJobsHistoryLimit: 3
   failedJobsHistoryLimit: 3
@@ -20,44 +21,45 @@ spec:
           serviceAccountName: velero
           restartPolicy: OnFailure
           containers:
             - name: restore-test
               image: quay.io/konveyor/velero:latest
               env:
                 - name: VELERO_NAMESPACE
                   value: openshift-adp
               command:
                 - /bin/bash
                 - -c
                 - |
                   set -e

                   echo "=== Velero Restore Test ==="
                   echo "Date: $(date)"

                   # Get latest daily-config backup
                   CONFIG_BACKUP=$(velero backup get --selector="backup-type=config" \
                     -o json | jq -r '.items[0].metadata.name')

                   # Get latest daily-stateful backup
                   STATEFUL_BACKUP=$(velero backup get --selector="backup-type=stateful" \
                     -o json | jq -r '.items[0].metadata.name')

                   echo "Latest config backup: $CONFIG_BACKUP"
                   echo "Latest stateful backup: $STATEFUL_BACKUP"

                   # Verify backups are successful
                   CONFIG_STATUS=$(velero backup get $CONFIG_BACKUP -o json | \
                     jq -r '.status.phase')
                   STATEFUL_STATUS=$(velero backup get $STATEFUL_BACKUP -o json | \
                     jq -r '.status.phase')

                   echo "Config backup status: $CONFIG_STATUS"
                   echo "Stateful backup status: $STATEFUL_STATUS"

                   if [ "$CONFIG_STATUS" != "Completed" ] || [ "$STATEFUL_STATUS" != "Completed" ]; then
                     echo "ERROR: Backups not in Completed state"
                     exit 1
                   fi
+
                   echo "=== Test Passed ==="
                   echo "All backups verified successfully"

@@ -6,7 +6,7 @@ metadata:
   name: daily-config
   namespace: openshift-adp
 spec:
-  schedule: "0 02 * * *"  # 2 AM daily
+  schedule: "CRON_TZ=Australia/Sydney 0 02 * * *"  # 2 AM daily

   # Make backups readable, sortable, unique
   #nameTemplate: "{{ .ScheduleName }}-{{ .Timestamp }}"

@@ -33,11 +33,11 @@ spec:
     podConfig:
       resourceAllocations:
         limits:
-          cpu: "1"       # Increased for database compression
-          memory: "1Gi"  # Increased for larger chunks
+          cpu: 1
+          memory: 1Gi
         requests:
-          cpu: "200m"
-          memory: "512Mi"
+          cpu: 200m
+          memory: 512Mi
   velero:
     defaultPlugins:
       - openshift

@@ -49,10 +49,10 @@ spec:
     podConfig:
       resourceAllocations:
         limits:
-          cpu: "500m"
-          memory: "512Mi"
+          cpu: 1
+          memory: 2Gi
         requests:
-          cpu: "100m"
-          memory: "256Mi"
+          cpu: 100m
+          memory: 512Mi

   logFormat: text

@@ -6,7 +6,7 @@ metadata:
   name: daily-stateful
   namespace: openshift-adp
 spec:
-  schedule: "0 03 * * *"  # 3 AM daily (after config backup)
+  schedule: "CRON_TZ=Australia/Sydney 0 03 * * *"  # 3 AM daily (after config backup)

   #nameTemplate: "{{ .ScheduleName }}-{{ .Timestamp }}"

@@ -20,6 +20,8 @@ spec:
     - n8n
     - apim
     - gitea-ci
+    - openclaw
+    - clawdbox

   #labels:
   #  backup-type: stateful

@@ -32,8 +34,6 @@ spec:
     - events.events.k8s.io
     - pipelineruns.tekton.dev
     - taskruns.tekton.dev
-    - replicasets.apps
-    - pods

   # Use Kopia for volume backups
   snapshotVolumes: false

@@ -63,23 +63,23 @@ spec:
             onError: Continue

     # Gitea PostgreSQL: checkpoint before backup
-    - name: gitea-postgres-checkpoint
-      includedNamespaces:
-        - gitea
-      labelSelector:
-        matchLabels:
-          app.kubernetes.io/name: postgresql
-          app.kubernetes.io/instance: gitea
-      pre:
-        - exec:
-            container: postgresql
-            command:
-              - /bin/bash
-              - -c
-              - psql -U postgres -c 'CHECKPOINT;'
-            timeout: 2m
-            onError: Continue
+    #- name: gitea-postgres-checkpoint
+    #  includedNamespaces:
+    #    - gitea
+    #  labelSelector:
+    #    matchLabels:
+    #      app.kubernetes.io/name: postgresql
+    #      app.kubernetes.io/instance: gitea
+    #  pre:
+    #    - exec:
+    #        container: postgresql
+    #        command:
+    #          - /bin/bash
+    #          - -c
+    #          - PGPASSWORD=spVTpND34K psql -U postgres -c 'CHECKPOINT;'
+    #        timeout: 2m
+    #        onError: Continue

     # Authentik PostgreSQL: checkpoint before backup
     - name: authentik-postgres-checkpoint
       includedNamespaces:

@@ -94,6 +94,23 @@ spec:
             command:
               - /bin/bash
              - -c
-              - psql -U postgres -c 'CHECKPOINT;'
+              - PGPASSWORD=th1rt33nletterS. psql -U authentik -c 'CHECKPOINT;'
+            timeout: 2m
+            onError: Continue
+
+    # n8n PostgreSQL: checkpoint before backup
+    - name: n8n-postgres-checkpoint
+      includedNamespaces:
+        - n8n
+      labelSelector:
+        matchLabels:
+          app.kubernetes.io/service: postgres-n8n
+      pre:
+        - exec:
+            container: postgres
+            command:
+              - /bin/bash
+              - -c
+              - psql -d n8n -U root -c 'CHECKPOINT;'
             timeout: 2m
             onError: Continue