
The Alliance Fleet Backup Architecture: Snapshots, Local Storage, and Azure Offsite

A three-layer backup strategy for a three-node Proxmox cluster. Proxmox snapshots with retention tiers, nightly vzdump to a local external drive on Node-B, and an azcopy pipeline pushing everything to Azure Blob Storage at 3AM. The bugs that nearly killed the first run are worth documenting.

Why this matters beyond the homelab: This is the same 3-2-1 backup architecture used in enterprise environments: local snapshots for fast rollback, on-site secondary for same-day recovery, offsite cloud for disaster recovery. The Azure piece maps directly to AZ-104: storage accounts, blob containers, SAS token scoping, access tiers, and lifecycle management policies.


The Problem With One Backup Location

For most of 2025, the Alliance Fleet had one backup: nightly vzdump to an external drive on Node-B. That is not a backup strategy. That is a single point of failure with extra steps.

One drive failure, one accidental rm -rf, one Node-B hardware problem: everything is gone. Authentik configuration, Vaultwarden entries, n8n workflows, InfluxDB time-series data. Years of homelab work. Gone.

The fix was building a proper three-layer architecture before something actually broke.


Layer 1: Proxmox Snapshots

Snapshots are not backups. They live on the same storage as the VM. A disk failure takes the VM and the snapshot at the same time. But they are the fastest recovery path for the most common failure: a bad update, a misconfigured service, a “what did I just do” moment.

The fleet uses four retention tiers:

Tier 1 (7-14 days): Authentik, Vaultwarden, ALLIANCE-DC01, InfluxDB. These hold identity data, credentials, and time-series telemetry. Losing them means hours of recovery work minimum.

Tier 2 (3-5 days): NPM, Wazuh, n8n, Grafana, Tantive-III AI stack, Home Assistant, BD-1. Meaningful state but rebuildable within a day if needed.

Tier 3 (pre-update only): AdGuard, K-2SO, RustDesk, Telegraf agents. Stateless or near-stateless. Take a snapshot before touching them, delete it after confirming the update held.

Tier 4 (none): Canto-Bight (gaming VM), ComfyUI, the Ghost Squadron Raspberry Pis. Fully rebuildable from scratch in under an hour.
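The tier windows above reduce to a simple age check. This is a minimal sketch, assuming snapshots are named with a trailing YYYYMMDD date (a hypothetical convention, not something the fleet is confirmed to use) and that GNU date is available, as it is on Proxmox. The function name is made up for illustration.

```shell
# Hypothetical helper: decide whether a dated snapshot has aged out of its tier.
# Assumes names end in -YYYYMMDD (e.g. auto-20260501); that naming scheme is an
# illustration only. Requires GNU date for the -d option.
snapshot_expired() (
  name=$1; keep_days=$2; today=${3:-$(date -u +%Y%m%d)}
  stamp=${name##*-}                       # text after the last dash
  to_epoch() {                            # YYYYMMDD -> seconds since epoch (UTC)
    date -u -d "$(printf '%s' "$1" | sed 's/^\(....\)\(..\)\(..\)$/\1-\2-\3/')" +%s
  }
  age_days=$(( ($(to_epoch "$today") - $(to_epoch "$stamp")) / 86400 ))
  [ "$age_days" -gt "$keep_days" ]        # exit 0 means "older than the window"
)

# Upper bounds of Tier 1 (14 days) and Tier 2 (5 days), with a fixed "today"
snapshot_expired "auto-20260501" 14 20260516 && echo "Tier 1 snapshot can be pruned"
snapshot_expired "auto-20260514" 5 20260516 || echo "Tier 2 snapshot still in window"
```

Passing "today" as an argument keeps the check repeatable; leave it off to compare against the current date.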

Node-B’s ZFS pool makes snapshots nearly free in terms of storage overhead. ZFS is copy-on-write, so a snapshot only consumes space for blocks that change after it is taken. That makes Node-B the right home for Tier 1 workloads.

The snapshot command is the same for every VM:

qm snapshot <vmid> <snapname> --description "pre-update $(date +%Y-%m-%d)"

For LXC containers:

pct snapshot <ctid> <snapname> --description "pre-update $(date +%Y-%m-%d)"
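The Tier 3 routine (snapshot, update, delete once the update holds) can be wrapped in a couple of helpers. This is a sketch, not the fleet's actual tooling: the function names and the container ID 105 are hypothetical, and DRY_RUN defaults to printing the commands rather than running them. qm delsnapshot and pct delsnapshot are the matching removal subcommands.

```shell
# Sketch of the Tier 3 flow: snapshot before an update, drop it once the update holds.
# DRY_RUN defaults to 1 so commands are only printed; set DRY_RUN=0 on a Proxmox node.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# kind is "vm" (handled by qm) or "ct" (handled by pct)
guest_tool() {
  case "$1" in
    vm) echo qm ;;
    ct) echo pct ;;
    *)  echo "unknown guest kind: $1" >&2; return 1 ;;
  esac
}

pre_update_snapshot() {
  tool=$(guest_tool "$1") || return 1
  run "$tool" snapshot "$2" "$3" --description "pre-update $(date +%Y-%m-%d)"
}

drop_snapshot() {
  tool=$(guest_tool "$1") || return 1
  run "$tool" delsnapshot "$2" "$3"
}

# Example with a hypothetical container ID
pre_update_snapshot ct 105 preupdate
drop_snapshot ct 105 preupdate
```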

Layer 2: Local vzdump to Node-B

All three nodes run nightly vzdump jobs at midnight, pushing compressed backups to /mnt/backup-ext/dump/ on Node-B. This is the external drive layer: physically separate from the node SSDs, fast to restore from, and large enough to hold a rolling window of backups.

Retention settings across the fleet:

Tantive-III (the AI stack VM) is the exception. At 67GB per vzdump archive, keeping multiples ate through the drive fast. It is now excluded from the nightly job entirely. If Tantive-III needs rebuilding, the Ollama models can be re-downloaded and the configuration is in git. That is an acceptable tradeoff for 67GB of recovered space.
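For reference, an exclusion like this can be expressed on the vzdump command line. The sketch below only prints the command. The VMID 201 comes from the disk-alert bug later in this post; the zstd compression, the keep-last=3 retention, and the use of --dumpdir rather than a named Proxmox storage are assumptions for illustration, not the fleet's confirmed settings.

```shell
# Dry-run sketch of a nightly job that skips VM 201 (Tantive-III).
# keep-last and zstd are placeholder values, not the fleet's real settings.
build_vzdump_cmd() {
  exclude_ids=$1     # comma-separated VMIDs to skip
  keep_last=$2       # how many archives to keep per guest
  echo vzdump --all --exclude "$exclude_ids" \
    --dumpdir /mnt/backup-ext/dump \
    --mode snapshot --compress zstd \
    --prune-backups "keep-last=$keep_last"
}

build_vzdump_cmd 201 3
```

If the jobs are defined through the Proxmox scheduler instead, the same exclusion is set on the job rather than on a command line.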


Layer 3: Azure Offsite via azcopy

This is the disaster recovery layer. The local drive fails, the house floods, Node-B dies: the Azure copy survives.

The architecture:

Node-B /mnt/backup-ext/dump/
    └─ azcopy push at 3AM daily
         └─ Azure Storage Account: alliancefleetbackup
              └─ Container: alliance-fleet-backups
                   ├─ /dump/     (VM and LXC vzdump files)
                   └─ /postgres/ (pg_dump exports from Home One)

Azure storage configuration:

The push script at /usr/local/bin/azure-backup-push.sh:

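#!/bin/bash
# Push the local vzdump directory to Azure Blob Storage and log the outcome.
LOGFILE="/var/log/azure-backup.log"
SAS_TOKEN="<token-from-1password>"   # must start with "?"
DEST_BASE="https://alliancefleetbackup.blob.core.windows.net/alliance-fleet-backups"

echo "$(date): Starting Azure backup push" >> "$LOGFILE"

# Absolute path to azcopy because cron's PATH does not include /usr/local/bin
/usr/local/bin/azcopy copy \
  "/mnt/backup-ext/dump/" \
  "${DEST_BASE}/dump/${SAS_TOKEN}" \
  --recursive \
  --log-level INFO >> "$LOGFILE" 2>&1
rc=$?

# Record success or failure instead of logging "complete" unconditionally
if [ "$rc" -eq 0 ]; then
  echo "$(date): Push complete" >> "$LOGFILE"
else
  echo "$(date): Push FAILED (azcopy exit code $rc)" >> "$LOGFILE"
fi
exit "$rc"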

Cron entry in root’s crontab on QCM1255:

0 3 * * * /usr/local/bin/azure-backup-push.sh

The Bugs That Nearly Killed the First Run

Getting this pipeline to a clean 3AM completion took several sessions of debugging. The bugs are worth documenting because they are the kind that do not show up in any tutorial.

Bug 1: Heredoc backslash line continuation. The first version of the script was written through a heredoc and used backslash line continuations. The shell interpreted --recursive \ and --delete-destination \ as standalone commands rather than as continued flags. The fix is to put all azcopy flags on a single line: no backslash continuations inside heredocs. Continuations in a script file edited directly, as in the version above, are fine.
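A minimal sketch of that fix, assuming the script is generated with cat and a heredoc. Quoting the delimiter ('EOF') stores the body literally, so nothing is expanded or rewritten at write time, and bash -n gives a parse-only check before cron ever runs it. The temp file stands in for the real path under /usr/local/bin.

```shell
# Write a small script through a QUOTED heredoc so the body is stored literally:
# no variable expansion and no backslash processing at write time.
script=$(mktemp)

cat > "$script" <<'EOF'
#!/bin/bash
LOGFILE="/var/log/azure-backup.log"
echo "$(date): Starting Azure backup push" >> "$LOGFILE"
/usr/local/bin/azcopy copy "/mnt/backup-ext/dump/" "${DEST_BASE}/dump/${SAS_TOKEN}" --recursive --log-level INFO >> "$LOGFILE" 2>&1
EOF

# Parse-only check: catches syntax errors without running anything
bash -n "$script" && echo "syntax OK: $script"

# Count lines that carry the azcopy flags together (expect 1)
grep -c -- '--recursive --log-level INFO' "$script"
```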

Bug 2: SSH session drops killing the job. Running azcopy in a foreground SSH session meant any connection drop killed the process mid-transfer. The fix is tmux. Run tmux new -s backup, start the job inside the session, detach with Ctrl+B D. The job survives the SSH drop.

Bug 3: Cron PATH issue. The cron job ran but reported azcopy: command not found. Cron runs with a minimal PATH that does not include /usr/local/bin. The fix is to use the full path in the cron script: /usr/local/bin/azcopy instead of azcopy.
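This failure can be reproduced before cron hits it. The sketch below runs a lookup under a stripped environment with PATH set to /usr/bin:/bin, the usual cron default on Debian-based systems such as Proxmox. The function name is hypothetical, and the PATH value is an assumption if the crontab or /etc/crontab sets its own.

```shell
# Check whether a command name resolves under a cron-like environment.
# env -i clears the environment so only the minimal PATH is visible.
resolves_under_cron_path() {
  env -i PATH=/usr/bin:/bin /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}

for cmd in sh azcopy; do
  if resolves_under_cron_path "$cmd"; then
    echo "$cmd: found on cron PATH"
  else
    echo "$cmd: NOT found, call it by absolute path in the script"
  fi
done
```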

Bug 4: Duplicate cron entry. After fixing the PATH issue, there were two cron entries: one pointing to /usr/local/bin/azure-backup-push.sh and an old one pointing to /opt/scripts/azure-backup-push.sh that had never been cleaned up. The second one ran, failed silently, and logged nothing because the path did not exist. Removed with crontab -e.
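Stale entries like this are easy to miss by eye. A small sketch of a check, with a hypothetical function name, that reads crontab text on stdin and flags any entry whose command is an absolute path that is missing or not executable. It handles the plain five-field format only.

```shell
# Print crontab entries whose command is an absolute path that is missing
# or not executable. Skips blanks, comments, @reboot-style lines and VAR=value.
find_stale_cron_entries() {
  while read -r min hour dom mon dow cmd rest; do
    case "$min" in
      ''|'#'*|@*|*=*) continue ;;
    esac
    case "$cmd" in
      /*) [ -x "$cmd" ] || echo "stale: $cmd" ;;
    esac
  done
}

# On the node itself: crontab -l | find_stale_cron_entries
printf '%s\n' \
  '0 3 * * * /bin/sh -c true' \
  '0 3 * * * /opt/scripts/azure-backup-push.sh' | find_stale_cron_entries
```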

Bug 5: Grafana disk alert at 92-93%. After the first successful push, the disk alert fired. The issue was VM 201 (Tantive-III) generating 67GB vzdump archives daily while azcopy held open file handles on backups that had already been deleted. The deleted files showed up in lsof | grep deleted | grep azcopy, and their space was not released until azcopy exited. The fix: kill the azcopy process to release the file handles, then remove Tantive-III from the backup job as described above.
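The same information is available from /proc when lsof is not installed. This is a Linux-only sketch with a hypothetical function name: it walks every process's open file descriptors and prints those whose target has been deleted, optionally filtered by process name. Run it as root to see other users' processes.

```shell
# List open-but-deleted files by walking /proc.
# Output format: PID, process name, then the link target ending in "(deleted)".
list_deleted_open_files() {
  want=${1:-}
  for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      *' (deleted)') ;;
      *) continue ;;
    esac
    pid=${fd#/proc/}; pid=${pid%%/*}
    comm=$(cat "/proc/$pid/comm" 2>/dev/null) || continue
    if [ -z "$want" ] || [ "$comm" = "$want" ]; then
      echo "$pid $comm $target"
    fi
  done
}

# Space held by azcopy is only returned once these handles close
list_deleted_open_files azcopy
```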

The first clean 3AM run: May 13, 2026. Completed in approximately 4 hours with zero failures. WAN upload speed on Spectrum (~40-50 Mbps sustained) is the bottleneck, not the pipeline.
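As a sanity check on those numbers, the arithmetic below shows how much data a link moves in four hours at each end of that range. It assumes the upload ran at the sustained rate for the whole window and uses decimal gigabytes; the total size of that first push is not recorded here, so this is only what the stated speed and duration imply.

```shell
# Decimal gigabytes moved at a sustained rate:
# Mbps * seconds / 8 bits per byte / 1000 MB per GB
transfer_gb() {
  mbps=$1; hours=$2
  echo $(( mbps * hours * 3600 / 8 / 1000 ))
}

echo "40 Mbps for 4h: $(transfer_gb 40 4) GB"   # 72 GB
echo "50 Mbps for 4h: $(transfer_gb 50 4) GB"   # 90 GB
```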


Why azcopy Over Other Tools

A few alternatives were considered before landing on azcopy.

Rclone is the most common recommendation for syncing to cloud storage from Linux. It works. The reason azcopy won is that Microsoft maintains it specifically for Azure Blob Storage, it handles large file transfers and resumable uploads natively, and it authenticates directly via SAS token, which was already configured. No extra configuration layer between the tool and the storage account.

Azure Backup (the managed service) was considered and rejected. For a homelab, paying for the Azure Backup vault and managed agent adds cost and complexity that azcopy eliminates. The goal is an offsite copy of vzdump files, not a full managed backup service.

Backblaze B2 was in the original plan before the decision to consolidate everything in Azure, given the AZ-104 study path and the existing Azure tenant for the homelab.


SAS Token Note

The SAS token must include the leading ? when appended to the destination URL. Without it, azcopy returns an authentication error that looks like a permissions issue rather than a URL formatting issue. The token is stored in 1Password under “Azure Storage - alliancefleetbackup” and expires April 15, 2027. Rotation runbook: generate a new token with the same permissions in the Azure portal, update the script, update 1Password.
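Both points can be guarded in the script itself. This is a sketch with hypothetical function names and a made-up token: one helper guarantees the leading ?, the other reads the expiry date from the token's se= parameter so a rotation reminder can be built on top of it.

```shell
# Normalize a SAS token so it always carries a leading "?"
normalize_sas() {
  case "$1" in
    \?*) printf '%s\n' "$1" ;;
    *)   printf '?%s\n' "$1" ;;
  esac
}

# se= holds the expiry timestamp; keep only the YYYY-MM-DD part
sas_expiry_date() {
  printf '%s\n' "$1" | tr '?&' '\n\n' | sed -n 's/^se=\([0-9-]*\).*/\1/p'
}

# Made-up token for illustration; sig is not a real signature
token='sv=2022-11-02&se=2027-04-15T00%3A00%3A00Z&sp=rwl&sig=EXAMPLE'
echo "https://alliancefleetbackup.blob.core.windows.net/alliance-fleet-backups/dump/$(normalize_sas "$token")"
sas_expiry_date "$token"    # 2027-04-15
```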

