Backup System

I’ve been tinkering away at a backup system since 2022-02-15. At that time, my backup “system” was a mess: an external hard drive with a loose collection of backup files scattered over the past few years. I wasn’t sure what was in any of the backups. How would I restore any of this? Who knows? I didn’t document anything when I backed this stuff up.

google-drive-backup.zip
lightbox-backup-2021-07-06
loopland-backup-2020-10-16
meg-photo-backup-2020-07-05
officebox-backup-2021-07-16
pgadey-website-syncthing-backup-2021-03-27
skatolo.backup-2019-11.tar.gz
work-syncthing-backup-2021-03-27
zenbook-backup-2020-03.tar.gz
zenbook-backup-2021-07-07

So, I’m writing up some documentation about how my current system works. This write-up has two purposes: to help me clarify my thinking about the system, and to explain to the reader how to set up such a system. Setting up this backup system, and writing all my own tooling around it, was very educational, and I encourage any Linux beginner to undertake a similar project.

ZFS in Ten Seconds

The whole backup system is based on ZFS, a file system developed by Sun Microsystems in the early 2000s. Many people have called it “the last word in filesystems”. It has a lot of amazing features, but my backup system only uses two of them:

snapshots: cheap, read-only, point-in-time copies of a dataset
replication: zfs send and zfs receive, which ship snapshots between pools
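
Here is roughly what those two features look like at the command line (the pool and snapshot names here are made up for illustration):

 # take a cheap, read-only, point-in-time snapshot of a dataset
 zfs snapshot tank@2022-09-04

 # replicate that snapshot into another pool
 zfs send tank@2022-09-04 | zfs receive backuppool/tank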

Archival, Offsite, and Offline Backups

There are three types of backups that I store: archival, offsite, and offline. The archival copies of the data are stored on my main desktop computer officebox, which is located at work. The archival record gets updated every fifteen minutes. I’ve only ever used it to restore files that were deleted accidentally while working.

To protect against a disk failure (or worse) on officebox, I have offsite backups that are stored on a computer oldbox in the basement of my apartment. These get updated on a weekly basis.

Finally, to protect against both officebox and oldbox failing, I have offline backups that I keep with my family on the East Coast. These will get updated on an annual basis.

The Current Setup

There are six major components to the system:

File Hierarchy
Automatic Snapshots
Offsite Backups
Offline Backups
Scrubbing, Logging, and Notification
Archiving the Servers

This might sound somewhat sophisticated, but it is actually a bunch of hack-y bash scripts. It relies heavily on cron, even though cron has been deprecated since prehistoric times.

File Hierarchy

Soren Bjorstad has a nice post on Getting Your Filesystem Hierarchy Less Wrong. Really thinking through my file hierarchy and deciding what needed to be backed up and archived was a whole task in itself.

officebox has a datapool mounted to /archive/ that contains the following:

officebox:/archive/
├── backup-log.txt
├── Family
├── Google
├── Hugo
├── Music
├── Personal
├── Phones
├── Pictures
├── PROCESS
├── Servers
├── Trips
├── Video
└── Work

oldbox has a similar hierarchy mounted to /external/ which contains:

oldbox:/external/
├── Dotfiles
├── Family
├── Hugo
├── Music
├── Personal
├── Phones
├── Pictures
├── Servers
├── Trips
├── Video
└── Work

It is worth noting that Dotfiles is present on oldbox but not on officebox. For the details of why this is, see how I got Locked Out of My Dotfiles and other highlights in the Blooper Reel of False Starts.

Automatic Snapshots

The automatic snapshots are done on officebox by zfs-auto-snapshot. Snapshots at several different frequencies are run as cron jobs:

 /etc/cron.d/zfs-auto-snapshot
 /etc/cron.daily/zfs-auto-snapshot
 /etc/cron.hourly/zfs-auto-snapshot
 /etc/cron.monthly/zfs-auto-snapshot
 /etc/cron.weekly/zfs-auto-snapshot

A typical example looks something like the following script for daily backups:

#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=daily --keep=31 //

The trailing double slash means that zfs-auto-snapshot will snapshot all ZFS datapools. The --keep=31 flag means that it will keep a rolling list of thirty-one snapshots. In principle, one could ask for more granularity. This script produces backup snapshots named like the following:

/archive/.zfs/snapshot/zfs-auto-snap_daily-2022-09-04-1136
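
Restoring an accidentally deleted file is then just a copy out of the hidden .zfs directory. For example (the file name here is hypothetical):

 # snapshots are browsable as ordinary read-only directory trees
 cp /archive/.zfs/snapshot/zfs-auto-snap_daily-2022-09-04-1136/Work/notes.txt /archive/Work/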

It turns out that snapshotting all ZFS datapools made some things difficult. For example, I wanted to plug in an external hard drive with a ZFS datapool on it and use it without adding to the external drive’s snapshot chronology. So, I had to switch zfs-auto-snapshot to only snapshot internalpool.

Now, the daily backup zfs-auto-snapshot script looks like this:

#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

# create a rotating daily backup
# (no exec here: exec would replace the shell, and the second snapshot would never run)
zfs-auto-snapshot --quiet --syslog --label=daily --keep=31 internalpool

# create the common ancestor snapshot for offsite and offline backups
exec zfs-auto-snapshot --quiet --syslog --prefix=offsite --label=daily internalpool

This produces an additional series of daily backups intended to synchronize with the offsite backup. They are named as follows:

internalpool@offsite_daily-2022-11-24-0736

This insight about different retention policies and common ancestors came from this Stackexchange post: https://unix.stackexchange.com/questions/289127/zfs-send-receive-with-rolling-snapshots

Offsite Backups

The offsite backups are managed by a script:

 #!/bin/bash
 
 # run this script on the offsite backup box
 
 # local and remote are a little weird here: LOCAL has the "archival" copy of the data, REMOTE has the "offsite" backup.
 LOCAL="ts-officebox"
 REMOTE="ts-oldbox"
 
 OffsiteBackupPool="offsitepool"
 
 CommonSnapshotPrefix="offsite_daily"
 
 MostRecentLocal=$(ssh $LOCAL zfs list -t snapshot -o name -s creation | grep $CommonSnapshotPrefix | tail -n1)
 
 # Get the snapshot name of the most recent remote snapshot by stripping the pool name before the @ symbol.
 # For example: "offsitepool@offsite_daily-2022-10-31-0817" --> "offsite_daily-2022-10-31-0817"
 MostRecentRemote=$(ssh $REMOTE zfs list -t snapshot -o name -s creation | grep $CommonSnapshotPrefix | tail -n1 | cut -d'@' -f 2)
 
 ssh $LOCAL "zfs send --verbose -I $MostRecentRemote $MostRecentLocal" | pv | sudo zfs receive -F $OffsiteBackupPool

This script performs a differential backup of the archival dataset to the offsite backup. First, it finds the most recent snapshot of the archival dataset that is intended to be sent offsite. Then it finds the most recent snapshot in the offsite dataset. The script is meant to be run on the computer storing the offsite backup: it asks the computer storing the archival copy of the data to send all of the intermediary snapshots needed to fill in the gap between the archival and offsite datasets.
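
Concretely, suppose the offsite pool’s newest snapshot is the 0736 one below, while the archival box has taken two snapshots since. The -I flag ships every intermediary snapshot in one stream (the dates here are hypothetical):

 # the offsite pool already has ...11-24-0736;
 # the archival box also has the 11-25 and 11-26 snapshots
 zfs send --verbose -I offsite_daily-2022-11-24-0736 internalpool@offsite_daily-2022-11-26-0738
 # the stream carries both newer snapshots, bringing the offsite pool fully up to date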

Offline Backups

The offline backups are managed by the following script:

 #!/bin/bash
 
 OfflineDevice="/dev/disk/by-uuid/17826005912122203015";
 
 LocalPool="offsitepool";
 OfflinePool="basementpool";
 
 CommonSnapshotPrefix="offsite_daily";
 
 MostRecentDaily=$(zfs list -t snapshot -o name -s creation $LocalPool | grep $CommonSnapshotPrefix | tail -n1) ;
 MostRecentDailySnapshot=$(echo "$MostRecentDaily" | cut -d'@' -f 2);
 echo "Most recent local  daily snapshot:  $MostRecentDaily";
 
 echo "Importing all pools on offline storage: $OfflineDevice.";
 zpool import -a -d $OfflineDevice;
 
 echo "Checking on the status of $OfflinePool:";
 zpool status $OfflinePool;
 
 # Get the snapshot name of the most recent offline snapshot by stripping the pool name before the @ symbol.
 # For example: "offsitepool@offsite_daily-2022-10-31-0817" --> "offsite_daily-2022-10-31-0817"
 MostRecentOffline=$(zfs list -t snapshot -o name -s creation $OfflinePool | grep $CommonSnapshotPrefix | tail -n1 | cut -d'@' -f 2);
 echo "Most recent offline daily snapshot: $MostRecentOffline"
 
 if [ "$MostRecentDailySnapshot" = "$MostRecentOffline" ]; then
 	echo "The most recent snapshots match: $MostRecentDailySnapshot."; 
 else
 	echo -e "The most recent snapshots do not match:\n\t Daily: $MostRecentDailySnapshot \n\t Offline: $MostRecentOffline";
 fi
 
 # Show a dry-run of the proposed backup
 zfs send --dryrun --verbose -I $MostRecentOffline $MostRecentDaily
 
 echo -e "And now the moment we've all been waiting for ... \n";
 echo "zfs send --verbose -I $MostRecentOffline $MostRecentDaily | pv | zfs receive -F $OfflinePool";
 
 read -p "Do you want to run the command? [y/N] " answerBackup
 answerBackup=${answerBackup:-"N"};	
 
 case $answerBackup in
   [Yy]* ) 
 	  echo "Yes! Let's do this thing." ;
 	  zfs send --verbose -I $MostRecentOffline $MostRecentDaily | pv | zfs receive -F $OfflinePool;
 	  answerBackup="y" ;;
   [Nn]* ) 
 	  echo "No! It doesn't seem right." ;
 	  answerBackup="n" ;;
 esac
 
 echo "Exporting $OfflinePool."; 
 zpool export $OfflinePool ;

Scrubbing, Logging, and Notification

To make sure that things are running correctly, I need some way for the system to notify me regularly that everything is okay. To verify the integrity of the backups, the script /etc/cron.weekly/zfs-scrub-and-email runs weekly and starts a scrub of the datapool which contains all the snapshots. It does three things: checks the status of the pool, e-mails me to let me know that the scrub started, and makes a short note in backup-log.txt.

The way that it e-mails me is particularly hacky. It uses my account on ctrl-c.club to send an e-mail, because getting my office computer to send e-mail automatically is next to impossible.

 #!/bin/bash
 HOSTNAME=$(hostname -s)
 POOL="internalpool"
 EMAIL="parkerglynnadey@gmail.com"
 
 # check the current status of the pool
 STATUS=$(/usr/sbin/zpool status $POOL)
 
 # use ctrl-c.club to send the mail for you
 ssh pgadey@ctrl-c.club -i /home/pgadey/.ssh/id_rsa -x "echo -e \"Subject: $HOSTNAME: scrub STARTED on $POOL\n\n$STATUS\" | sendmail -t $EMAIL"
 
 # scrub the pool
 /usr/sbin/zpool scrub $POOL
 
 # make a log entry
 echo "$(date --iso-8601=seconds) : started scrub of $POOL on $HOSTNAME." >> /archive/backup-log.txt

Scrubs can take a variable amount of time, and so there needs to be some mechanism for notifying me when a scrub finishes. zed, the ZFS Event Daemon, automatically runs scripts which match /etc/zfs/zed.d/scrub_finish-* whenever a scrub finishes. By suitably modifying the script above and placing it at /etc/zfs/zed.d/scrub_finish-notify-by-email.sh, I am able to get notifications when scrubs finish. This script also sends me the tail of backup-log.txt.

 #!/bin/bash
 HOSTNAME=$(hostname -s)
 POOL="internalpool"
 EMAIL="parkerglynnadey@gmail.com"
 
 # check the current status of the zpool
 STATUS=$(/usr/sbin/zpool status $POOL)
 
 # append a bit of backup-log.txt for good measure
 STATUS="$STATUS \n \n backup-log.txt: \n $(tail -n 5 /archive/backup-log.txt)"
 
 # hackily use ctrl-c.club to send the mail for you
 ssh pgadey@ctrl-c.club -i /home/pgadey/.ssh/id_rsa -x "echo -e \"Subject: $HOSTNAME: scrub FINISHED on $POOL\n\n$STATUS\" | sendmail -t $EMAIL"
 
 echo "$(date --iso-8601=seconds) : scrub finished on $POOL on $HOSTNAME." >> /archive/backup-log.txt
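
For the hook to fire, zed needs to be able to execute the script. Wiring it in looks something like this (assuming a systemd-based system):

 # install the notification hook where zed looks for scrub_finish-* scripts
 sudo cp scrub_finish-notify-by-email.sh /etc/zfs/zed.d/
 sudo chmod +x /etc/zfs/zed.d/scrub_finish-notify-by-email.sh

 # restart zed so it picks up the new script
 sudo systemctl restart zfs-zed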

Archiving the Servers

To keep backups of my home directory on a couple remote servers, I have a script /etc/cron.weekly/archive-servers.sh that runs rsync in archive mode. This script populates the directory /archive/Servers/.

#!/bin/bash

# this backup will be performed by root@officebox
# so, use --rsh to set up ssh to act like pgadey@officebox
RSH="ssh -F /home/pgadey/.ssh/config -i /home/pgadey/.ssh/id_rsa"

# archive pgadey.ca
rsync --archive --verbose --compress \
        --rsh="$RSH" \
        pgadey@cloudbox:/home/pgadey \
        /archive/Servers/pgadey.ca

# etc. etc. for various servers

echo "$(date --iso-8601=seconds) : Servers (pgadey.ca, etc.) archived." >> /archive/backup-log.txt
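
For the cloudbox alias in the script to resolve, there needs to be a matching entry in the ssh config that --rsh points at. A sketch of such an entry (the host details here are hypothetical):

# /home/pgadey/.ssh/config
Host cloudbox
    HostName pgadey.ca
    User pgadey
    IdentityFile /home/pgadey/.ssh/id_rsa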

How to Use This Setup

This backup strategy is rather hands-on, and there are elements of it that I only do once in a long while. For example, I only update the offline storage about once per year, so I usually can’t remember how to use the setup when the time comes. Here are the instructions that I’ve left for myself.

How to Set Up a New Disk

It is helpful to have lots of copies of your backups. These are the steps that I take to set up a new external hard drive and get a copy of the data onto it.

If you’ve got a fresh hard drive, right out of the box, follow these steps. Check that the power on the external hard drive enclosure is turned off. Put the hard drive into the bay. Turn the enclosure on.

 # create a blank partition table
 sudo gparted

 # Find the name of the new device.
 # Gparted --> Devices --> etc.

 # Click through:
 # Device --> Create Partition Table
 # Select: "new partition table: msdos"

 # create a new unformatted partition

 # Click through:
 # Partition --> New
 # Select: "filesystem: cleared"

 # Find the new device and note down its UUID
 blkid 

 # some representative values
 POOL="minipool";
 UUID="4075467478855155972";

 # create a new pool on the drive
 # the -f flag is needed to force overwriting the existing partition
 sudo zpool create -f $POOL /dev/disk/by-uuid/$UUID
 sudo chown -R pgadey:pgadey /$POOL

 # pool names used by offline-backup.sh
 LocalPool="offsitepool";
 OfflinePool="basementpool";

 # Find the earliest "offsite_daily" snapshot
 zfs list -t snapshot -o name,creation -s creation internalpool | grep offsite_daily | head

 # send it to the new drive
 # (this operation will take a long while)
 sudo zfs send internalpool@offsite_daily-2023-03-28-0738 | pv | sudo zfs receive -F $POOL

 # Perform an offline-backup.sh run
 # (see details below)

Once the pool is exported, I put a little label on the physical drive. If I ever need to access the data again, the label has everything I might need to get up and running quickly.

Parker Glynn-Adey
https://pgadey.ca

POOL="minipool" 
UUID="4075467478855155972"

minipool@offsite_daily-2023-03-28-0738

An External Backup with Label
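
If the drive ever needs to be read in anger, the label has everything required to import the pool. Roughly (using the label’s values):

 # import the pool from the labelled drive, then browse it
 sudo zpool import -d /dev/disk/by-uuid/4075467478855155972 minipool
 ls /minipool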

How to Make an Offline Backup

About once a year, I update the offline backup.

To update the offline backup: First, plug in the external hard drive while the machine is running; otherwise, the ZFS filesystem on the external hard drive will mess up the boot process. Second, check the headers of the offline-backup.sh script to make sure you have the right UUIDs and drives. Third, run the following commands. (You can skip the zpool status on either side of the backup, but it’s nice to check.)

sudo zpool status
sudo /home/pgadey/bin/offline-backup.sh
sudo zpool status

The offline-backup.sh script will export the pool for you once the backup is complete. This means that you can unplug the external hard drive and return it to storage.

The Blooper Reel of False Starts

Backup Software That Didn’t Work for Me

A Ticking Clock

At one point, I had a backup setup similar to the system described above. However, it involved an external drive in a hard drive bay on my desk. It looked pretty rad; I enjoyed seeing the exposed hard drive. Every fifteen minutes, zfs-auto-snapshot would make a snapshot and I would hear the drive click into action. This was a nice reminder that the system was working. So, for a couple of weeks the drive sat there happily clicking every fifteen minutes.

I only noticed that the external hard drive wasn’t storing any data when I needed to recover a backup. It had been mounted to the wrong mountpoint for weeks, and nothing had been written to it!
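
In hindsight, a quick sanity check would have caught this. Something like the following (the pool name here is hypothetical):

 # confirm that the pool is mounted, and where
 zfs get mounted,mountpoint externalpool

 # confirm that writes are actually landing on the mountpoint
 df -h /archive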

Locked Out of My Dotfiles

During another iteration of this backup system, I had my dotfiles stored in the archival datapool. ZFS had issues with mounting an external drive at startup, so I had to boot up the system with no dotfiles (and no notes!) and mount the external drive manually. This was a huge hassle, and it prompted me to separate my dotfiles from the archival datapool.

Eventually, I switched away from keeping the archival datapool on an external hard drive to avoid these mount issues at boot time.

stow Behaves Weird with This Setup

It turns out that stow does not like complicated symbolic link situations. Right now, I have ~/Dotfiles symlinked to /archive/Dotfiles/, and this messes stow up. The fix is to lay out explicitly how the source directory (--dir) and the target are related to each other:

 stow --dir=/archive/Dotfiles --target=/home/pgadey --verbose=2 PACKAGE

A helpful script for automating this is:

 cd /archive/Dotfiles

 for name in */; do
     stow --dir=/archive/Dotfiles --target=/home/pgadey --verbose=2 "$name"
 done

Check the logs!

While I was writing up this post, I decided to check /archive/backup-log.txt to see how things were looking. It turned out that the script for archiving servers had not run in months. A bit of digging turned up the fact that the permissions on the script were set wrong, so it never ran. This prompted me to include a bit of the log in every notification of a completed scrub.
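
A quick way to spot this class of failure is to ask run-parts which scripts it would actually execute. (On Debian-style systems, cron runs the /etc/cron.weekly scripts via run-parts, which also skips file names containing a dot, so even a stray .sh suffix can silently disable a script.)

 # list the scripts that run-parts would actually execute
 run-parts --test /etc/cron.weekly

 # the usual fix: make the script executable
 sudo chmod +x /etc/cron.weekly/archive-servers.sh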

Planned Improvements

Insights from Backups

Granularity

The zfs-auto-snapshot setup that I use maintains a rolling list of thirty-one daily snapshots, plus an unlimited number of daily snapshots for the sake of offline and offsite backups. In principle, ZFS is so efficient that it wouldn’t use up much more disk space to keep an unlimited number of daily snapshots. However, I find that that amount of granularity is confusing. Whenever I’ve needed to look something up in the backups, I usually remember which month it was but have only the foggiest notion of which day it was.

A Human in the Loop

I think that it’s important to be aware of how the backup system is running, and not to have blind faith that it will work on its own. Ultimately, this backup system is a bunch of bash scripts that I hacked together. It is not enterprise-grade software by any means. So, I like to check in from time to time and see what is happening.

The weekly e-mails from the scrubs are just enough notification to make sure that I don’t forget what is going on.

Keeping a human in the loop is a concept from AI research that I think fits with this system.

Time Travel

When Have I Used These Backups?

I think that it is worth keeping a record of how and when you use your backups. This gives a sense of the actual use cases for restoring data.

Helpful Reads

Meta

Published: Sep 8, 2022

Last Modified: Aug 23, 2024


Thanks for reading! If you have any comments or questions about the content, please let me know. Anyone can contact me by email.