zebu

In your pool, backing up your stuff


All hardware will eventually die, and without care its data will die with it. As my home file server has grown, I considered a variety of technologies to ensure my data's safety before settling on FreeBSD's ZFS. ZFS provides many strong data protection features (e.g., RAID-Z and RAID-Z2 for drive redundancy, strong checksumming, snapshots, and journalling). ZFS also provides send and receive semantics for moving snapshots around, but these primitives aren't well integrated. I wrote Zebu as a simple, small-scale backup system to leverage these primitives in ZFS.

General Design

Zebu is a minimalistic system, intended to be run from cron. The single command, zebu, will process data in three phases: snapshot, cleanup, and transmit. Zebu operates over ZFS filesystems (and can optionally recursively descend through sub-filesystems). Configuration is driven from a central configuration file (/etc/zebu/zebu.conf, by default). zebu.conf lists global parameters, as well as a configuration stanza for every target ZFS filesystem.

During the snapshot phase, zebu will create (optionally recursively) snapshots in the configured filesystem(s) named zebu-<timestamp>. Before I developed Zebu, I used a system called Dirvish to back up remote machines using rsync over ssh. Zebu can optionally use rsync to update data in the ZFS filesystem before creating a snapshot, providing similar functionality. Much like Dirvish, Zebu supports a list of files (regular expressions, really) to exclude from the rsync; both global and filesystem-specific exclude lists are allowed, and both are specified in zebu.conf. Once the rsync completes and the appropriate logs are written, zebu will snapshot the ZFS filesystem.
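
As a rough illustration of the naming scheme (not Zebu's actual code), the sketch below builds a zebu-<timestamp> snapshot name and the matching zfs snapshot command line. The timestamp layout, the helper names, and the /sbin/zfs path are assumptions made for the example; the text above only says that the name contains a timestamp.

```python
from datetime import datetime, timezone

# Assumed timestamp layout; the real format used by Zebu is not documented here.
TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"


def snapshot_name(filesystem, when):
    """Return 'filesystem@zebu-<timestamp>' for the given datetime."""
    return f"{filesystem}@zebu-{when.strftime(TIMESTAMP_FORMAT)}"


def snapshot_command(filesystem, when, recurse=False):
    """Build the argv for 'zfs snapshot', adding -r for recursive snapshots."""
    argv = ["/sbin/zfs", "snapshot"]
    if recurse:
        argv.append("-r")  # also snapshot every descendant filesystem
    argv.append(snapshot_name(filesystem, when))
    return argv


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Only prints the command; pass it to subprocess.run() to really run it.
    print(" ".join(snapshot_command("pool/homes", now, recurse=True)))
```

Building the command as a list, rather than a shell string, keeps filesystem names from being reinterpreted by a shell.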

The cleanup phase removes old snapshots. Each zebu-created snapshot contains a timestamp in its name, so Zebu can simply compare the timestamp in each snapshot name against the configured expiration time (expiretime) and destroy the snapshots that are too old. zebu will never remove the last (most recent) snapshot, just in case something goes awry in backup processing.
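
The expiry rule can be sketched as follows. This is an illustration under assumptions, not Zebu's implementation: I assume expiretime is written as days:hours:minutes:seconds (so 30:0:0:0 means 30 days), that the timestamp layout is the one shown, and that the list holds the snapshots of a single filesystem. All function names are hypothetical.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"  # assumed layout of the timestamp
PREFIX = "zebu-"


def parse_expiretime(text):
    """Parse 'days:hours:minutes:seconds' (assumed meaning) into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def snapshot_time(snapshot):
    """Return the datetime embedded in a zebu snapshot name, else None."""
    _, _, label = snapshot.partition("@")
    if not label.startswith(PREFIX):
        return None  # not created by zebu, so leave it alone
    try:
        return datetime.strptime(label[len(PREFIX):], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def expired(snapshots, now, max_age):
    """List zebu snapshots older than max_age, never including the newest."""
    dated = []
    for name in snapshots:
        stamp = snapshot_time(name)
        if stamp is not None:
            dated.append((stamp, name))
    dated.sort()
    # dated[:-1] skips the most recent snapshot, which is always kept.
    return [name for stamp, name in dated[:-1] if now - stamp > max_age]
```

Each returned name would then be handed to zfs destroy. Snapshots that do not match the zebu- prefix are ignored, so manually created snapshots are never candidates for removal in this sketch.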

The transmit phase pipes the output of zfs send into the configured transmission command. zebu will recursively descend over child filesystems (barring configuration to the contrary) and send each one individually, rather than use a recursive ZFS send. Recursive sends are not supported in early versions of ZFS (in FreeBSD 7.x) and, where they are supported, copy filesystem attributes as a side effect. Since Zebu doesn't copy over filesystem attributes, it's possible for the source filesystem to be available via NFS and uncompressed while the destination does not advertise NFS and uses gzip, which is generally a desirable trait. Unfortunately, sending each child separately can lead to some issues if the transmit phase is interrupted (see Limitations and Known Issues).
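
The pipe itself can be sketched in a few lines. This is a minimal illustration, not Zebu's code: the function names are hypothetical, the transmission command is split with shlex instead of being run through a shell (an assumption about how transmit_cmd is interpreted), and only a plain full send of one snapshot is shown.

```python
import shlex
import subprocess


def list_children_command(filesystem):
    """argv that prints a filesystem and its descendants, one name per line."""
    return ["/sbin/zfs", "list", "-H", "-r", "-o", "name", "-t", "filesystem", filesystem]


def send_command(snapshot):
    """argv for a full (non-incremental) send of a single snapshot."""
    return ["/sbin/zfs", "send", snapshot]


def transmit(producer_argv, transmit_cmd):
    """Pipe the producer's stdout into transmit_cmd; True if both exit with 0."""
    consumer_argv = shlex.split(transmit_cmd)
    producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_argv, stdin=producer.stdout)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    consumer_status = consumer.wait()
    producer_status = producer.wait()
    return producer_status == 0 and consumer_status == 0
```

A caller would run list_children_command once, then call transmit(send_command(snapshot), transmit_cmd) for each child in turn. Checking both exit statuses matters because either end of the pipe can fail on its own.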

Suggested Usage

Consider two servers: a primary and a backup. Zebu is designed to run from cron on both of these, performing all three phases on the primary and only cleanup on the backup (though zebu can also be used on the backup server to snapshot and transmit its system-local files to yet another machine, or back to the primary).

For example, here's a configuration similar to what I use on my primary file server:

[DEFAULT]
basepath=/
excludes=/etc/zebu/excludes
expiretime=30:0:0:0
rsync_path=/usr/local/bin/rsync
transmit_cmd=/usr/bin/ssh -x -qT -l root backup "/sbin/zfs recv -F -d pool"
lockfile=/tmp/zebu.lock

[pool/backup/archive]
recurse=yes
doTransmit=yes

[pool/backup/time_machine]
recurse=yes
doTransmit=no

[pool/backup/linode]
rsync_server=linode.example.com
doTransmit=yes

[pool/homes]
recurse=yes
doTransmit=yes
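
To show how a file laid out like this can be read, here is a small sketch using Python's standard configparser module, where values in [DEFAULT] are inherited by every filesystem stanza. It is only an illustration: I am not claiming this is how zebu parses its config, the function name is hypothetical, and the fallback values for missing options are my assumptions.

```python
import configparser

SAMPLE = """\
[DEFAULT]
basepath=/
expiretime=30:0:0:0
lockfile=/tmp/zebu.lock

[pool/backup/time_machine]
recurse=yes
doTransmit=no

[pool/backup/linode]
rsync_server=linode.example.com
doTransmit=yes
"""


def load_filesystems(text):
    """Return {filesystem: settings} for every stanza except [DEFAULT]."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as doTransmit case-sensitive
    parser.read_string(text)
    filesystems = {}
    for name in parser.sections():  # sections() never includes DEFAULT
        section = parser[name]
        filesystems[name] = {
            # The fallback values below are assumptions, not documented defaults.
            "recurse": section.getboolean("recurse", fallback=False),
            "doSnapshot": section.getboolean("doSnapshot", fallback=True),
            "doTransmit": section.getboolean("doTransmit", fallback=True),
            "rsync_server": section.get("rsync_server"),  # None when absent
            "expiretime": section.get("expiretime"),  # inherited from DEFAULT
        }
    return filesystems


if __name__ == "__main__":
    for name, settings in load_filesystems(SAMPLE).items():
        print(name, settings)
```

Note that a parser like this rejects a stanza that sets the same option twice, which is a cheap way to catch copy-and-paste mistakes in a config.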

On the backup server, you can use a similar config file (to handle regular cleanups, and local filesystems):

[DEFAULT]
basepath=/
excludes=/etc/zebu/excludes
expiretime=30:0:0:0
rsync_path=/usr/local/bin/rsync
transmit_cmd=/usr/bin/ssh -x -qT -l root primary "/sbin/zfs recv -F -d pool"
lockfile=/tmp/zebu.lock

[pool/backup/archive]
recurse=yes
doTransmit=no
doSnapshot=no

[pool/backup/linode]
doTransmit=no
doSnapshot=no

[pool/homes]
recurse=yes
doTransmit=no
doSnapshot=no

[pool/local]
doTransmit=yes

These configs will result in several filesystems (pool/backup/archive, pool/homes, and pool/backup/linode) being snapshotted on primary, then transmitted to backup. pool/backup/linode will see an rsync from linode.example.com (a remote host being backed up) before its snapshot is taken. The corresponding config on the backup server will ensure old snapshots are cleaned out there as well (lest the primary server's transmit phase cause them to accumulate). Additionally, the backup server will transmit pool/local over to the primary server. Note that without a pool/local stanza in the primary server's config, it's likely that snapshots from this filesystem will accumulate on the primary indefinitely.

Limitations and Known Issues

Since Zebu recursively descends filesystems itself during transmit, a transmit operation (unlike snapshot or cleanup) is not atomic. Unfortunately, Zebu cannot currently clean up well from a failure during transmit. If the sender or receiver processes (or machines) die during a recursive transmit, some child filesystems will have been transmitted while others have not. Zebu will merely retry all transfers on the next run, and will probably encounter errors copying some of the child filesystems. Currently, there is no logic to handle an error in a child transfer; this will just appear as a failure of the transmit phase for the parent filesystem, and Zebu will be unable to transfer any data for that filesystem until the situation is manually rectified.

No attempt is made to replicate filesystem options (i.e., the properties listed by zfs get all). It's unlikely this will be added, since it's not always clear what the expected behavior should be on the backup host.

All error reporting is handled via stdout and stderr. I intended to run zebu via cron, and output from my cron jobs goes somewhere. Your mileage may vary.


Random bits of code

All code is copyright Mike Shuey, and licensed under GPL version 2.

Source tarball zebu-1.0.0.tar.gz

$Date: 2012/01/12 21:32:37$