On making backups

So, a while back I wrote about external hard drives, partly because I was interested in performing backups. I also noticed some speed increases when I did backups to my external hard drive, so I decided to look into performing regular backups onto it in addition to the semi-regular backups I do onto DVDs.

Why back up to an external hard drive?

I don't have to waste a DVD every time I perform a backup, especially if I am making backups on a daily basis. Plus, the same amount of data can be backed up to my external hard drive in less time.

Note that I still back up to DVDs, since those are more durable and can easily be taken offsite. But those backups are done every few weeks at best, so backups to my external hard drive are done more frequently -- usually every few days.

What is backed up?

Various documents, my photography (I take a lot of pictures), source code for projects I am working on, my Moneydance financial data, and tarred/gzipped backups of websites that I manage.

What is not backed up?

Any movies and music that I have--since those files are static (i.e., they never change), I just burn them to DVD when I have enough content to actually fill a DVD. There's simply no need for me to keep backing them up over and over. Sure, it's nice to have multiple copies of this stuff, but I simply do not have that much of a need for my MP3s. (And it's not like I can't rerip them from my CD collection.) Any pictures that are more than a year old are also burned to DVD and removed from my Pictures/ directory, since I am no longer working with them on a day-to-day basis.

Full directory structures, or tarballs?

For external hard drives, I discovered that copying full directory structures (i.e., cp -r wildcard dest) doesn't work out so well. First, there is overhead for writing each file and directory at the destination, and second, external hard drives don't cope well with large numbers of files. After I made a few file-based backups to the external hard drive, I discovered that when I connected it, mdimport would run for a full minute while the entire directory structure was traversed. Not fun.

I found creating tarballs to be a superior choice here, since only a single file is opened for writing, and data simply keeps getting appended to it. Very simple and efficient. While getting individual files out of the tarball might be a bit complicated, I do not anticipate having to do this often. And I always have the option of just extracting the entire tarball to a tmp directory on my machine and picking through the files.
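For what it's worth, pulling a single file back out of a tarball is usually only a couple of commands. Here is a minimal sketch, assuming a tar that supports the -C option (both GNU tar and the BSD tar that ships with OS X do). The directory and file names are made up for illustration and are not from my actual backups.

```shell
#!/bin/sh
# Sketch: restore one file from a tarball without unpacking everything.
# All paths and file names below are hypothetical examples.
set -e

# Work in a throwaway directory so nothing real is touched.
WORK=$(mktemp -d)
cd "$WORK"

# Build a tiny sample tree and archive it.
mkdir -p Data/docs
echo "hello" > Data/docs/notes.txt
echo "world" > Data/docs/other.txt
tar cf backup.tar Data

# List the archive contents to find the exact member name.
tar tf backup.tar

# Extract just that one member into a scratch directory.
mkdir restore
tar xf backup.tar -C restore Data/docs/notes.txt

RESTORED=$(cat restore/Data/docs/notes.txt)
echo "restored: $RESTORED"
```

The member name passed to tar has to match the listing exactly, which is why the tar tf step comes first.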

The hardware and software

Just to clarify what kind of hardware and software I'm using for this little experiment:

- iMac G5 20" 2.0 GHz
- 1 GB of RAM
- 250 GB (really 232.89 GB) internal IDE drive
- Mac OS X 10.4.10
- External drive (FireWire interface): OWC Mercury Elite, 76.69 GB

The results!

I wrote a shell script to perform backups. It processes command-line options (such as whether to use compression, and the destination directory), builds the flags for the tar command, then runs the tar command under the UNIX time command, which reports wall-clock (real) seconds, as well as user seconds (CPU time that the program spent executing its own code) and system seconds (CPU time spent in system calls on the program's behalf).

For my tests, my variables included both the destination directory (my home directory versus a directory on the removable drive) and whether compression was used or not. Here are the results:

                      Compression with gzip   No compression
Home directory
  Time (real)         13m40.982s              7m2.057s
  Time (user)         7m21.610s               0m3.260s
  Time (sys)          1m3.942s                0m52.204s
  Size                3.09 GB                 3.51 GB

External hard drive
  Time (real)         11m41.163s              3m53.975s
  Time (user)         7m16.400s               0m2.793s
  Time (sys)          1m1.679s                0m41.010s
  Size                3.09 GB                 3.51 GB

Conclusions

The better performance on the external hard drive can be explained by understanding what is happening on each hard drive. When I am backing up to a tarball in my home directory, the tarball is being written to the same drive that the files are being read from. Since a hard drive cannot read and write at the same time, every write to the tarball had to wait for a gap in the reads. While modern OSes have gotten very good at using caching and scheduling disk activity in idle periods, the OS can only do so much. Contrast this with backing up to an external hard drive, where only reads were done on the internal drive and only writes were done on the external drive.

For the type of data I was backing up, trying to compress it ended up being a big time waster. This is obvious from the difference in user time: for both the internal and external hard drives, using compression resulted in over 7 minutes of CPU time, versus about 3 seconds without it. And the space savings were a mere 0.42 GB, or about 12% of the size of the uncompressed tarball.
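To make the trade-off concrete, here is a quick sketch of the arithmetic using the numbers from the results above. It only assumes that awk is available; the variable names are just for illustration.

```shell
#!/bin/sh
# Sizes in GB, taken from the results above.
UNCOMPRESSED=3.51
COMPRESSED=3.09

# Space saved as a percentage of the uncompressed size.
SAVED_PCT=$(awk -v u="$UNCOMPRESSED" -v c="$COMPRESSED" \
	'BEGIN { printf "%.1f", (u - c) / u * 100 }')
echo "Space saved: ${SAVED_PCT}%"

# Extra wall-clock seconds that gzip cost on the external drive:
# 11m41.163s with compression versus 3m53.975s without.
EXTRA_SECS=$(awk 'BEGIN { printf "%.0f", (11*60 + 41.163) - (3*60 + 53.975) }')
echo "Extra time on external drive: ${EXTRA_SECS}s"
```

In other words, on the external drive compression cost nearly eight extra minutes of wall-clock time to save less than half a gigabyte.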

Back when I first got fed up with Retrospect and tried making tarballs of my data, I originally used the method of compressed tarballs in my home directory, with occasional backups to DVD. But based on this, it looks like uncompressed backups to my external hard drive are going to be the way to go from now on.

The shell script

Finally, if you made it this far, I might as well share the shell script that I used for running these tests.

#!/bin/bash
#
# Perform a backup of our stickies and our system
#

set -e

#
# What directory will the file go into?
#
DIR=$HOME

DATE=`date +%Y%m%d-%H%M%S`
if test ! "$HOSTNAME"
then
	HOSTNAME="dmuth.local"
fi

#
# Parameters that can be specified on the command line
#
P_VERBOSE=""
P_COMPRESS=""
P_TARGET=""

#
# Print out the program's syntax
#
function print_syntax() {
	echo "Syntax: $0 [--verbose] [--compress] [target directory]"

} # End of print_syntax()


#
# Parse our arguments and populate config variables
#
function parse_args() {

	while test "$1"
	do
		CURRENT=$1
		shift

		if test "$CURRENT" == "--verbose"
		then
			P_VERBOSE=1

		elif test "$CURRENT" == "--compress"
		then
			P_COMPRESS=1

		elif test "$CURRENT" == "--help"
		then
			print_syntax
			exit

		elif test "$CURRENT" == "-h"
		then
			print_syntax
			exit

		else 
			P_TARGET=$CURRENT

			#
			# Check our target for sanity
			#
			if test ! -d "$P_TARGET"
			then
				echo "$0: Target '$P_TARGET' is not a directory!"
				exit 1
			fi

			if test ! -w "$P_TARGET"
			then
				echo "$0: Target '$P_TARGET' is not writable!"
				exit 1
			fi

		fi

	done

	#
	# If not specified, assume that the home directory is writable
	#
	if test ! "$P_TARGET"
	then
		P_TARGET=$HOME
	fi

} # End of parse_args()


#
# Get the flags for our tar command.
# They are printed out, so this function should be called via the backtick
# operators so that the output can be captured.
#
function get_tar_flags() {

	if test "$P_VERBOSE"
	then
		echo -n "v"
	fi

	if test "$P_COMPRESS"
	then
		echo -n "z"
	fi

} # End of get_tar_flags()


#
# Main program
#
parse_args "$@"
#echo "TEST: Verbose: $P_VERBOSE, Compress: $P_COMPRESS, Target: $P_TARGET"


#
# Backup our stickies, since they don't work nicely with symlinks.
#
cp "$HOME/Library/StickiesDatabase" "$HOME/Data/Stage1/Library"

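#
# Our target file. Only use the .gz suffix when compression was requested.
#
EXT="tar"
if test "$P_COMPRESS"
then
	EXT="tar.gz"
fi
TARGET=${P_TARGET}/${HOSTNAME}-${DATE}.${EXT}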

#
# Make the tarball
#
cd "$HOME"

#
# Get our tar flags
#
FLAGS=`get_tar_flags`

#
# Our source folders to back up
#
SOURCES="Data local"

#
# Run the tar command inside of time so we know how long things took.
#
time {
	#
	# We're not creating the tar command ahead of time because of issues I
	# had with quotes and spaces in the target name.
	#
	tar cf${FLAGS} "${TARGET}" ${SOURCES} || true
}