Backing up a running virtual machine

Warning: This article is a work in progress. Please send feedback with any suggestions about the article or the technique.


1. Introduction

This article explains how to back up a virtual hard disk to a remote location, even while it is in use.

One advantage of running software in a virtual machine is that the entire disk can be backed up in one go, including Operating System, software, configuration files, registry, permissions, data and all. Re-establishing a system after a failure is therefore quicker and more reliable than re-installing software and restoring data.

The method employs LVM to take a snapshot of the guest disk and then uses rsync to update changes to a previous backup on a remote server. If there is a database server on the guest then it is flushed & locked at the point the snapshot is taken. This method came into use around 2006 following wider availability and awareness of virtualization software, processor enhancements, cheaper faster network bandwidth, and cheaper bigger disks.

Requirements of the backup process include:

The process does not interfere with the guest, which remains live throughout
The disk is backed up in a consistent state
Only changed data is transmitted to the remote store for each backup. It is not efficient to transmit the entire disk during each backup when only a small percentage changes between backups.
A variety of guest Operating Systems and File Systems are supported
The process is secure, robust and can be automated
Both servers and desktops are supported
No particular desktop manager is required (e.g. Gnome/KDE)

2. Prerequisites

2.1. LVM

LVM (Logical Volume Manager) is a volume manager rather than a file system. It can create a snapshot of a logical volume, and the snapshot does not change while it is being backed up. Creating the snapshot is near-instant, so it does not interrupt the running VM.

To minimize the disk space required for the snapshot, store one virtual disk per logical volume. If the guests' disk activity is low and/or you have plenty of disk space then you can store multiple virtual disks per logical volume without the snapshot running out of space.
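The snapshot life cycle can be sketched roughly as follows. This is an illustration, not the article's backup script: the volume group vg0, the logical volume vm01, the 2G snapshot size and the mount point are all assumptions. DRY_RUN defaults to 1, so the commands are only printed; running them for real requires root and an existing LVM setup.

```shell
#!/bin/sh
# Sketch of an LVM snapshot life cycle. All names below are hypothetical.
VG="${VG:-vg0}"                # volume group (assumed)
LV="${LV:-vm01}"               # logical volume holding the virtual disk (assumed)
SNAP="${LV}-snap"              # name given to the snapshot
SNAP_SIZE="${SNAP_SIZE:-2G}"   # space reserved for writes made during the backup
MNT="${MNT:-/mnt/${SNAP}}"     # where the snapshot is mounted read-only
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands only, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

snapshot_create() {
    run lvcreate --snapshot --size "$SNAP_SIZE" --name "$SNAP" "/dev/$VG/$LV" &&
    run mkdir -p "$MNT" &&
    run mount -o ro "/dev/$VG/$SNAP" "$MNT"
}

snapshot_remove() {
    run umount "$MNT"
    run lvremove -f "/dev/$VG/$SNAP"
}

snapshot_create
# ... copy the virtual disk file from "$MNT" here ...
snapshot_remove
```

The snapshot only needs enough space to hold the blocks that change while it exists, which is why a size much smaller than the volume itself is usually sufficient. If it fills up, LVM invalidates the snapshot, so the size is worth choosing generously.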

There are command line and graphical tools available to set up an LVM disk, but documenting them is beyond the scope of this article.

2.2. Virtual machine host

This script was originally written for KVM, but it does not interact directly with the virtualization software, so it should be able to back up any guest stored on LVM. Similar techniques have been shown to work with other virtualization software such as Xen and VMware.

2.3. Virtual machine guest

All guest operating systems are supported.

If you want to minimize the amount of data transferred to the backup server, then move temporary files and any other files that don't need to be backed up to a second virtual disk.

For KVM, use the -hda and -hdb parameters to specify the disks on the command line:

/usr/bin/qemu-kvm -k en-gb -m 1024 -hda /media/vm01/vm01-hda.raw -hdb /media/archive/vm01-hdb.raw

Note that the second disk does not need to be in the same logical volume as it does not need to be in the snapshot. In fact, excluding it reduces the space used by the snapshot and reduces the impact on the guest's performance during the backup.

The files that can be excluded from the backup depend on the guest configuration, but here are some suggestions:

On Linux

Set the TMPDIR environment variable.
If running PHP, set the upload_tmp_dir and session.save_path variables in php.ini

On Windows

Move the swap file
Set the TMP and TEMP environment variables
If running PHP, set the upload_tmp_dir and session.save_path variables in php.ini

2.4. Guest database access from the host

If the guest is running a database, then a database user account is required to flush and lock the database. It is not advisable to use root as the password must be stored in plain text in the backup script. Furthermore, the user must be granted permission to connect from another machine (the host).

It is advisable to create a dedicated database user account and grant it the minimum necessary permissions. This ensures that if the username and password fall into the wrong hands then they are of limited use.

create user 'backup' identified by 'password-goes-here';
grant lock tables, reload on *.* to 'backup';
flush privileges;

If there is a firewall on the guest, then the appropriate port must be open to allow the host to access the guest database. For MySQL, the default port is 3306. The firewall can be configured to restrict access via that port to the host's IP address. The host's firewall typically blocks database access.
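As a minimal sketch of such a restriction on a Linux guest that uses iptables, the rules might look like this. The host address 192.168.1.10 is a made-up example, and the rules are only printed unless DRY_RUN=0 is set. Guests that use a different firewall tool will need the equivalent rules in that tool's syntax.

```shell
#!/bin/sh
# Sketch: allow MySQL connections to the guest only from the VM host.
HOST_IP="${HOST_IP:-192.168.1.10}"   # hypothetical address of the VM host
DB_PORT="${DB_PORT:-3306}"           # MySQL default port
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

allow_db_from_host() {
    # Inserted at the top of the chain so they take effect before
    # any catch-all REJECT or DROP rule that is already present.
    run iptables -I INPUT 1 -p tcp -s "$HOST_IP" --dport "$DB_PORT" -j ACCEPT
    run iptables -I INPUT 2 -p tcp --dport "$DB_PORT" -j DROP
}

allow_db_from_host
```

Rules added this way are lost at reboot unless they are saved using the distribution's own mechanism.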

2.5. Backup server access from the host

Before the backup can run as a scheduled cron task, ssh access must be configured to connect to the backup server without prompting for a password. Rather than store the password in a file, the backup script uses keys. See the REMOTEKEY definition.

It is recommended that a dedicated account is created on the backup server that has just enough permissions to connect via ssh and access the backup storage. Also, a dedicated key is created for backup use only. Then if the host is compromised, the risk to the backup server is limited.

# Create an unencrypted private key that is used only for backups
# (RSA is used because current OpenSSH releases no longer accept DSA keys)
ssh-keygen -t rsa -b 4096 -P "" -f ~/.ssh/backup -C "ssh key only for backups"
# Only the owner (root) can access the key files
chmod 600 ~/.ssh/backup*
# Copy the public key to the backup server
scp ~/.ssh/backup.pub backup@192.168.1.1:
# Log in to the backup server (enter the password when prompted)
ssh 192.168.1.1 -l backup
# On the backup server: add the public key to the authorized keys
mkdir -p .ssh && chmod 700 .ssh
cat backup.pub >> .ssh/authorized_keys
chmod 600 .ssh/authorized_keys
rm backup.pub
exit
# Now you can log in as backup without entering a password
ssh 192.168.1.1 -l backup -i ~/.ssh/backup

If you are prompted for a password in that last step, then you may find this troubleshooting guide useful: http://sial.org/howto/openssh/publickey-auth/problems/
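Because cron cannot answer a prompt, it may be worth checking the login non-interactively as well. The sketch below uses ssh's BatchMode option, which makes ssh fail instead of asking for a password. The value given to REMOTEKEY and the server address are assumptions based on the commands above; the real backup script may define them differently.

```shell
#!/bin/sh
# Sketch: confirm that key-based login works without any prompt, as cron needs.
REMOTEKEY="${REMOTEKEY:-$HOME/.ssh/backup}"   # assumed path of the private key
REMOTE="${REMOTE:-backup@192.168.1.1}"        # assumed account and server
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands only, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

check_login() {
    # BatchMode=yes disables password prompts, so a broken key setup
    # shows up as a non-zero exit status instead of a hanging prompt.
    if run ssh -i "$REMOTEKEY" -o BatchMode=yes -o ConnectTimeout=10 "$REMOTE" true; then
        echo "passwordless login OK"
    else
        echo "passwordless login FAILED" >&2
        return 1
    fi
}

check_login
```

In dry-run mode the check always reports OK because nothing is actually executed; set DRY_RUN=0 to test against a real server.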

Warning: A private key granting access to the backup server is now stored in root's .ssh folder. It could be argued that the private key should be encrypted with a passphrase. The key can then be loaded by ssh-add to avoid ssh prompting for it. However, ssh-add needs to be re-run after a reboot. The risk of backups failing after a power outage is worse than the risk to the backup server's security, especially after measures have been taken to limit and mitigate the security risk at both ends of the connection. If you disagree, then use a passphrase to encrypt the key and remember to re-add the key after every reboot of the host.

If you want a better understanding of ssh key authentication, this is a good start: http://www.ibm.com/developerworks/library/l-keyc.html

3. Installation

3.1. Install ShellSQL

If the guest is running a database then ShellSQL is used to flush and lock the database while the snapshot is taken. Add the RPMforge repository, which contains ShellSQL, if you haven't already added it.

yum install shellsql

ShellSQL is used to maintain the connection to the guest's database, and consequently maintain the database lock, while the snapshot is taken. Without it, the lock is lost as soon as the SQL lock command is completed.
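To illustrate why the connection has to stay open, here is a sketch of the same idea using the standard mysql client instead of ShellSQL. This is an alternative technique, not what the backup script does. It relies on the client's \! (system) command to run the snapshot command on the host from inside the session that holds the lock. The guest address and the snapshot command are hypothetical, credentials are assumed to come from ~/.my.cnf, and the behaviour of \! when input is piped should be tested with your client version before relying on it.

```shell
#!/bin/sh
# Sketch: hold a global read lock in ONE mysql session while the snapshot is taken.
# Uses the mysql client's \! command rather than ShellSQL (an assumption, see above).
DB_HOST="${DB_HOST:-192.168.1.20}"   # hypothetical address of the guest
DB_USER="${DB_USER:-backup}"         # the dedicated database account
SNAP_CMD="${SNAP_CMD:-lvcreate --snapshot --size 2G --name vm01-snap /dev/vg0/vm01}"
DRY_RUN="${DRY_RUN:-1}"              # 1 = print the statements only, 0 = send them

lock_sql() {
    # The lock lasts only while the session is open, so the snapshot
    # command is issued between the lock and the unlock statements.
    printf '%s\n' \
        "FLUSH TABLES WITH READ LOCK;" \
        "\\! $SNAP_CMD" \
        "UNLOCK TABLES;"
}

if [ "$DRY_RUN" = "0" ]; then
    lock_sql | mysql --host="$DB_HOST" --user="$DB_USER"
else
    lock_sql
fi
```

If each statement were sent in a separate mysql invocation, the lock would be released as soon as the first invocation exited, before the snapshot was created.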

3.2. Install the backup script

Download the backup script and save it on the host. Edit the script, changing the values defined at the top of the file and the parameters of the function calls at the bottom. Then grant execute permission on the script and run it once by hand to check that it works:

chmod +x backup.sh

./backup.sh

3.3. Schedule the script

Schedule the backup script to run at the required time and frequency.

Edit the crontab file via the command-line like so:

crontab -e

To run daily at midnight, the crontab entry would look like this:

#M H D M W
0 0 * * * /root/backup.sh > /dev/null 2>&1
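A scheduled run can start while the previous one is still transferring data. One common way to guard against that is a lock directory, sketched below. The lock path is an assumption, and this guard is not part of the backup script as described here.

```shell
#!/bin/sh
# Sketch: refuse to start if a previous run is still in progress.
LOCKDIR="${LOCKDIR:-${TMPDIR:-/tmp}/backup.lock}"   # hypothetical lock location

acquire_lock() {
    # mkdir is atomic: it fails if the directory already exists.
    mkdir "$LOCKDIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

if acquire_lock; then
    # Remove the lock when the script exits normally.
    trap release_lock EXIT
    echo "lock acquired, backup would run here"
else
    echo "backup already running, exiting" >&2
    exit 1
fi
```

If the script is killed in a way that skips the trap (for example by a power failure), the directory is left behind and has to be removed by hand before the next run can start.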

4. Unanswered Questions

First of all, is this method the best way to meet the requirements listed in the introduction?
Would libvirt make any of this easier? How?
It is feasible to encrypt the guest disk in such a way that it cannot be read if the host or the backup server is compromised. What would be the best method?
Some similar scripts posted elsewhere on the web suspend the guest when the snapshot is started. Is that necessary? When the backed up virtual disk image is booted up, it would be as though a power failure had occurred, and systems ought to be able to cope with that.

5. Options for improving Performance

This backup method is a resource hungry task and we are obliged to optimize it. However, there are so many variables affecting performance that methodical performance testing is the only way to answer these questions. Does anyone know a computing undergraduate looking for a project?

Is it more efficient to use raw disks or qcow? Qcow compressed disks are smaller but maybe raw disks change less, because they are not re-organized to improve compression. Maybe a small change might result in a cascade of changes in the compression dictionary. It would be better to have slightly more traffic every day than huge fluctuations.
Which file system is most efficient? Ext3 or xfs or some other? We need to consider the performance of the guest under normal conditions and during the snapshot. We need to consider the rate of change in the virtual disk file as this has a knock on effect on rsync traffic. Here is an informed discussion on the first point.
What are the optimal rsync options? Match the rsync block size to the raw disk sector size? Huge tracts of the file will be unchanged so a large block size might reduce traffic.
Is there a more efficient virtual disk synchronization tool than rsync? It is surely the best tool that is installed on Linux as standard, but is there a better, more specialized tool, optimized for synchronizing large disk images?
How does the frequency of backups affect total resource usage? If it runs hourly I would expect it to use less disk space, much more CPU, maybe twice as much bandwidth.
Is it worth defragmenting the disk on Windows? Would it make rsync more efficient? If it is done every day and limited to smaller (<64 MB) files, is there still a risk of a massive re-organization that would need to be rsync'd?
Storing the snapshot volume on a separate physical hard drive may improve performance because during the snapshot, every write to virtual disk is matched by a corresponding write to the snapshot.
What is the impact of the backup on the guest's performance? Is the process network or processor bound?
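For anyone who wants to experiment with the rsync questions above, a starting point might look like the sketch below. The source and destination paths, the value given to REMOTEKEY and the block size are assumptions to vary during testing, not recommendations. The --stats option makes rsync report how much data was matched and how much was sent literally, which gives a figure to compare between runs.

```shell
#!/bin/sh
# Sketch: update the remote copy of a raw disk image in place and report statistics.
SRC="${SRC:-/mnt/vm01-snap/vm01-hda.raw}"         # image inside the mounted snapshot (assumed)
DEST="${DEST:-backup@192.168.1.1:vm01-hda.raw}"   # remote copy (assumed)
REMOTEKEY="${REMOTEKEY:-$HOME/.ssh/backup}"       # assumed path of the private key
BLOCK_SIZE="${BLOCK_SIZE:-131072}"                # bytes; a value to experiment with
DRY_RUN="${DRY_RUN:-1}"                           # 1 = print commands only, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

sync_image() {
    # --inplace updates the existing remote file instead of writing a new copy.
    # --block-size sets the checksum block size used to find unchanged regions.
    run rsync --inplace --block-size="$BLOCK_SIZE" --stats \
        -e "ssh -i $REMOTEKEY -o BatchMode=yes" "$SRC" "$DEST"
}

sync_image
```

Running the same backup with several BLOCK_SIZE values and comparing the reported literal data would be one simple way to begin the methodical testing called for above.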

6. To Do

Before locking the database, the script should call an optimization script on the guest to speed up the file transfer. The contents of the optimization script would depend on the guest's operating system and to some extent on the type of application software.
- Process web-stats to avoid backing up web log files
- Empty the recycle bin
- Fill empty space with zeros. (On Windows this can be done with the SysInternals SDelete command.)
Add instructions (or a link to instructions) for installing Cygwin/openssh on a Windows guest, so the optimization scripts can be called.
After locking the database, the script should call the guest via ssh to flush the write cache using the sync command. On Linux, sync is installed as standard. On Windows it can be downloaded free from SysInternals.
Use certificate-based authentication to access the MySQL database instead of storing the password in the script in clear text.
Use sudo to call lvcreate and mount so that the script can be run under an account other than root.
There needs to be some file management on the remote backup file store. The script runs rsync with the --inplace option so that a failed backup can be restarted. As a result, the backup file is in an unusable state during the backup, so there needs to be a 'last good copy' and a working version. We also need to be sure the backup is good before replacing the last good copy, e.g. by comparing checksums. Maybe we should treat the synchronised disk like a drop-box, so the backup server polls for arrivals and copies them to a safe place. This prevents backups being lost if the host is compromised. And why not implement a Grandfather-father-son file rotation while we're at it.
To make upgrading easier, either the definitions should be moved to a configuration file, or the functions should be moved to a functions file. Maintaining the flexibility of coping with multiple VMs, any of which may have zero or more databases would be difficult with the first option.
Error handling e.g.
- If the database is not running then a warning is logged and the backup continues
- Database or SSH connection credentials fail
- The snapshot overflows the allocated space
Logging, Error reporting and alerting. The script currently writes to /var/log/backup.err & backup.log. It should use the built-in messaging mechanism (syslog?) so error messages are more likely to be found and acted upon. Also, the verbosity of the logging can be more easily adjusted.
The script should exit if it's already running. It's quite possible that rsyncing one or more entire disks could take more than 24 hours. The script should log a warning if it has taken more than 24 hours to complete.
Using SDelete to zero deleted files bloats the raw disk to its full size, so it is no longer treated as sparse. One way to make a file sparse again is to copy it with cp --sparse=always. That's no good in this case because the guest would have to be stopped. Maybe we just have to allocate 32GB of space for a 32GB file.
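The 'last good copy' idea from the list above could be sketched along these lines, to be run on the backup server after a transfer completes. The function name promote_if_verified is hypothetical, and the sketch assumes the host has computed a SHA-256 checksum of the image in the snapshot and passed it across by some means. It is an illustration of the idea, not an existing part of the script.

```shell
#!/bin/sh
# Sketch: replace the last good copy only when the working copy verifies.
# Usage: promote_if_verified WORKING_FILE EXPECTED_SHA256 LAST_GOOD_FILE
promote_if_verified() {
    working="$1"
    expected="$2"
    last_good="$3"

    # Checksum of the file that rsync has just finished updating.
    actual="$(sha256sum "$working" | cut -d' ' -f1)"

    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch, keeping $last_good" >&2
        return 1
    fi

    # Copy to a temporary name first, then rename, so that the last
    # good copy is never left half-written.
    cp "$working" "$last_good.tmp" && mv "$last_good.tmp" "$last_good"
}
```

Keeping two full copies doubles the storage needed on the backup server, which is the price of always having one usable image.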

This is a read-only archived version of wiki.centos.org