Rsync And Differential Backups


Hi list, how do I perform a differential backup using rsync?

On the web there is great confusion about the differential backup concept when searched together with rsync.

Some users say "diff" because rsync copies only differences. For me, a differential backup is a backup of everything changed since the last full backup.

Other users say that to perform a differential backup I must include --backup --backup-dir=/some/path in the rsync command, but from the rsync manual page:

#############
--backup-dir=DIR
In combination with the --backup option, this tells rsync to store all backups in the specified directory on the receiving side. This can be used for incremental backups. You can additionally specify a backup suffix using the --suffix option (otherwise the files backed up in the specified directory will keep their original filenames).
….
###################

Then at this point, I can perform a full backup by copying the base dir after the last incremental. I can perform an incremental backup by saving changes to a specified destdir (using --backup-dir).

How can I perform a differential backup?

I know that rsync checks differences using "the base dir". This dir has "the same content" as the backed-up source, and incrementals are made against this base. Suppose I have 500 GB of data on the source. I make/sync the 500 GB base-dir. Running a full backup (the result must be a fullbackup.tar.gz), at the end of the process I have a 500 GB base-dir and a roughly 500 GB compressed .tar.gz. Is it correct to make a full backup by first running an incremental sync against the base-dir and then compressing it into a .tar.gz? Or is it better to resync the whole source into an alternative destdir?

In this example I've spent double the space for a full and a base-dir: a 500 GB source vs 1 TB for the base-dir plus the full.tar.gz. Is there a way to perform the other operations (incremental and differential) without keeping the base dir, to save disk space?

Thanks in advance.

45 thoughts on - Rsync And Differential Backups

  • Hi

    For backups with rsync I recommend you follow the approach discussed on the website below. It gives you everything for a full backup and then the incremental ones (deltas) using rsync; a sketch of the rotation follows this reply. The only thing you need is that the hosting filesystem supports hard links:

    http://www.mikerubel.org/computers/rsync_snapshots/

    Cheers, Roberto Nebot

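    A minimal sketch of that hard-link rotation (directory names are just examples, and the source path is hypothetical):

    # drop the oldest snapshot and shift the rest by one
    rm -rf /backups/daily.6
    mv /backups/daily.5 /backups/daily.6
    mv /backups/daily.4 /backups/daily.5
    mv /backups/daily.3 /backups/daily.4
    mv /backups/daily.2 /backups/daily.3
    mv /backups/daily.1 /backups/daily.2
    # hard-link the newest snapshot so unchanged files share disk space
    cp -al /backups/daily.0 /backups/daily.1
    # bring daily.0 up to date; only changed files consume new space
    rsync -a --delete /source/ /backups/daily.0/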

  • Differential comes from real backup systems. Rsync is much simpler IMHO:
    the "-b" backup flag only keeps the older version of a changed or deleted file/directory with an extra "~" (or whatever you define) in its name. Making rsync behave as a full-blown backup system is too time consuming. Much less time consuming is to just install some backup software. BackupPC is what I would recommend for a simple case like I understand yours is. Bacula would be my choice when I need an enterprise-level system.

    Just my $0.02.

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • rsync backups are always incremental against the most recent backup
    (assuming you’re copying to the same location).

    I don’t see the distinction you’re making.

    rsync examines each file. If you specify --delete, files that are in the destination but not the source will be removed. Generally, files that match last-modified-time and size will not be copied, but flags like -c change the criteria for determining whether a file needs to be copied. Files which do not match will be copied using an efficient algorithm to send the minimum amount of data (just the changes in the file) from the source to the destination.

    You probably only need to use --backup-dir on systems which don't have GNU cp. On systems with GNU cp, differential backups normally do something like:

    cp -a daily.0 daily.1
    rsync -a --delete source/ daily.0/

    Whereas with --backup-dir, you can use rsync to do both tasks in one command (a sketch follows this reply), but your directory layout is a little messier.

    Save yourself a lot of trouble and use a front-end like rsnapshot or BackupPC.
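
    For comparison, a rough sketch of the --backup-dir variant described above, which does both steps in one rsync run (all paths hypothetical): the mirror lives in current/, and files that changed or were deleted are tucked away under a dated directory.

    TODAY=$(date +%Y-%m-%d)
    rsync -a --delete \
          --backup --backup-dir=/backups/changed-$TODAY \
          /source/ /backups/current/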

  • an incremental backup copies everything since the last incremental

    a differential copies everything since the last full.

    rsync is NOT a backup system, it's just an incremental file copy

    with the full/incremental/differential approach, a restore to a given date would need to restore the last full, then the last differential, then any incrementals since that differential; for instance, you might do monthly fulls, weekly differentials and daily incrementals. If you don't use differentials, then you'd have to restore every incremental since that last full, which in a monthly-full, daily-incremental scenario could be as many as 30 incrementals.

  • I guess that makes sense, but in backup systems based on rsync and hard links (such as rsnapshot), *every* backup on the backup volume is a
    “full” backup, so incremental and differential are the same thing.

    ..which can be used as a component of a backup system, such as rsnapshot or BackupPC.

  • I beg to differ.

    The rsync command is a fantastic backup system. It may not meet your needs, but it works really great to make different types of backups for me. I have a script I use (automate everything) to perform nightly backups with rsync. Using rsync with USB external hard drives works far better than any other backup system I have ever tried.

    As for your other statements, they may be meaningful to you and that is OK, but to me are just so much irrelevant semantics. If one’s backup system works, terminology and which commands used to achieve it are beside the point – it is a true backup system.

  • Gordon Messmer wrote:

    Actually, we use rsync for backups. We have a script that creates a new daily directory… and uses hard links to previous dates. That way, it looks like a full b/u… but you can go to a previous date to restore an older version of the file (aka ACK! I saved that file full of garbage to my Great American Novel filename! ).

    And if you aren’t familiar with hard links, which rsync happily creates, they were certainly hard enough to wrap my head around, until I got it… and really like them. Just note that they *must* be on one filesystem, as opposed to symlinks, which can cross filesystems.

    mark

  • More than one filename for a particular file. What’s difficult about that?

    Obviously, since a hard link is part of the file and directory structure of the filesystem.

  • I wonder how the filesystem behaves when almost every file has some 400 hard links to it (thinking in terms of a year's worth of daily backups).

    Valeri


  • cp -al daily.0 daily.1

    All these can be combined with an rsyncd module to allow read only root access to a remote system excluding the dirs you don’t normally want to be backed up like /proc, /var/lib/mysql, /var/lib/libvirt, …

    Oops… My provider email gateway has been blacklisted by anti spam vigilantes.

  • I think the difficult part is that so many people don’t understand that EVERY regular file is a hard link. It doesn’t mean “more than one” at all. A hard link is the association between a directory entry
    (filename) and an inode in the filesystem.

  • Why do you think that would be a problem?

    Most inodes have one hard link. When that link is removed, the link count in the inode is decremented (inodes are reference-counted, you can see their ref count in “ls -l” output). When the link count reaches 0
    and no open file descriptors exist, the inode is removed.

    Creating more hard links just increases the ref count. That’s it. It’s not a weird special case.
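
    A quick illustration of that reference counting (file names are just examples):

    touch file1
    ln file1 file2          # a second name for the same inode
    ls -li file1 file2      # same inode number, link count is now 2
    rm file1                # count drops to 1; the data stays reachable as file2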

  • XFS handles this fine. I have a BackupPC storage pool with backups of
    27 servers going back a year… now, I just have 30 days of incrementals, and 12 months of fulls, but in BackupPC’s implementation the distinction between incremental and full is quite blurred as both are fully deduped across the whole pool via use of hard links.

    * Pool is 5510.40GB comprising 9993293 files and 4369 directories (as of 11/9 02:08),
    * Pool hashing gives 3452 repeated files with longest chain 54,
    * Nightly cleanup removed 737 files of size 1.64GB (around 11/9 02:08),
    * Pool file system was recently at 35% (11/9 11:44), today's max is 35% (11/9 01:00) and yesterday's max was 36%.

    There are 27 hosts that have been backed up, for a total of:

    * 441 full backups of total size 71125.43GB (prior to pooling and compression),
    * 623 incr backups of total size 20775.88GB (prior to pooling and compression).

    so 90+TB of backups take 5.5TB of actual space.

  • Probably not. You are not impacting something that has a notably finite count (like the inode count on a given fs). You just use a bit more disk space for metadata, which is nothing (space-wise) compared to the data (the files themselves). Thanks!

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • Now that you point that out, I agree. I never thought about it that way before since I’ve always looked at a hard link as a link that you create after you create the initial file, though they become interchangeable after that.

    But you’re absolutely right and I’ve learned something today. Thanks!

  • on Unix systems, the actual 'file' is known as an inode, and is identified by an inode number. Directories are other files that contain indexed directory entries with filenames pointing to these inodes.

    the tricky thing with hard links is, you have to walk the whole directory tree of a given file system to find every entry pointing to the same inode if you want to identify these links.
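
    With GNU find, for example, you can locate every name pointing at a given inode while staying on one filesystem (paths are placeholders):

    # by reference file
    find /backups -xdev -samefile /backups/daily.0/etc/hosts
    # or by inode number
    ino=$(stat -c %i /backups/daily.0/etc/hosts)
    find /backups -xdev -inum "$ino"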

  • Valeri Galtsev wrote:

    That, I can’t answer – what we have is “disaster recovery”, not “archive”, so we only keep them for no more than five weeks.

    On the other hand… a reasonable approach would be, for backups more than about two months old, to keep the first of the month and rm everything else for that month.

    mark

  • Ciao Alessandro,

    Which is basically the same… if you always use your last full backup as the "base" directory. Use rsync's --link-dest option to achieve this (a sketch follows this reply). Nice thing: unchanged files will just be hardlinked to the original files and won't use additional disk space, but still each dataset is a complete backup. There is no need to combine several incremental or differential backups to restore a certain state.

    Mike Rubel's page has already been mentioned. On http://www.drosera.ch/frank/computer/rsync.html I describe an alternate mechanism (using the above-mentioned --link-dest and an rsync server) which overcomes some of the, imho, shortcomings of Mike's setup.

    And: rsync is a fan-tas-tic backup tool ;-)

    HTH
    Frank
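
    A minimal sketch of the --link-dest approach described above, assuming dated snapshot directories (names and paths hypothetical):

    TODAY=$(date +%Y-%m-%d)
    YESTERDAY=$(date -d yesterday +%Y-%m-%d)
    rsync -a --delete \
          --link-dest=/backups/$YESTERDAY \
          /source/ /backups/$TODAY/
    # unchanged files become hard links into yesterday's snapshot; only changed
    # files take new space, yet /backups/$TODAY is a complete tree on its own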

  • I’m sure you know this already, but for those who may not, be sure to mount your XFS filesystem with the inode64 option. Otherwise XFS will try to save all of its inodes in the first 1TB of space, and with so many inodes needed, you may run out more quickly than you anticipate. Then you’ll have “no space left on device” errors when df reports plenty of space (at least till you do df -i; actually I’m not 100% sure df -i will show it).

    –keith
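
    For reference, checking inode usage and mounting with inode64 looks roughly like this (device and mount point are placeholders):

    df -i /backups                        # inode usage on the backup filesystem
    umount /backups
    mount -o inode64 /dev/sdb1 /backups   # XFS: allow inodes beyond the first 1 TB
    # or persistently in /etc/fstab:
    # /dev/sdb1  /backups  xfs  defaults,inode64  0 0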

  • I'm fully with you on -o inode64, but I would think it is not the inode number that becomes large with extensive use of hard links, but the space used by directory data, thus requiring these directories to be relocated once they exceed some size, so ultimately some of them will be pushed beyond the 1 TB border (depending on how the filesystem is used). Someone correct me if I'm wrong.

    Valeri


  • You can use the "-newer" options of the find command and pass the file list to rsync or scp to "back up" only those files that have changed since the last run. You can keep a file like .lastbackup and timestamp it (touch) at the start of the backup process. On the next backup you compare the current timestamp with the timestamp on this file.

    HTH,
    — Arun Khan
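
    A rough sketch of that find-based approach (paths and host are hypothetical); as clarified a bit further down, for a differential scheme you only advance the reference timestamp when a full backup is taken:

    cd /data
    touch /var/backups/.thisrun
    find . -type f -newer /var/backups/.lastbackup -print0 |
        rsync -a --from0 --files-from=- . backuphost:/backups/diff-$(date +%F)/
    # differential: only move the reference forward after a FULL backup
    # mv /var/backups/.thisrun /var/backups/.lastbackup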

  • Folks

    I have been using rsnapshot for years now. The only problem I've found is that it is possible to run out of inodes. So my heads-up is that when you create the file system, ensure you have more than the default number of inodes; I usually multiply the default by 10 (see the sketch after this reply). Otherwise you can find your 1 TB USB drive failing after 259 MB and you can't then recover the files. Rather embarrassing.

    Best wishes

    John

    John Logsdon Quantex Research Ltd
    +44 161 445 4951/+44 7717758675
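
    On ext3/ext4 that means something like the following at mkfs time (the ratio is only an example; verify with df -i after mounting):

    # one inode per 4 KiB instead of the default 16 KiB, i.e. roughly 4x as many
    mkfs.ext4 -i 4096 /dev/sdb1
    df -i /mnt/backup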

  • Clarification — for differential backups, you should touch the file only when you do the *full* backup.

    — Arun Khan

  • Thanks John – I haven’t used XFS.

    This issue arose on ext3 I think some years ago on a rather elderly system. If XFS avoids this that’s great but if someone is still using legacy systems, they need to be warned!

    Best wishes

    John

    John Logsdon Quantex Research Ltd
    +44 161 445 4951/+44 7717758675

  • Alessandro Baggi wrote:

    I think the answer to this question is Rsnapshot, which is an old and well proven tool: http://rsnapshot.org/. To quote the homepage:

    It’s very easy to set up Rsnapshot with a simple configuration file. You can define hourly, daily, weekly and monthly backups. You run Rsnapshot from crontab and it’s all automatic.

    We run Rsnapshot on the backup server which NFS-mounts the source file system as read-only (NFS read operations are low overhead). You may also use SSH.

    The backup server has XFS filesystems with the inode64 mount option (no inode problems :-). Our backup filesystems vary from 1-40 TB in size with 100k to 25M inodes. For very large source filesystems you may want to use 10 Gbit/s links to speed up the backups.

    The only issue I see is with source filesystems containing tens of millions of files, since new inodes have to be allocated on the backup server to “snapshot” a filesystem. The time for Rsnapshot to do
    “snapshots” can become substantial, depending on the speed of the underlying filesystem.
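
    A minimal rsnapshot.conf sketch (fields must be separated by tabs; paths and retention counts are placeholders, and older versions use "interval" instead of "retain"):

    snapshot_root   /backups/snapshots/
    retain  daily   7
    retain  weekly  4
    retain  monthly 6
    backup  /home/  localhost/
    backup  /etc/   localhost/

    And matching crontab entries, with the larger intervals rotated shortly before the daily run:

    30 2 1 * *   /usr/bin/rsnapshot monthly
    0  3 * * 1   /usr/bin/rsnapshot weekly
    30 3 * * *   /usr/bin/rsnapshot daily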

  • Absolutely none of that is necessary with rsync, and the process you described is likely to miss files that are modified while “find” runs.

    If you’re going to use rsync to make backups, just use a frontend like rsnapshot or BackupPC.

  • Well, be fair, rsync can also miss files if files are changing while the backup occurs. Once rsync has passed through a given section of the tree, it will not see any subsequent changes.

    If you need guaranteed-complete filesystem-level snapshots, you need to be using something at the kernel level that can atomically collect the set of modified blocks/files, rather than something that crawls the tree in user space.

    On the BSD Now podcast, they recently told a war story about moving one of the main FreeBSD servers to a new data center. rsync was taking 21 hours in back-to-back runs purely due to the number of files on that server, which gave plenty of time for files to change since the last run.

    Solution? ZFS send:

    http://128bitstudios.com/2010/07/23/fun-with-zfs-send-and-receive/
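
    The send/receive replication mentioned there looks roughly like this (pool, dataset and host names are made up):

    zfs snapshot tank/data@2015-11-10
    # first run: send the whole dataset
    zfs send tank/data@2015-11-10 | ssh backuphost zfs receive backup/data
    # later runs: send only the blocks changed since the previous snapshot
    zfs send -i tank/data@2015-11-09 tank/data@2015-11-10 | \
        ssh backuphost zfs receive backup/data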

  • I think you miss my meaning. Consider this sequence of events:

    * “find” begins and processes dirA and then dirB
    * another application writes files in dirA
    * “find” completes
    * a new timestamp file is written

    Now, the new file in dirA wasn’t seen by find during this run, and it won’t be seen on the next run either. That’s what I mean by missed.
    Not temporarily missed, but permanently. That file won't ever be backed up in this very naïve approach.

  • That's plain bad system analysis. Read the start date, record the current date, and THEN start processing. You will get the odd extra file but will not lose any.


  • If I may, I’d like to put in a plug for ZFS:

    Combining rsync and ZFS, you can rsync, then make a ZFS snapshot, which gives you the best of both worlds:

    1) No messy filesystem with multiple directories full of hardlinks to manage.
    2) Immutable backups.
    3) Crazy efficient storage space, including built-in compression. Much more efficient than rsync + hard links.
    4) Ability to send the entire filesystem (binary perfect) to another system.
    5) Ability to upgrade and add storage space without taking it offline.
    6) Ability to “restore” a snapshot to read/write status in seconds with a clone that you can throw away later just as easily.
    7) Or you can skip rsync, do the snapshots on the source server, and replicate the snapshots with send/receive.
    8) Uses inexpensive, commodity hardware.

    … and on and on….

    We’ve moved *all* our backups to ZFS, the benefits are just too many. I’d like to plug BTRFS in a similar vein, but it’s “not yet production ready” and it’s been that way for a long, long time…

    Ben S
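
    The rsync-then-snapshot workflow described above is roughly (dataset and path names hypothetical):

    rsync -a --delete /source/ /tank-backup/data/
    zfs snapshot tank-backup/data@$(date +%Y-%m-%d)
    zfs list -t snapshot                 # each snapshot is a point-in-time copy
    # "restore" one to read/write in seconds via a throwaway clone:
    zfs clone tank-backup/data@2015-11-01 tank-backup/restore-test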

  • <.... snip ...>

    A good systems analysis is a must in whatever one does. Be it system admin, software developer, accountant, lawyer etc.

    My suggestion about using "find" was in response to the OP's question/clarification on incremental/differential backup, and I assumed due diligence with respect to designing the script.


    how to perform a differential backup using rsync?

    On the web there is great confusion about the differential backup concept when searched together with rsync.

    rsync will do incremental backups as already discussed earlier in this thread.

    Please suggest how to achieve a differential backup with rsync (the original query).

    Thanks,
    — Arun Khan

  • Already answered. Under rsync based backup systems like rsnapshot, every backup is a full backup. Therefore, incremental and differential backups are the same thing. As you already understand that rsync will do incremental backups without using find, you also understand that it will do differential backups without using find.

  • I did exactly this with ZFS on Linux and cut over 24 hours of backup lag to just minutes.

    If you’re managing data at scale, ZFS just rocks…

  • FFS, don’t do the latter. LVM is the standard filesystem backing for Red Hat and CentOS systems, and fully supports consistent snapshots without doing half-ass shit like breaking a RAID volume.

    Breaking a RAID volume doesn’t make filesystems consistent, so when you try to mount it, you might have a corrupt filesystem, or corrupt data.
    Breaking the RAID will duplicate UUIDs of filesystems and the name of volume groups. There are a whole bunch of configurations where it just won’t work. At best, it’s unreliable. Never do this. Don’t advise other people to do it. Use LVM snapshots (or ZFS if that’s an option for you).
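
    A minimal LVM snapshot sketch (VG/LV names, sizes and paths are placeholders; the nouuid option applies when the origin filesystem is XFS):

    lvcreate -s -L 5G -n data-snap /dev/vg0/data      # COW snapshot of the origin LV
    mount -o ro,nouuid /dev/vg0/data-snap /mnt/snap   # XFS needs nouuid here
    rsync -a /mnt/snap/ /backups/daily.0/
    umount /mnt/snap
    lvremove -f /dev/vg0/data-snap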

  • Maybe I should have been clearer: use (LVM) OR (RAID1 and break). Don’t use LVM and break, that would be silly.

    I hope I'm wrong, but you wouldn't be thinking of mounting the broken-out copy on the same system would you? You must never do that, not even during disaster recovery. Use dd or similar on the disk, not the mounted partitions – isn't that obvious? I wasn't trying to give step by step instructions.

    Way before LVM existed we used this technique to back up VAXes (and later Alphas) under VMS using "volume shadowing" (ie RAID1). It worked quite happily for several years with disks shared across the cluster. IIRC it was actually recommended by DEC, indeed a selling point, but I don't have any manuals to hand to confirm that nowadays! One thing I did omit was that you MUST sync first (there was an equivalent VMS command, don't ask me now), and also ensure that as the disks are added back a full catch-up copy occurs. You may consider it half a mule's droppings, but it is, after all, what happens if you lose a spindle and hot replace.

  • I took your meaning. I’m saying that’s a terrible backup strategy, for a list of reasons.

    For instance, it only works if you mirror a single disk. It doesn’t work if you use RAID10 or RAID5, or RAID6, or RAIDZ, etc. Breaking RAID
    doesn’t make the data consistent, so you might have corrupt files
    (especially if the system runs any kind of database. SQL, LDAP, etc).
    It doesn’t make the filesystem consistent, so you might have a corrupt filesystem.

    Even if you ignore the potential for corruption, you have a backup process that only works on some specific hardware configurations. Everything else has to have a different backup solution. That’s insane. Use one backup process that works for everything. You’re much more likely to consistently back up your data that way.

    Well, that’s *one* of the problems with your advice. Even if we ignore the fact that it doesn’t work reliably (and IMO, it therefore doesn’t work), it’s far more complicated than you pretend it is.

    Because now you're talking about quiescing your services, breaking your RAID, physically removing the drive, connecting it to another system, fscking the filesystems, mounting them, and backing up the data. For each backup. Every day.

    Or using ‘dd’ and… backing up the whole image? No incremental or differentials?

    Your process involves a human being doing physical tasks as part of the backup. Maybe I’m the only one, but I want my backups fully automated.
    People make mistakes. I don’t want them involved in regular processes.
    In fact, the entire point of computing is that the computer should do the work so that I don’t have to.

    sync flushes the OS data buffers to disk, but it does not sync application data buffers, it does not flush the journal, it doesn’t make filesystems “clean”, and even if you break the RAID volume immediately after “sync” there’s no guarantee that there weren’t cached writes from other processes in between those two steps.

    There is absolutely no way to make this a reliable process without a full shutdown.

  • That’s just being picky for the sake of it. A backup is a *point-in-time* snapshot of the files being backed up. It will not capture files modified after that point.

    So, saying that find won’t find files modified while the backup is running is frankly the same as saying it won’t find files modified anytime in the future after that *point-in-time* when the backup started!

    If there’s a point to be made by the quoted statement above, I missed it and I surely deserve to be educated!

    ak.

  • Have a coffee or a beer, breathe deeply, then:

    That of course is exactly why I said RAID1.

    Breaking RAID

    Possibly, but that is another problem altogether. Any low level backup will do the same. You need to have an understanding of the filesystem to handle filesystem problems. Even if the utility understands the filesystem you have problems with open files such as databases.

    More generally, for anything except a trivial database you should use the database to dump itself, for instance using mysqldump (a sketch follows this reply). Have a look at the page https://mariadb.com/kb/en/mariadb/backup-and-restore-overview/ for (as it says) an overview. Try running a database backup timed to complete before your normal filesystem backups run, whatever method you use.

    Remember that this is a last resort if (1) the user can't accept more sensible backups and handle (or let the backup handle) the dates safely; (2) the user insists on a snapshot; (3) the user can't use a filesystem snapshot (ZFS, GPFS etc); and (4) the user can't/won't use LVM. You can't refuse to use "better solutions" and then complain that the last resort is not as good as the "better solutions"!

    No need to remove the drive if you handle the whole disk. When we used this technique we only did it monthly – it would be pretty crazy to do level 0 backups daily.

    See the previous.

    See the comments about using better solutions. I’d be worried though if you use a solution that doesn’t remove the backup media from the vicinity of the machine. Fine if you have a remote site, but otherwise you still need a person to physically take the tapes (or whatever) out of the machine room to fireproof storage. That’s pretty manual.

    The journal is a fair point if it is stored on a separate spindle, as for instance is possible under XFS.

    Not IME. At that date the preferred method for monthly backups was a shutdown and standalone utility for disk-disk copies, but that was not always possible. The technique worked.

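    For the database-dump step mentioned above, a typical pre-backup dump might look like this (credentials, paths and schedule are placeholders):

    # dump all databases consistently (InnoDB) without long table locks
    mysqldump --single-transaction --all-databases -u backup -p'secret' \
        > /var/backups/mysql-$(date +%F).sql
    gzip /var/backups/mysql-$(date +%F).sql
    # the regular filesystem backup then picks up the compressed dump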

  • While using LVM arranges for some filesystems to be consistent (it is not always possible), it does nothing to ensure application consistency, which can be just as important. Linux doesn't have a widely deployed analog to Windows' VSS, which provides both, though only for applications that cooperate. On Linux you must arrange to quiesce applications yourself, which is seldom possible.

    Making an LVM snapshot duplicates UUIDs (and LABELs) too; the whole LV is the same in the snapshot as it was in the source. There are ways to cope with that for XFS (I usually use mount -ro nouuid); ext2/3/4 doesn't care (so just mount -r for them). If the original filesystem isn't yet mounted then a mount by uuid (or label) would not be pretty for either. And that's just two filesystems; others are supported and they too will potentially have issues.

    /mark

  • I know. And I was trying to make the point that the process of breaking RAID1 for backup purposes is inflexible in addition to being unreliable. Users should not have to re-engineer their backup system for every hardware configuration.

    If you were to attempt a block-level backup of the raw device, then yes, you would have similar problems. But since that is insane, and no one is suggesting that process, I didn’t feel the need to address it.

    There *are* tools that exist to dump filesystems, but they’re not intended to be used for backup, and they won’t operate on mounted filesystems. For instance, clonezilla includes tools to dump ext4 and ntfs filesystems for the purpose of cloning a system. You could treat that as a backup, but you have to shut down the host OS to boot clonezilla.

    Uhh…. no. I'd argue the opposite. You should only use DB dump tools for trivial databases (or in some cases, such as PostgreSQL, upgrades). Dumping a database is *slow*. The only thing slower than dumping a database is restoring a database dump. If you have a non-trivial database, you definitely want to quiesce, snapshot, resume, and back up the snapshot.

    Again, you seem entirely too willing to accept unreliable processes.
    Timing? You should absolutely, under no circumstances, trust the timing of two processes to not overlap. If you're dumping data, you should either trigger the backup from the dump job, after it completes, or you should employ a locking system so that only one of the two processes can run at a time.

    No one is refusing better solutions. You are tilting at windmills.

    We agree, there. You should have backups in a physically separate location.

  • Can you explain what you mean? The standard filesystems, ext4 and XFS, will both be made consistent when making an LVM snapshot.

    I know. That’s why I wrote snapshot:
    https://bitbucket.org/gordonmessmer/dragonsdawn-snapshot

    I have not found that to be true. Examples?

    The VG name is the bigger problem. If you tried to activate the VG in the broken RAID1 component, Very Bad Things(TM) would happen.