Rsync Question: Building List Taking Forever

Guys,

I’ve set up an rsync between two directories that I’ve mounted locally on a jump box. Long story short, the two directories are both NFS shares from two different hosts. Our security dept won’t allow us to SSH directly between the two data centers, but the jump host can contact both. So what I’ve done is mount the NFS shares from one host in each data center on the jump box using SSHFS.

The directory I’m trying to rsync from has 111GB of data in it. I don’t think I’ve ever set up an rsync for quite so much data before.

But I started the rsync at approx. 7pm last night, and as of now it is still building its file list.

[root@sshproxygw ~]# rsync -avzp /mnt/db_space/timd/www1/ /mnt/db_space/timd/www2/svn2/
building file list …

So my question to you is: is this a normal amount of time to wait for the file list to be built, considering the amount of data involved? I have it running in a screen session that I can attach to if I want to check what’s going on.

Thanks Tim

19 thoughts on “Rsync Question: Building List Taking Forever”

  • 2014-10-19 18:55 GMT+03:00 Tim Dunphy:

    Regenerating the rsync file list can take a very long time :/

    Make sure that you are using a fast link on both ends and at least version 3 of rsync.
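
    To check which version you have on each end (rsync 3.0 and later build the file list incrementally rather than all up front, which usually shortens this wait):

    rsync --version | head -1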

  • Hmm, you mean specify TCP for rsync? I thought that’s the default. But holy crap, you were right about it taking a long time to build a file list! The rsync just started transferring a few minutes ago…!

    dumps/dotmedia.031237.svndmp
    dumps/dotmedia.031238.svndmp
    dumps/dotmedia.031239.svndmp
    dumps/dotmedia.031240.svndmp
    dumps/dotmedia.031241.svndmp
    dumps/dotmedia.031242.svndmp
    dumps/dotmedia.031243.svndmp
    dumps/dotmedia.031244.svndmp
    dumps/dotmedia.031245.svndmp

  • Can this ‘jump host’ SSH to either of the servers? It might be worth using the rsync-over-ssh protocol on one side of this xfer, probably the destination.

    so rsync -avh… /sourcenfs/path/to… user@desthost:path

  • No, he means use TCP for NFS (which is also the default).

    I suspect that SSHFS’s relatively poor performance is having an impact on your transfer. I have a 30TB filesystem which I rsync over an OpenVPN link, and building the file list doesn’t take that long (maybe an hour?). (The links themselves are reasonably fast; if yours are not that would have a negative impact too.)

    If you have the space on the jump host, it may end up being faster to rsync over SSH (not using NFS or SSHFS) from node 1 to the jump host, then from the jump host to node 2.
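
    A sketch of that two-stage approach, run from the jump host (hostnames and the staging path are placeholders):

    # stage 1: pull over SSH from node 1 onto local disk on the jump host
    rsync -avz user@host1:/export/svn/ /var/staging/svn/

    # stage 2: push from the jump host to node 2
    rsync -avz /var/staging/svn/ user@host2:/import/svn/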

    –keith

  • Don’t forget that the time taken to build the file list is a function of the number of files present, and not their size. If you have many millions of small files, it will indeed take a very long time. Over SSHFS with a slowish link, it could be days.
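
    A quick way to see how many files are actually involved, using the source mount from the original post:

    # count the entries rsync will have to enumerate for its file list
    find /mnt/db_space/timd/www1/ | wc -l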

    Steve

  • Another option that might help is to break the transfer up into smaller pieces. We have a 3TB filesystem with a lot of small data files in some of the subdirectories, and building the file list used to take a long time (close to an hour) and impacted filesystem performance. But since the volume mount point has only directories beneath it, we were able to tweak our rsync script to iterate over the subdirectories as individual rsyncs. Not only did that isolate the directories with the large numbers of files to their own rsync instances, but an added bonus is that if a given rsync attempt fails for some reason, the script picks up at the same area and tries again (a couple of times) rather than restarting the entire filesystem rsync.
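
    A minimal sketch of that kind of per-directory loop (the paths reuse the OP’s mount points as placeholders; the retry count and structure are illustrative, not our actual script):

    #!/bin/bash
    # rsync each top-level subdirectory separately, retrying a couple of times on failure
    SRC=/mnt/db_space/timd/www1       # source mount (placeholder)
    DST=/mnt/db_space/timd/www2/svn2  # destination mount (placeholder)
    for dir in "$SRC"/*/; do
        name=$(basename "$dir")
        for attempt in 1 2 3; do
            rsync -avz "$dir" "$DST/$name/" && break
            echo "rsync of $name failed (attempt $attempt), retrying" >&2
        done
    done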

    Hope this helps!
    Miranda

  • Well, sure. My assumption is that the OP’s ~120GB of storage likely holds fewer files than my 30TB does. (I have a lot of large files, but a lot of small files too.)

    –keith

  • Ahhh, but isn’t that part of the adventure that being a Linux admin is all about? *twitch*

  • Are you “allowed” to temporarily run an SSH tunnel (or stunnel) on your jumpbox?
    So connecting from host1 to the jumpbox on port XXX would be tunneled to the SSH port on host2…
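
    A sketch of the SSH-tunnel variant (the port, hostnames, and usernames are hypothetical):

    # on the jump box: listen on port 12345 and relay connections to host2's SSH port
    # (-g lets other hosts, e.g. host1, connect to the forwarded port; -N means no remote command)
    ssh -g -N -L 12345:host2:22 user@host2

    # then from host1, rsync "through" the jump box; the SSH session actually lands on host2
    rsync -avz -e 'ssh -p 12345' /path/to/source/ user@jumpbox:/path/on/host2/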

    Or with netcat (if you can mkfifo)?
    mkfifo backpipe
    nc -l 12345 0<backpipe | nc host2 22 1>backpipe

    But you will have to trick SSH into accepting your jumpbox “fingerprint”…

    JD

  • Or, perhaps easier (depending on how paranoid the sshd configs are), use ProxyCommand in ~/.ssh/config, i.e., set up the config so one SSH command can get you logged onto the final target, and then use rsync across SSH as per normal:

    http://sshmenu.sourceforge.net/articles/transparent-mulithop.html

    Then rsync will be running on both ends, where the data (filesystem information) is LOCAL, i.e., fast.

    I would use, if possible/allowed, key[s] with ssh(-agent) to make the whole connect to multiple hosts thing easier (i.e., fewer passphrase requests).
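
    A minimal sketch of that kind of config on host1 (hostnames and usernames are placeholders; the netcat-based ProxyCommand is similar to what the linked article describes):

    # ~/.ssh/config on host1
    Host jumpbox
        HostName jumpbox.example.com
        User timd

    Host host2
        HostName host2.example.com
        User timd
        # relay the TCP connection through the jump box
        ProxyCommand ssh jumpbox nc %h %p

    # then a normal rsync-over-ssh from host1, with rsync running locally on both ends:
    rsync -avz /path/to/source/ host2:/path/to/dest/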

    [OP: `they don’t allow SSH between the datacenters` …but… they nfs between them…??? ME: much head scratching.]

  • There’s not that much magic involved. The time it takes rsync to read a directory tree to get started should approximate something like ‘find /path -links +0’ (i.e. something that has to read the directory tree and the associated inodes). Pre-3.0 rsync versions read and transfer the whole file list before starting the comparison, and might trigger swapping if you are low on RAM.
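
    As a rough gauge, timing that kind of metadata walk on the OP’s SSHFS source mount gives an idea of the minimum file-list build time:

    # a pure directory/inode scan; rsync's file-list pass should take at least this long
    time find /mnt/db_space/timd/www1/ -links +0 > /dev/null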

    So, you probably want the ‘source’ side of the transfer to be local for faster startup. But… in what universe is NFS mounting across data centers considered more secure than ssh? Or even a reasonable thing to do? How about a VPN between the two hosts?

  • OK, I don’t see how that is possible, because something would be mounted as either NFS or SSHFS, not one over the other. But if the intermediate host is allowed to SSH (as it must be for SSHFS to work), I’d throw a bunch of disk space at the problem and rsync a copy to the intermediate host, then rsync from there to the target at the other data center. Or work out a way to do port-forwarding over ssh connections from the intermediate host.

  • I’m just repeating what he wrote; perhaps the OP can elaborate.

    That was one of my suggestions earlier in the thread.

    Or (as you and perhaps I suggested) some sort of OpenVPN link.

    –keith

  • Adventure? Nah, that’s why my rsync scripts rsync chunks of the filesystem rather than all of it in one go, and why it gets to run twice each time. Once bitten, twice shy.

    Cheers,

    Cliff

  • We have a Value Added Network (VAN) provider who insists on a similar thing. We were given sftp access to our transaction files but not ssh. They also will not run an rsync daemon for us.

    We ended up locally mounting the remote host directories with fuse-ssh (sshfs)
    over sftp using RSA keys to bypass the interactive logins.
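
    A sketch of that kind of key-based sshfs mount (the host, account, key path, and directories are hypothetical):

    # mount the provider's sftp-only directory locally, authenticating with an RSA key
    # instead of an interactive login
    sshfs -o IdentityFile=/root/.ssh/van_rsa vanuser@van.example.com:/outbound /mnt/van

    # unmount when done
    fusermount -u /mnt/van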