
Rsync folder with a million files, but very small incremental daily updates


We run rsync on a large folder that contains close to a million files, including HTML, JSP, and GIF/JPG files. Of course we only need to update files incrementally: just a few JSP and HTML files change in this folder, and we need to rsync it from this server to the same folder on a different server.

Rsync has been running quite slowly lately, so one of our IT team members came up with this command:

find /usr/home/foldername \
-type f -name '*.jsp' -exec \
grep -l '<ssi:include src=[^$]*${' {} \;

This looks only for files with a .jsp extension that contain a certain kind of text, because those are the files we need to rsync. But this command consumes a lot of memory. I think it's a stupid way to rsync, but I'm being told this is how things will work.

Some googling suggests that this should work on this folder too:

rsync -a --update --progress --rsh=ssh --partial /usr/home/foldername /destination/server

I'm worried that this will be too slow on a daily basis, but I can't imagine why it would be slower than the silly find approach our IT folks are recommending. Any ideas about rsyncing large directories in the real world?


Solution

  • A find command will not be faster than the rsync scan, and the grep command must be slower than rsync because it requires reading all the text from all the .jsp files.

    The only way a find-and-grep could be faster is if

    1. The timestamps on your files do not match, so rsync has to checksum the contents (on both sides!)

      This seems unlikely, since you're using -a, which syncs the timestamps properly (because -a implies -t). However, it can happen if the filesystems on the two machines allow different timestamp precision (e.g. Linux vs. Windows), in which case the --modify-window option is what you need (see the first example after this list).

    2. There are many more files changed than the ones you care about, and rsync is transferring those also.

      If this is the case then you can limit the transfer to .jsp files like this:

      --include '*.jsp' --include '*/' --exclude '*'
      

      (Include all .jsp files and all directories, but exclude everything else; a full command using these rules is shown in the second example after this list.)

    3. rsync does the scan up front, then the compare (possibly using lots of RAM), then the transfer, whereas find/grep/copy does everything as it goes.

      This used to be a problem, but rsync ought to do an incremental recursive scan as long as both the local and remote versions are 3.0.0 or greater, and you don't use any of the fancy delete or delay options that force an up-front scan (see --recursive in the documentation). The third example below shows a quick way to check both versions.
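
    For case 1, here is a sketch of the command from the question with a timestamp tolerance added; the one-second window is an assumption, so use whatever drift your filesystems actually show:

      # Sketch: tolerate up to 1 second of timestamp drift between filesystems
      # with different precision. Paths are the placeholders from the question;
      # a remote destination would normally be written as host:/path.
      rsync -a --update --progress --partial --rsh=ssh --modify-window=1 \
          /usr/home/foldername /destination/server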
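
    For case 2, a sketch of the full command with those filter rules:

      # Sketch: transfer only .jsp files plus the directories needed to reach
      # them; everything else is excluded. Placeholder paths as in the question.
      rsync -a --update --progress --partial --rsh=ssh \
          --include '*.jsp' --include '*/' --exclude '*' \
          /usr/home/foldername /destination/server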
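
    For case 3, incremental recursion only applies when both ends run rsync 3.0.0 or newer; a quick way to check, with the remote hostname as a placeholder:

      rsync --version | head -n 1                           # local rsync version
      ssh destination-server 'rsync --version | head -n 1'  # remote rsync version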