command-line ftp wget mirroring

wget with timestamping repeatedly downloads same files


I am connecting to an FTP server with a few directories, a couple of levels deep. These directories contain various versions of the same files: the same unique filename, with different timestamps, could be in multiple directories, and there is no knowing where the latest version of each file will end up. I don't control this server and will readily admit it's a dumb situation.

I have been using wget with --timestamping to try and grab the latest version of each file, with the --no-directories option to squeeze it all into one set of latest files. In my head, this should just magically end up with the latest version of every file appearing once in a single place, despite recursing over all the directories on the server. However, I am noticing that files are frequently being redownloaded, even though I have manually verified that the local timestamps are identical to those on the FTP server.

Is there something about --no-directories that interferes with wget's timestamping?

The command line I am issuing is this:

wget -q --show-progress --no-directories -r -N -l inf ftp://user:password@ftp.example.com/

If I target just a single directory like this, the behaviour is as I'd expect (for the subset of files within that directory):

wget -q --show-progress --no-directories -r -N -l 1 ftp://user:password@ftp.example.com/subdir/

But the moment I try and mirror from the root the timestamps seem to go out the window.


Solution

  • The answer is that wget's timestamping doesn't only care about time. It also compares file sizes, and treats any size difference as worth re-downloading, whatever the timestamps say. So in my case, with different versions of the same file in multiple directories (and assuming those versions differ in size): if you have the newer file, it'll download the older one, and once you have the older file, it'll download the newer one. A recursive download therefore overwrites the same file several times with effectively random versions, leaving you very unlikely to end up with the latest version of any particular file.

    This seems like an awful betrayal of the user's intuition even though it's technically mentioned in the wget docs (in some places but not others), but there you go. Timestamping has little to do with timestamps.
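
To make the ping-pong effect concrete, here is a minimal sketch of the decision as described above. It is a simplified model, not wget's actual code: the function name should_download, the file name report.csv, and the sizes and timestamps are all made up for illustration.

```shell
# Simplified model of the -N decision described above; not wget's real logic.
# Arguments: local_size local_mtime remote_size remote_mtime
# (sizes in bytes, mtimes as plain integers such as epoch seconds).
should_download() {
    local_size=$1; local_mtime=$2; remote_size=$3; remote_mtime=$4
    if [ "$local_size" -ne "$remote_size" ]; then
        echo download   # size mismatch wins, whatever the timestamps say
    elif [ "$remote_mtime" -gt "$local_mtime" ]; then
        echo download   # same size, remote copy is newer
    else
        echo skip       # same size, remote copy is not newer
    fi
}

# Two hypothetical versions of report.csv on the server:
#   /new/report.csv  2000 bytes, mtime 200
#   /old/report.csv  1500 bytes, mtime 100
should_download 2000 200 1500 100   # holding the new copy, visiting /old
should_download 1500 100 2000 200   # holding the old copy, visiting /new
```

Both calls print "download": whichever copy is held locally, the other directory's copy replaces it, so with --no-directories the surviving version is simply the one from the last directory visited.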