bash, shell, deduplication

bash scripting de-dupe


I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.

This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.

Easiest way to do this?

Thanks!


Solution

  • Do you really need to compress the file?
    wget provides -N, --timestamping, which turns on time-stamping. Say your file is located at www.example.com/file.txt.

    The first time you do:

    $ wget -N www.example.com/file.txt
    [...]
    [...] file.txt saved [..size..]
    

    The next time it'll be like this:

    $ wget -N www.example.com/file.txt
    Server file no newer than local file “file.txt” -- not retrieving.
    

    That is, unless the file on the server has been updated.

    That would solve your problem if you didn't compress the file.
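    In that case the whole cron job could shrink to something like this (a minimal sketch; the URL and the dated-copy step are assumptions based on your description, and GNU stat is assumed for the mtime check):

    #!/bin/bash
    # -N only re-downloads when the server copy is newer, and it sets the
    # local file's mtime to the server's Last-Modified time.
    before=$(stat -c %Y file.txt 2>/dev/null)   # empty on the first run
    wget -q -N www.example.com/file.txt
    after=$(stat -c %Y file.txt 2>/dev/null)
    if [[ $before != $after ]]; then
        # The file actually changed: keep a timestamped copy, as your job does now.
        cp file.txt "file-$(date +%Y%m%d).txt"
    fi
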
    If you really need to compress it, then I'd compare the hash of the new file against the old one. What matters in that case: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file (see the sketch below)? Does any of this beat simply overwriting the old file?

    Only you know that, so run some tests.
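
    For instance, the store-the-old-hash variant could look like this (a sketch, not from the original answer; file.txt.sha256 is a hypothetical sidecar file):

    wget -q www.example.com/file.txt -O file.txt
    # Compare the fresh download's hash against the one saved last run.
    newsum="$(sha256sum file.txt | awk '{print $1}')"
    oldsum="$(cat file.txt.sha256 2>/dev/null)"   # empty on the first run
    if [[ $newsum != $oldsum ]]; then
        echo "$newsum" > file.txt.sha256          # remember it for next time
        xz -f file.txt                            # new content: (re)compress
    else
        rm file.txt                               # unchanged: discard the download
    fi

    The sidecar file costs a few bytes but saves decompressing the old archive on every run.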


    So if you go the hash route, consider sha256 for the hashes and xz (the LZMA2 algorithm) for the compression.
    I would do something like this (in Bash):

    # Download and hash in one pass: tee writes file.txt while sha256sum reads the same stream.
    newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
    # Hash the previous run's archive contents for comparison.
    oldfilesum="$(xzcat file.txt.xz | sha256sum)"
    if [[ $newfilesum != $oldfilesum ]]; then
        xz -f file.txt # overwrite the old archive with the new compressed data
    else
        rm file.txt # contents unchanged: discard the fresh download
    fi
    

    And that's done.
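
    Note that on the very first run file.txt.xz doesn't exist yet, so xzcat fails and oldfilesum ends up empty; the sums then differ and the fresh file gets compressed anyway, which is what you want. If the xzcat error clutters your cron mail, silence it (a minor tweak, not part of the original snippet):

    oldfilesum="$(xzcat file.txt.xz 2>/dev/null | sha256sum)"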