bash, shell, deduplication

bash scripting de-dupe


I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.

This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.

Easiest way to do this?

Thanks!


Solution

  • Do you really need to compress the file?
    wget provides -N, --timestamping, which turns on time-stamping. Say your file is located at www.example.com/file.txt.

    The first time you do:

    $ wget -N www.example.com/file.txt
    [...]
    [...] file.txt saved [..size..]
    

    The next time it'll be like this:

    $ wget -N www.example.com/file.txt
    Server file no newer than local file “file.txt” -- not retrieving.
    

    That is, unless the file on the server has been updated.

    That would solve your problem if you didn't compress the file.
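    In that case the whole cron job could shrink to something like this (a minimal sketch; the URL and the dated-copy step are assumptions based on your description, and GNU stat is assumed for the mtime check):

    #!/bin/bash
    # -N only re-downloads when the server copy is newer, and it sets the
    # local file's mtime to the server's Last-Modified time.
    before=$(stat -c %Y file.txt 2>/dev/null)   # empty on the first run
    wget -q -N www.example.com/file.txt
    after=$(stat -c %Y file.txt 2>/dev/null)
    if [[ $before != $after ]]; then
        # The file actually changed: keep a timestamped copy, as your job does now.
        cp file.txt "file-$(date +%Y%m%d).txt"
    fi
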
    If you really need to compress it, then I'd compare the hash of the new file against the old one. What matters in that case: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file (see the sketch below)? Does any of this beat simply overwriting the old file?

    Only you know that, so run some tests.
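
    For instance, the store-the-old-hash variant could look like this (a sketch, not from the original answer; file.txt.sha256 is a hypothetical sidecar file):

    wget -q www.example.com/file.txt -O file.txt
    # Compare the fresh download's hash against the one saved last run.
    newsum="$(sha256sum file.txt | awk '{print $1}')"
    oldsum="$(cat file.txt.sha256 2>/dev/null)"   # empty on the first run
    if [[ $newsum != $oldsum ]]; then
        echo "$newsum" > file.txt.sha256          # remember it for next time
        xz -f file.txt                            # new content: (re)compress
    else
        rm file.txt                               # unchanged: discard the download
    fi

    The sidecar file costs a few bytes but saves decompressing the old archive on every run.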


    So if you go the hash route, consider sha256 for the hashes and xz (the LZMA2 algorithm) for the compression.
    I would do something like this (in Bash):

    # Download and hash in one pass: tee writes file.txt while sha256sum reads the same stream.
    newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
    # Hash the previous run's archive contents for comparison.
    oldfilesum="$(xzcat file.txt.xz | sha256sum)"
    if [[ $newfilesum != $oldfilesum ]]; then
        xz -f file.txt # overwrite the old archive with the new compressed data
    else
        rm file.txt # contents unchanged: discard the fresh download
    fi
    

    And that's done.
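
    Note that on the very first run file.txt.xz doesn't exist yet, so xzcat fails and oldfilesum ends up empty; the sums then differ and the fresh file gets compressed anyway, which is what you want. If the xzcat error clutters your cron mail, silence it (a minor tweak, not part of the original snippet):

    oldfilesum="$(xzcat file.txt.xz 2>/dev/null | sha256sum)"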