shell repository comparison updates cdn

How to check whether a big CDN file has been updated without storing a copy of it in the repository and without a database?


I have a big file to parse that is stored on a CDN. Every time I launch my program, it checks whether the CDN file has been updated. If so, I need to retrieve the updated information contained in the CDN file.

My current solution is to keep a copy of the CDN file in the repository and check for changes in a few steps:

  • download the CDN file locally
  • test whether the local and CDN files differ with a shell script
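# Returns 0 (true) when the local file ($1) differs from the file at URL ($2).
# Requires bash (process substitution).
are_different_current_and_remote()
{
    # diff exits 0 only when both inputs are identical, so invert its status.
    ! diff <(curl -s "$2") "$1" > /dev/null
}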
  • if so, the CDN file replaces the local one

I don't find this process very efficient, and I'm wondering what the best approach would be.

I thought about a second approach.

  • retrieve the checksum of the remote file on the CDN with curl -s http://remotefile | sha1sum, store it in a file within the repository, and compare against it on every later run to detect differences, i.e. updates.
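
A minimal sketch of this second approach, assuming bash and GNU sha1sum. The function names and the checksum file name remotefile.sha1 are hypothetical, chosen only for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: function names and file names here are illustrative placeholders.

# Print the SHA-1 of whatever arrives on stdin (hash only, no file name column).
sha1_of_stdin() {
    sha1sum | cut -d ' ' -f 1
}

# Return 0 (true) if the checksum in $1 differs from the one stored in file $2,
# or if that file does not exist yet.
checksum_changed() {
    local new_sum="$1" sum_file="$2"
    [ ! -f "$sum_file" ] || [ "$new_sum" != "$(cat "$sum_file")" ]
}

# Example wiring (note: this still downloads the whole file in order to hash it;
# use "set -o pipefail" so that a failed curl is not hashed as empty input):
#   new_sum=$(curl -fsS "http://remotefile" | sha1_of_stdin)
#   if checksum_changed "$new_sum" remotefile.sha1; then
#       printf '%s\n' "$new_sum" > remotefile.sha1   # remember the new checksum
#       # ... re-parse the CDN file here ...
#   fi
```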

I'm not a big fan of this solution either, but I see it as an improvement because it takes less space in the repository.

Do you see any better ways to do it? Thanks a lot.


Solution

  • Getting a checksum and comparing it with a locally calculated one would be the best solution. Note that in your example with curl -s you still have to download the whole file before calculating the checksum locally.

    I recommend calculating the checksum every time you update the file on the CDN and storing it alongside the file there. Some CDN providers already do that for you. Depending on what your CDN provider supports, this can take several forms, for example:

    • storing the SHA checksum in a separate file, which is much smaller and faster to download than the asset itself (so you would run curl -s https://cdn/remotefile.sha1)

    • some CDN providers calculate a checksum every time a file is uploaded and expose it as a custom header such as X-Checksum-Sha1 in the response to an HTTP HEAD request (which is fast as well, since it doesn't retrieve the file contents).

    • some CDN providers have a separate REST API for storing and retrieving metadata about files; you can use it to store a checksum, last update date, version tag or something else.
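
    To illustrate the first two options, here is a rough sketch assuming bash, curl and awk. X-Checksum-Sha1 is only the example header name used above; whether your CDN exposes such a header, and under which name, is something you need to check. The helper name header_value is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the header name and URLs are placeholders, not a real CDN's API.

# Read raw HTTP response headers on stdin and print the value of header $1.
# Header names are matched case-insensitively; trailing CRs are stripped.
header_value() {
    local name="$1"
    tr -d '\r' | awk -v name="$name" '
        BEGIN { name = tolower(name) }
        !found {
            colon = index($0, ":")
            if (colon > 0 && tolower(substr($0, 1, colon - 1)) == name) {
                value = substr($0, colon + 1)
                sub(/^[ \t]+/, "", value)   # drop whitespace after the colon
                print value
                found = 1                   # only report the first match
            }
        }'
}

# Example wiring with a HEAD request (no body is transferred):
#   remote_sum=$(curl -fsSI "https://cdn/remotefile" | header_value "X-Checksum-Sha1")
# Or with a sidecar checksum file published next to the asset:
#   remote_sum=$(curl -fsS "https://cdn/remotefile.sha1" | cut -d ' ' -f 1)
```

    Either way, only a few bytes cross the network before you decide whether the full download is needed.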

    If an integrity check is mandatory before the software can be used, I recommend similar caching on the client side too: every time you fetch a new file, calculate its checksum and store it (e.g. in a file, the registry...) so that startup is quick when no file update is needed.
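
    The client-side caching could look roughly like this. It is a sketch assuming bash and GNU sha1sum; the function name refresh_if_needed, its argument order and the file names are all hypothetical, and how you obtain the remote checksum depends on your CDN.

```shell
#!/usr/bin/env bash
# Sketch only: names and file layout are illustrative placeholders.

# refresh_if_needed REMOTE_SUM DATA_FILE SUM_FILE FETCH_CMD...
# Runs FETCH_CMD (which must write the asset to stdout) only when REMOTE_SUM
# differs from the checksum cached in SUM_FILE. The download is verified
# against REMOTE_SUM before it replaces DATA_FILE.
refresh_if_needed() {
    local remote_sum="$1" data_file="$2" sum_file="$3"
    shift 3

    # Fast path: the cached checksum matches, so nothing is downloaded.
    if [ -f "$data_file" ] && [ -f "$sum_file" ] &&
       [ "$(cat "$sum_file")" = "$remote_sum" ]; then
        return 0
    fi

    local tmp actual_sum
    tmp=$(mktemp) || return 1
    if ! "$@" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    # Integrity check on the freshly downloaded copy.
    actual_sum=$(sha1sum < "$tmp" | cut -d ' ' -f 1)
    if [ "$actual_sum" != "$remote_sum" ]; then
        rm -f "$tmp"
        return 1
    fi

    # Replace the data file, then cache the checksum for the next startup.
    mv "$tmp" "$data_file" && printf '%s\n' "$remote_sum" > "$sum_file"
}

# Example wiring:
#   refresh_if_needed "$remote_sum" data.bin data.bin.sha1 \
#       curl -fsS "https://cdn/remotefile"
```

    On a normal startup with no update, this costs one small checksum lookup and a string comparison; the big file is fetched and hashed only when the checksums differ.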