Tags: python, zip, backup, unzip

How to elegantly compare zip folder contents to unzipped folder contents


This is the scenario. I want to be able to back up the contents of a folder using a Python script. However, I want my backups to be stored in a zipped format, possibly bz2.

The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.

My process will be like this:

  1. Initiate backup
  2. Check contents of “current” folder against what is stored in the most recent zipped backup
  3. If same – then “complete”
  4. If different, then run backup, then “complete”

Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup into a temp directory to do the comparison, or is there a more elegant way of doing this? Possibly something to do with the modified date?


Solution

  • Zip files contain CRC32 checksums, and you can read them with the Python zipfile module: http://docs.python.org/2/library/zipfile.html. ZipFile.infolist() returns a list of ZipInfo objects, each with a CRC attribute. Each ZipInfo object also carries the entry's modification date (date_time).

    You can compare the checksums stored in the zip with checksums calculated for the unpacked files. You still have to read the unpacked files to compute their checksums, but you avoid having to decompress the archive.

    CRC32 is not a cryptographic checksum, but it should be enough if all you need is to detect changes.

    This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
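
The comparison described above could look roughly like the sketch below. It is not a definitive implementation: the function names (file_crc32, folder_crcs, zip_crcs, folder_matches_zip) are made up for illustration, and it assumes the archive stores paths relative to the backed-up folder with forward slashes. Empty directories are ignored.

```python
import os
import zipfile
import zlib


def file_crc32(path, chunk_size=65536):
    """Compute the CRC32 of a file on disk, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value so it matches ZipInfo.CRC.
    return crc & 0xFFFFFFFF


def folder_crcs(folder):
    """Map each file's relative path (zip-style, '/' separated) to its CRC32."""
    crcs = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, folder).replace(os.sep, "/")
            crcs[rel] = file_crc32(full)
    return crcs


def zip_crcs(zip_path):
    """Map each file entry in the archive to the CRC32 stored for it."""
    with zipfile.ZipFile(zip_path) as zf:
        return {
            info.filename: info.CRC
            for info in zf.infolist()
            if not info.filename.endswith("/")  # skip directory entries
        }


def folder_matches_zip(folder, zip_path):
    """True if both sides have the same set of files with the same CRC32s."""
    return folder_crcs(folder) == zip_crcs(zip_path)
```

Step 2 of the process then becomes a call like folder_matches_zip("current", "latest_backup.zip") (both paths are placeholders), and the backup is only run when it returns False. Comparing whole dictionaries also catches files that were added or deleted, not just files whose contents changed.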