
Why do I get 2 different binary files when I 'zip' 2 identical directories?


This is on a Mac, if it matters. zip is version 3.0 and unzip is version 6.0 (presumably the versions shipped with the OS).

I do the following:

Start with a generic 'pptx' file, unzip it into a directory, clean up the XML, then zip it up again:

unzip V1.pptx -d dir
cd dir
find . -name "*.xml" -type f -exec xmllint --output '{}' --format '{}' \;
zip -0 ../V1Orig.pptx -r *

I now have a new zip file, V1Orig.pptx. Back in the parent directory, I repeat the same steps on it:

cd ..
unzip V1Orig.pptx -d copy
cd copy
find . -name "*.xml" -type f -exec xmllint --output '{}' --format '{}' \;
zip -0 ../V1Copy.pptx -r *

If I now 'diff' the orig and copy directories, they are the same:

Common subdirectories: orig/_rels and copy/_rels
Common subdirectories: orig/docProps and copy/docProps
Common subdirectories: orig/ppt and copy/ppt

But if I diff the two pptx files, or compare their md5 checksums, they differ:

diff V1Orig.pptx V1Copy.pptx
Binary files V1Orig.pptx and V1Copy.pptx differ

ls -rtla orig
total 8
drwxr-xr-x  11 fultonm  wheel   352 10 Jan 16:49 ppt
drwxr-xr-x   5 fultonm  wheel   160 10 Jan 16:49 docProps
drwxr-xr-x   3 fultonm  wheel    96 10 Jan 16:49 _rels
drwxr-xr-x   6 fultonm  wheel   192 14 Jan 10:40 .
-rw-r--r--   1 fultonm  wheel  3212 14 Jan 10:42 [Content_Types].xml
drwxr-xr-x   8 fultonm  wheel   256 14 Jan 10:57 ..

ls -rtla copy
total 8
drwxr-xr-x   5 fultonm  wheel   160 14 Jan 10:42 docProps
drwxr-xr-x   3 fultonm  wheel    96 14 Jan 10:42 _rels
drwxr-xr-x   6 fultonm  wheel   192 14 Jan 10:42 .
drwxr-xr-x  11 fultonm  wheel   352 14 Jan 10:42 ppt
-rw-r--r--   1 fultonm  wheel  3212 14 Jan 10:42 [Content_Types].xml
drwxr-xr-x   8 fultonm  wheel   256 14 Jan 10:57 ..

Solution

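  • You can make them the same by giving every file and directory in both trees identical timestamps, and by using the -X option so that zip does not save extra file attributes (on Unix, the uid/gid and file times). zip records each entry's modification time in the archive, so identical content with different timestamps still produces different bytes.

    So for each zip command, use -rX, and in the copy directory, before the zip, do:

    find . -exec touch -r ../dir/{} {} \;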

    Why it should matter that the zip files be identical, I have no idea. What matters is that they both decompress to the same thing.
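
The recipe above can be tried end to end on throwaway data. What follows is only a minimal sketch: the directory and archive names (orig, copy, orig.zip, copy.zip), the sample file, and the fixed date are placeholders rather than anything from the question, and it assumes a find that substitutes {} inside a longer argument, as the GNU and BSD versions do. The zip step is skipped if zip is not installed.

```shell
#!/bin/sh
# Sketch of the timestamp-sync recipe on throwaway data.
# orig, copy, orig.zip and copy.zip are placeholder names.
set -eu

work=$(mktemp -d)
cd "$work"

# Two trees with identical content...
mkdir -p orig/ppt copy/ppt
printf '<a/>\n' > orig/ppt/slide1.xml
printf '<a/>\n' > copy/ppt/slide1.xml

# ...but different modification times, as in the question.
touch -t 202001101649 orig/ppt/slide1.xml orig/ppt orig

# Copy every timestamp from orig onto the matching path in copy.
(cd copy && find . -exec touch -r ../orig/{} {} \;)

# Count paths in copy that are still newer than the reference file in orig.
stale=$(cd copy && find . -newer ../orig/ppt/slide1.xml | wc -l | tr -d ' ')
echo "paths with a newer timestamp: $stale"

# Archive both trees the same way: -r recurses, -X drops the extra
# attributes (uid/gid and file times), -0 stores without compression.
if command -v zip >/dev/null 2>&1; then
    (cd orig && zip -q -0 -rX ../orig.zip *)
    (cd copy && zip -q -0 -rX ../copy.zip *)
    if cmp -s orig.zip copy.zip; then
        echo "archives are byte-identical"
    else
        echo "archives still differ"
    fi
fi
```

If the archives still differ on a real tree, the order in which zip adds the entries is probably the next thing to check, since it follows the order in which the files are encountered; listing both archives with unzip -l makes the order and the stored timestamps easy to compare.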