
Different directory structures when unzipping the same file in Ubuntu vs. Windows


I am trying to extract the contents of a zip file, which can be viewed here:

https://www.geoboundaries.org/data/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-all.zip

On Ubuntu 18.04.4, using the 'extract' option from the right-click menu, I get a folder structure from that zip file that includes all sorts of empty directories, as well as a different parent directory. If I unzip the same file using 7-Zip (on Windows or on the same Linux box), I get the expected result of 6 files.

So - what's the difference here?

(Note I already have a solution - shutil archive works - just trying to understand the different behaviors).

This is the Python code currently being used to build the zips in question:

import os
import zipfile

def zipdir(dirPath=None, zipFilePath=None, includeDirInZip=False, citeUsePath=False):
  # Default to "<dirPath>.zip" when no output path is given.
  if not zipFilePath:
    zipFilePath = dirPath + ".zip"
  if not os.path.isdir(dirPath):
    raise OSError("dirPath argument must point to a directory. "
            "'%s' does not." % dirPath)
  parentDir, dirToZip = os.path.split(dirPath)

  def trimPath(path):
    # Strip the parent directory (and, optionally, the zipped directory
    # itself) so that archive member names are relative.
    archivePath = path.replace(parentDir, "", 1)
    if parentDir:
      archivePath = archivePath.replace(os.path.sep, "", 1)
    if not includeDirInZip:
      archivePath = archivePath.replace(dirToZip + os.path.sep, "", 1)
    return os.path.normcase(archivePath)

  # The context manager closes the archive even if a write fails.
  with zipfile.ZipFile(zipFilePath, "w", compression=zipfile.ZIP_DEFLATED) as outFile:
    for (archiveDirPath, dirNames, fileNames) in os.walk(dirPath):
      for fileName in fileNames:
        # Skip the output zip itself if it lives inside dirPath.
        if fileName != os.path.basename(zipFilePath):
          filePath = os.path.join(archiveDirPath, fileName)
          outFile.write(filePath, trimPath(filePath))

    # Add the citation file at the archive root, if one was supplied.
    if citeUsePath:
      outFile.write(citeUsePath, os.path.basename(citeUsePath))

Solution

  • The zip file geoBoundaries-2_0_0-NGA-ADM1-all.zip is non-standard.

    On Linux, unzip reports 5 files with no path components:

    $ unzip -l geoBoundaries-2_0_0-NGA-ADM1-all.zip
    Archive:  geoBoundaries-2_0_0-NGA-ADM1-all.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
       374953  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-shp.zip
      1512980  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1.geojson
          804  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-metaData.json
          750  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-metaData.txt
         4656  2020-01-15 21:04   CITATION-AND-USE-geoBoundaries-2_0_0.txt
    ---------                     -------
      1894143                     5 files
    

    If I then try to extract the contents, I get a lot of warnings:

    $ unzip  geoBoundaries-2_0_0-NGA-ADM1-all.zip
    Archive:  geoBoundaries-2_0_0-NGA-ADM1-all.zip
    geoBoundaries-2_0_0-NGA-ADM1-shp.zip:  mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-shp.zip),
             continuing with "central" filename version
      inflating: geoBoundaries-2_0_0-NGA-ADM1-shp.zip
    geoBoundaries-2_0_0-NGA-ADM1.geojson:  mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1.geojson),
             continuing with "central" filename version
      inflating: geoBoundaries-2_0_0-NGA-ADM1.geojson
    geoBoundaries-2_0_0-NGA-ADM1-metaData.json:  mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-metaData.json),
             continuing with "central" filename version
      inflating: geoBoundaries-2_0_0-NGA-ADM1-metaData.json
    geoBoundaries-2_0_0-NGA-ADM1-metaData.txt:  mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-metaData.txt),
             continuing with "central" filename version
      inflating: geoBoundaries-2_0_0-NGA-ADM1-metaData.txt
    CITATION-AND-USE-geoBoundaries-2_0_0.txt:  mismatching "local" filename (tmp/CITATION-AND-USE-geoBoundaries-2_0_0.txt),
             continuing with "central" filename version
      inflating: CITATION-AND-USE-geoBoundaries-2_0_0.txt
    

    Analysis

    The details for each entry in a zip file, including the filename, are stored twice: once in a local-header, directly before the compressed data, and again in a central-header at the end of the file. So for every file stored in a zip file there will be a local-header / central-header pair. The data in each pair should be (mostly) identical.

    In this instance they are not.

    For example, the central-header entry for the first file records the name geoBoundaries-2_0_0-NGA-ADM1-shp.zip, while the matching local-header records release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-shp.zip.

    The same is true for all the entries in this zip file.
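The mismatch can also be checked programmatically. The following is a minimal sketch, not part of the original answer: it takes the central-header names from Python's `zipfile` module and reads each local-header name directly from the file, using the fixed 30-byte local-header layout described in APPNOTE.TXT. The names `LOCAL_HEADER` and `find_name_mismatches` are hypothetical, chosen for illustration.

```python
import struct
import zipfile

# Fixed-size part of a local file header (30 bytes, little-endian):
# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def find_name_mismatches(zip_path):
    """Return (central_name, local_name) pairs for entries whose names differ."""
    mismatches = []
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        # infolist() is built from the central directory only.
        for info in zf.infolist():
            raw.seek(info.header_offset)
            fields = LOCAL_HEADER.unpack(raw.read(LOCAL_HEADER.size))
            if fields[0] != b"PK\x03\x04":
                raise ValueError("no local header at offset %d" % info.header_offset)
            flags, name_len = fields[2], fields[9]
            # Bit 11 of the flags marks a UTF-8 name; otherwise assume cp437.
            encoding = "utf-8" if flags & 0x800 else "cp437"
            local_name = raw.read(name_len).decode(encoding)
            if local_name != info.orig_filename:
                mismatches.append((info.orig_filename, local_name))
    return mismatches
```

Calling `find_name_mismatches("geoBoundaries-2_0_0-NGA-ADM1-all.zip")` on a local copy of the archive should list one pair per entry if the headers disagree as the unzip warnings suggest; a well-formed archive yields an empty list.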

    Given that this is a non-standard/invalid zip file, the behaviour when unzipping depends on whether the unzipping utility takes the filenames from the central-header entries or from the equivalent local-header entries.

    It looks like Ubuntu's 'extract' option uses the local-header fields, while 7-Zip, like unzip above, uses the central-header fields.
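To illustrate the central-header strategy, here is a rough sketch that rewrites such an archive so that every entry is stored under its central-header name. It is an assumption-laden illustration, not how any of the tools above is implemented: it handles only stored and deflated, unencrypted entries, and `repack_with_central_names` and `LOCAL_HEADER` are hypothetical names.

```python
import struct
import zipfile
import zlib

# Fixed 30-byte part of a local file header, as laid out in APPNOTE.TXT.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def repack_with_central_names(src_path, dst_path):
    """Copy each entry of src_path into dst_path under its central-header name."""
    with zipfile.ZipFile(src_path) as src, open(src_path, "rb") as raw, \
            zipfile.ZipFile(dst_path, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            raw.seek(info.header_offset)
            fields = LOCAL_HEADER.unpack(raw.read(LOCAL_HEADER.size))
            if fields[0] != b"PK\x03\x04":
                raise ValueError("no local header at offset %d" % info.header_offset)
            # Skip the local name and extra field without trusting their content.
            raw.seek(fields[9] + fields[10], 1)
            # Sizes, method and CRC are taken from the central header.
            payload = raw.read(info.compress_size)
            if info.compress_type == zipfile.ZIP_DEFLATED:
                data = zlib.decompress(payload, -15)  # raw deflate stream
            elif info.compress_type == zipfile.ZIP_STORED:
                data = payload
            else:
                raise NotImplementedError("unsupported compression method")
            if zlib.crc32(data) != info.CRC:
                raise ValueError("CRC mismatch for %r" % info.filename)
            dst.writestr(info.filename, data)
```

The resulting archive has matching local and central names, so extractors that follow either header should then agree on the layout.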

    For reference, the spec for zip files is PKWARE's APPNOTE.TXT.