I am trying to extract the contents of a zip file, which can be viewed here:
https://www.geoboundaries.org/data/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-all.zip
On Ubuntu 18.04.4, using the 'Extract' option from the right-click menu, I get a folder structure from that zip file that includes all sorts of empty folders and directories, as well as a different parent. If I unzip the same file using 7-Zip (on Windows or on the same Linux box), I get the expected result of 6 files.
So - what's the difference here?
(Note: I already have a solution, shutil archive works. I am just trying to understand the different behaviors.)
This is the Python code currently being used to build the zips in question:
import os
import zipfile

def zipdir(dirPath=None, zipFilePath=None, includeDirInZip=False, citeUsePath=False):
    if not zipFilePath:
        zipFilePath = dirPath + ".zip"
    if not os.path.isdir(dirPath):
        raise OSError("dirPath argument must point to a directory. "
                      "'%s' does not." % dirPath)
    parentDir, dirToZip = os.path.split(dirPath)

    def trimPath(path):
        archivePath = path.replace(parentDir, "", 1)
        if parentDir:
            archivePath = archivePath.replace(os.path.sep, "", 1)
        if not includeDirInZip:
            archivePath = archivePath.replace(dirToZip + os.path.sep, "", 1)
        return os.path.normcase(archivePath)

    outFile = zipfile.ZipFile(zipFilePath, "w", compression=zipfile.ZIP_DEFLATED)
    for (archiveDirPath, dirNames, fileNames) in os.walk(dirPath):
        for fileName in fileNames:
            if(not fileName == zipFilePath.split("/")[-1]):
                filePath = os.path.join(archiveDirPath, fileName)
                outFile.write(filePath, trimPath(filePath))
    outFile.write(citeUsePath, os.path.basename(citeUsePath))
    outFile.close()
The zip file geoBoundaries-2_0_0-NGA-ADM1-all.zip is non-standard.
On Linux, unzip thinks that there are 5 files with no path components:
$ unzip -l geoBoundaries-2_0_0-NGA-ADM1-all.zip
Archive:  geoBoundaries-2_0_0-NGA-ADM1-all.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   374953  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-shp.zip
  1512980  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1.geojson
      804  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-metaData.json
      750  2020-01-15 21:04   geoBoundaries-2_0_0-NGA-ADM1-metaData.txt
     4656  2020-01-15 21:04   CITATION-AND-USE-geoBoundaries-2_0_0.txt
---------                     -------
  1894143                     5 files
If I then try to extract the contents I get a lot of warnings.
$ unzip geoBoundaries-2_0_0-NGA-ADM1-all.zip
Archive: geoBoundaries-2_0_0-NGA-ADM1-all.zip
geoBoundaries-2_0_0-NGA-ADM1-shp.zip: mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-shp.zip),
continuing with "central" filename version
inflating: geoBoundaries-2_0_0-NGA-ADM1-shp.zip
geoBoundaries-2_0_0-NGA-ADM1.geojson: mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1.geojson),
continuing with "central" filename version
inflating: geoBoundaries-2_0_0-NGA-ADM1.geojson
geoBoundaries-2_0_0-NGA-ADM1-metaData.json: mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-metaData.json),
continuing with "central" filename version
inflating: geoBoundaries-2_0_0-NGA-ADM1-metaData.json
geoBoundaries-2_0_0-NGA-ADM1-metaData.txt: mismatching "local" filename (release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-metaData.txt),
continuing with "central" filename version
inflating: geoBoundaries-2_0_0-NGA-ADM1-metaData.txt
CITATION-AND-USE-geoBoundaries-2_0_0.txt: mismatching "local" filename (tmp/CITATION-AND-USE-geoBoundaries-2_0_0.txt),
continuing with "central" filename version
inflating: CITATION-AND-USE-geoBoundaries-2_0_0.txt
Analysis
The details for each entry in a zip file, including the filename, are stored twice: once in a local-header, directly before the compressed data, and again in a central-header at the end of the file. So for every file stored in a zip file there will be a local-header / central-header pair of fields. The data in these pairs of fields should be (mostly) identical.
In this instance they are not.
For example, consider the central-header entry for geoBoundaries-2_0_0-NGA-ADM1-shp.zip. The matching local-header records the filename as release/geoBoundaries-2_0_0/NGA/ADM1/geoBoundaries-2_0_0-NGA-ADM1-shp.zip instead.
The same is true for all the entries in this zip file.
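To check every entry rather than reading the unzip warnings, here is a minimal Python sketch that lists the central-header name next to the local-header name. It is an illustration under stated assumptions, not the tool used above: the function name compare_header_names is made up for this example, zipfile is used only to read the central directory, and the fixed 30-byte local-header layout is taken from APPNOTE.TXT.

```python
import struct
import zipfile

# Fixed 30-byte part of a local file header (layout per APPNOTE.TXT):
# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHLLLHH")


def compare_header_names(zip_path):
    """Return a list of (central_name, local_name) pairs, one per entry.

    Hypothetical helper for illustration only.
    """
    pairs = []
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        for info in zf.infolist():  # entries come from the central directory
            raw.seek(info.header_offset)  # start of the matching local header
            fields = LOCAL_HEADER.unpack(raw.read(LOCAL_HEADER.size))
            signature, flags, name_len = fields[0], fields[2], fields[9]
            if signature != b"PK\x03\x04":
                raise ValueError("no local header at offset %d" % info.header_offset)
            # Flag bit 11 marks a UTF-8 name; otherwise assume CP437.
            encoding = "utf-8" if flags & 0x800 else "cp437"
            local_name = raw.read(name_len).decode(encoding)
            pairs.append((info.filename, local_name))
    return pairs
```

Calling compare_header_names("geoBoundaries-2_0_0-NGA-ADM1-all.zip") and printing the pairs whose two names differ should show the same mismatches that unzip warns about.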
Given that this is a non-standard/invalid zip file, the behaviour when unzipping will be down to whether the unzipping utility uses the data in the central-header entry to determine the filenames or if it uses the equivalent data in the local-header.
It looks like Ubuntu's right-click 'extract' is using the local-header fields, while 7zip (like unzip above) uses the central-header fields.
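To make the difference concrete, here is a rough Python sketch of a reader that can take its filenames from either header. It is only an illustration of the idea, not how any of the tools above are implemented: read_entries and its name_source parameter are invented for this example, only stored and deflated entries are handled, and the data is returned in memory rather than written to disk.

```python
import struct
import zipfile
import zlib

# Fixed 30-byte part of a local file header (layout per APPNOTE.TXT).
LOCAL_HEADER = struct.Struct("<4sHHHHHLLLHH")


def read_entries(zip_path, name_source="central"):
    """Return {name: data}, naming entries from the chosen header.

    Hypothetical helper: name_source is "central" or "local".
    """
    if name_source not in ("central", "local"):
        raise ValueError("name_source must be 'central' or 'local'")
    result = {}
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        for info in zf.infolist():  # sizes and offsets from the central directory
            raw.seek(info.header_offset)
            fields = LOCAL_HEADER.unpack(raw.read(LOCAL_HEADER.size))
            flags, name_len, extra_len = fields[2], fields[9], fields[10]
            encoding = "utf-8" if flags & 0x800 else "cp437"
            local_name = raw.read(name_len).decode(encoding)
            raw.seek(extra_len, 1)  # skip the local extra field
            payload = raw.read(info.compress_size)
            if info.compress_type == zipfile.ZIP_DEFLATED:
                payload = zlib.decompress(payload, -15)  # raw deflate stream
            elif info.compress_type != zipfile.ZIP_STORED:
                raise NotImplementedError("unsupported compression method")
            name = info.filename if name_source == "central" else local_name
            result[name] = payload
    return result
```

On the archive in question, name_source="central" should give the 5 flat filenames, while name_source="local" should give the release/... and tmp/... paths, which corresponds to the two behaviours described above.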
For reference, the spec for zip files is PKWARE's APPNOTE.TXT.