Tags: python, zip, concatenation, google-colaboratory, python-zipfile

Concatenating zip files in an incremental archived format directly in Python (to use in Colab)


I need to concatenate some files from GitHub that have been split into several pieces because of their size (as in this dataset: https://github.com/kang-gnak/eva-dataset).

Using requests, these end up in my temporary data storage, named File_Name.zip.001 through File_Name.zip.007.

The completed file contains images rather than text, so I haven't found a straightforward way to rebuild File_Name.zip from code.

Is anyone aware of a solution that would work directly in Colab?

I am looking for both repeatability and the ability to share my code as a Colab notebook, so I am trying to avoid solutions that involve having to download and rebuild the file locally and reuploading it each time. I would also prefer not to have to make an online copy of existing data if there's a way to rebuild and unzip the file directly from the code.

Thanks in advance.

I tried assigning a list of the parts' file names to data_zip_parts and running the following code:

import zipfile

with zipfile.ZipFile(data_path / "File_Name.zip", 'a') as full_zip:
    for file_name in data_zip_parts:
        with zipfile.ZipFile(data_path / file_name, 'r') as part:
            for name in part.namelist():
                full_zip.writestr(name, part.read(name))

However, it looks like this file format cannot be read directly, so I get the following error:

BadZipFile: File is not a zip file

To reiterate, I want to do this directly within Google Colab. The peers I asked mostly suggested solutions that run on my local system, such as the command line or 7-Zip, which isn't quite what I'm looking for. I expect there is a way to work around this format and would appreciate the assistance.


Solution

  • Understanding the Issue

    I downloaded the dataset from https://github.com/kang-gnak/eva-dataset to see what you are dealing with:

    $ ls -lh *
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.001
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.002
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.003
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.004
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.005
    -rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.006
    -rw-rw-r-- 1 paul paul  70M Oct  7 04:11 EVA_together.zip.007
    

    Let's see what the file command says about the content of these files:

    $ file *
    EVA_together.zip.001: Zip archive data, at least v2.0 to extract, compression method=store
    EVA_together.zip.002: data
    EVA_together.zip.003: data
    EVA_together.zip.004: data
    EVA_together.zip.005: data
    EVA_together.zip.006: data
    EVA_together.zip.007: OpenPGP Public Key
    

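    As I expected, only the first part even appears to be a zip file, and it has problems too. (The "OpenPGP Public Key" label on part 007 is most likely file misidentifying the fragment from its leading bytes.)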

    $ unzip -t EVA_together.zip.001
    Archive:  EVA_together.zip.001
      End-of-central-directory signature not found.  Either this file is not
      a zipfile, or it constitutes one disk of a multi-part archive.  In the
      latter case the central directory and zipfile comment will be found on
      the last disk(s) of this archive.
    unzip:  cannot find zipfile directory in one of EVA_together.zip.001 or
            EVA_together.zip.001.zip, and cannot find EVA_together.zip.001.ZIP, period.
    

    The Root Cause

    The issue here is that the files EVA_together.zip.001 .. EVA_together.zip.007 are just a simple byte-level split of one large zip file.

    Taken in isolation, none of these files is a valid, well-formed zip file. They are all just fragments.
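
The same behaviour can be reproduced in Python without the dataset. The sketch below is only an illustration: it builds a tiny stand-in archive in memory, cuts its bytes in two at an arbitrary point, and shows that zipfile rejects the first fragment but accepts the rejoined bytes. The helper name opens_as_zip and the demo file names are made up for this example.

```python
import io
import zipfile


def opens_as_zip(data: bytes) -> bool:
    """Return True if zipfile can open these bytes as a complete archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            zf.namelist()
        return True
    except zipfile.BadZipFile:
        return False


# Build a small stored (uncompressed) archive in memory as a stand-in
# for the real dataset.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("demo/a.txt", "a" * 1000)
    zf.writestr("demo/b.txt", "b" * 1000)
whole = buf.getvalue()

# Cut the raw bytes in two, the way a plain file splitter would.
first, second = whole[:1000], whole[1000:]

print(opens_as_zip(first))           # prints False: no central directory
print(opens_as_zip(first + second))  # prints True: fragments joined in order
```

The first fragment fails with the same "File is not a zip file" error as in the question, because the central directory lives at the end of the archive.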

    The Fix

    To recreate the composite zip file, you just need to concatenate the individual parts in order:

    $ cat EVA_together.zip.00* >EVA_together.zip
    $ ll -lh EVA_together.zip
    -rw-rw-r-- 1 paul paul 664M Dec  6 09:31 EVA_together.zip
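
If you would rather stay in Python, a rough equivalent of the cat command might look like the sketch below. The function name join_parts is hypothetical, and data_path in the commented usage is assumed to be the pathlib.Path from the question. Sorting the glob result relies on the zero-padded .001 .. .007 suffixes sorting correctly as text.

```python
import shutil
from pathlib import Path


def join_parts(parts, dest):
    """Concatenate the files in `parts`, in the order given, into `dest`."""
    dest = Path(dest)
    with dest.open("wb") as out:
        for part in parts:
            with Path(part).open("rb") as src:
                # Copy in chunks so a large part is never fully held in memory.
                shutil.copyfileobj(src, out)
    return dest


# Possible usage (paths are assumptions based on the question):
# parts = sorted(data_path.glob("EVA_together.zip.0*"))
# join_parts(parts, data_path / "EVA_together.zip")
```

The parts are opened in binary mode, so it does not matter that the archive holds images rather than text.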
    

    Check that we now have a valid zip file:

    $ file EVA_together.zip
    EVA_together.zip: Zip archive data, at least v2.0 to extract, compression method=store
    
    $ unzip -t EVA_together.zip
    Archive:  EVA_together.zip
        testing: EVA_together/            OK
        testing: EVA_together/10021.jpg   OK
        testing: EVA_together/100397.jpg   OK
    ...
        testing: EVA_together/99711.jpg   OK
        testing: EVA_together/99725.jpg   OK
        testing: EVA_together/9993.jpg    OK
        testing: EVA_together/9999.jpg    OK
    No errors detected in compressed data of EVA_together.zip.
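
The unzip -t check has a close counterpart in the standard zipfile module. A minimal sketch, assuming the rebuilt archive already exists on disk; the function name check_and_extract and the paths in the commented usage are illustrative only.

```python
import zipfile


def check_and_extract(zip_path, target_dir):
    """Test every member's CRC, then extract; return the number of members."""
    with zipfile.ZipFile(zip_path) as zf:
        bad = zf.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad}")
        zf.extractall(target_dir)
        return len(zf.namelist())


# Possible usage (paths are assumptions based on the question):
# check_and_extract(data_path / "EVA_together.zip", data_path)
```

testzip() reads every member, so on a 664M archive it takes a moment. You can skip it and call extractall() directly if speed matters more than the check.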
    

    I believe Colab allows shell escapes (prefixing a command with !), so writing the concatenation code in Python may not be needed. It depends on your workflow.