I need to concatenate some files from GitHub that have been split into several pieces because of their size (as in this dataset: https://github.com/kang-gnak/eva-dataset).
Using request, these end up in my temporary data storage in the format File_Name.zip.001 to File_Name.zip.007.
The completed file contains images rather than text, so I haven't found a straightforward way to rebuild File_Name.zip from code.
Is anyone aware of a solution that would work directly in Colab?
I am looking for both repeatability and the ability to share my code as a Colab notebook, so I am trying to avoid solutions that involve having to download and rebuild the file locally and reuploading it each time. I would also prefer not to have to make an online copy of existing data if there's a way to rebuild and unzip the file directly from the code.
Thanks in advance.
I attempted to use a list of the parts' file names, assigned to data_zip_parts, and ran the following code:
with zipfile.ZipFile(data_path / "File_Name.zip", 'a') as full_zip:
    for file_name in data_zip_parts:
        part = zipfile.ZipFile(data_path / file_name, 'r')
        for name in part.namelist():
            full_zip.writestr(name, zipfile.open(name).read())
However, it looks like this file format cannot be read directly, so I get the following error:
BadZipFile: File is not a zip file
Just a reminder that I want to do this directly within Google Colab. I have asked a few peers, but most of them gave me solutions to run on my local system, such as the command line or 7zip, which isn't quite what I'm looking for. I expect there is a way to work around this format, and I would appreciate the assistance.
I downloaded the dataset from https://github.com/kang-gnak/eva-dataset to see what you are dealing with:
$ ls -lh *
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.001
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.002
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.003
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.004
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.005
-rw-rw-r-- 1 paul paul 99M Oct 7 04:11 EVA_together.zip.006
-rw-rw-r-- 1 paul paul 70M Oct 7 04:11 EVA_together.zip.007
Let's see what the file command says about the content of these files:
$ file *
EVA_together.zip.001: Zip archive data, at least v2.0 to extract, compression method=store
EVA_together.zip.002: data
EVA_together.zip.003: data
EVA_together.zip.004: data
EVA_together.zip.005: data
EVA_together.zip.006: data
EVA_together.zip.007: OpenPGP Public Key
As I expected, only the first actually appears to be a zip file, but even it has problems:
$ unzip -t EVA_together.zip.001
Archive: EVA_together.zip.001
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of EVA_together.zip.001 or
EVA_together.zip.001.zip, and cannot find EVA_together.zip.001.ZIP, period.
The issue here is that the files EVA_together.zip.001 .. EVA_together.zip.007 are just a simple split of one large zip file.
Taken in isolation, none of these files is a valid, well-formed zip file. They are all just fragments.
To recreate the composite zip file, you just need to concatenate the individual parts in order:
$ cat EVA_together.zip.00* >EVA_together.zip
$ ll -lh EVA_together.zip
-rw-rw-r-- 1 paul paul 664M Dec 6 09:31 EVA_together.zip
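If you would rather stay in Python inside the notebook, the same concatenation can be sketched as below. This is a minimal sketch, not a tested recipe for this dataset: join_parts is a hypothetical helper name, and it assumes the parts sit in one directory and carry zero-padded three-digit suffixes (.001, .002, ...), so that sorting the names gives the right order. The parts are copied as raw bytes and never opened as zip files.

```python
import shutil
from pathlib import Path


def join_parts(data_path, base_name):
    """Concatenate base_name.001, base_name.002, ... into base_name.

    Hypothetical helper: assumes zero-padded three-digit suffixes.
    """
    data_path = Path(data_path)
    # Zero-padded suffixes sort correctly as plain strings.
    parts = sorted(data_path.glob(base_name + ".[0-9][0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {base_name} in {data_path}")
    target = data_path / base_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy raw bytes in chunks; the fragments are not parsed.
                shutil.copyfileobj(src, out)
    return target
```

With the names from the question, a call would look like join_parts(data_path, "File_Name.zip"), which writes File_Name.zip next to the parts and returns its path.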
Check that we now have a valid zip file:
$ file EVA_together.zip
EVA_together.zip: Zip archive data, at least v2.0 to extract, compression method=store
$ unzip -t EVA_together.zip
Archive: EVA_together.zip
testing: EVA_together/ OK
testing: EVA_together/10021.jpg OK
testing: EVA_together/100397.jpg OK
...
testing: EVA_together/99711.jpg OK
testing: EVA_together/99725.jpg OK
testing: EVA_together/9993.jpg OK
testing: EVA_together/9999.jpg OK
No errors detected in compressed data of EVA_together.zip.
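The same check can be done from Python with the standard zipfile module, followed by the extraction. Again, this is only a sketch: check_and_extract is a hypothetical helper name, and the destination directory is whatever suits your notebook.

```python
import zipfile
from pathlib import Path


def check_and_extract(zip_path, dest_dir):
    """Verify every member of the archive, then extract it into dest_dir.

    Hypothetical helper; returns the list of member names.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() reads all members and returns the name of the
        # first one with a bad CRC or header, or None if all are fine.
        bad = zf.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad}")
        zf.extractall(dest_dir)
        return zf.namelist()
```

Opening the archive with zipfile.ZipFile is itself a useful test: it raises BadZipFile, the error from the question, if it is pointed at a fragment rather than the rebuilt file.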
I believe Colab allows a shell escape (a line prefixed with ! in a cell), so writing the concatenation code in Python may not be needed. It depends on your workflow.