google-drive-api zip compression google-colaboratory 7zip

How should one zip a large folder in Windows 10, upload it to GDrive, then unzip it?


I have a directory containing 22 subdirectories. Altogether, the directory is about 750 GB in size, and I need this data on Google Drive so that I can work with it in Google Colab. Uploading it file by file takes far too long (particularly with my slow connection), so I would like to zip it, upload it, then unzip it in the cloud. I am using 7-Zip to compress each subdirectory separately, with the zip format and the "Normal" compression level. (EDIT: I can now confirm that I get the same error with the 7z and tar formats.) Each zipped subdirectory ends up between 14 and 20 GB in size. I then upload an archive and attempt to unzip it in Google Colab using the following code:

from google.colab import drive

drive.mount('/content/gdrive/')
!apt-get install p7zip-full
!7za x "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -o"/content/gdrive/My Drive/unzipped_av_tfrecords/" -aos

This extracts some portion of the zip file before throwing an error. There are a variety of errors and sometimes the code will not even begin unzipping the file before throwing an error. This is the most common error:

Can not open the file as archive

ERROR: Unknown error -2147024891

Archives with Errors: 1

If I then attempt to rerun the !7za command, it may extract one or two more files from the zip file before throwing this error:

terminate called after throwing an instance of 'CInBufferException'

It may also complain about particular files within the zip archive:

ERROR: Headers Error : drumming/yt-g0fi0iLRJCE_23.tfrecords

I have also tried using:

!unzip -n "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -d "/content/gdrive/My Drive/unzipped_av_tfrecords/"

But that just begins throwing errors:

file #254:  bad zipfile offset (lseek):  8137146368

file #255:  bad zipfile offset (lseek):  8168710144

file #256:  bad zipfile offset (lseek):  8207515648

Although I would prefer a solution in Colab, I have also tried an app available in Google Drive named "Zip Extractor", but that too throws an error and has a data quota.

This has now happened across 4 zip files, and each time I try something new, it takes a long time to test because of the upload speeds. Any explanation of why this is happening and how I can resolve it would be greatly appreciated. I also understand that there are probably alternatives to what I am trying to do; those would be appreciated too, even if they do not directly answer the question. Thank you!


Solution

  • I had the same problem.

    I solved it with:

    new ProcessBuilder(new String[] {"7z", "x", fPath, "-o" + dir})

    Pass the command as an array of arguments, not as a single command-line string!

    Good luck!

    See also: Why does this command behave differently depending on whether it's called from terminal.app or a scala program?
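To make the answer's one-liner concrete, here is a minimal sketch of launching 7z from Java with an argument list rather than a single command string. It assumes a `7z` executable is on the PATH; the class name `SevenZipExtract` and the methods `buildCommand` and `extract` are hypothetical names chosen for illustration, and the `-aos` switch (skip files that already exist) is taken from the command in the question. Because each list element reaches 7z as exactly one argument, a path containing spaces, such as one under "My Drive", is not split apart. Whether this addresses the Colab errors described in the question is not established by the answer.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SevenZipExtract {

    // Builds the argument list for "7z x <archive> -o<dir> -aos".
    // Each element is handed to the process as one argument, so no
    // manual quoting is needed for paths that contain spaces.
    static List<String> buildCommand(String archivePath, String outputDir) {
        return Arrays.asList("7z", "x", archivePath, "-o" + outputDir, "-aos");
    }

    // Runs 7z and returns its exit code. Assumes "7z" is on the PATH.
    static int extract(String archivePath, String outputDir)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(archivePath, outputDir));
        pb.inheritIO(); // forward 7z's output and errors to this process's console
        Process process = pb.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java SevenZipExtract <archive> <outputDir>");
            return;
        }
        int exitCode = extract(args[0], args[1]);
        System.out.println("7z exit code: " + exitCode);
    }
}
```

By contrast, `Runtime.exec(String)` splits a single command string on whitespace, so a quoted path with spaces is broken into several arguments; passing a list or array avoids that.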