Tags: bash, shell, zip

Zip directory in different batches


I'm trying to zip a massive directory of images that will be fed into a deep learning system. This is incredibly time-consuming, so I would like to be able to stop the zipping process prematurely with Ctrl+C and zip the directory in several "batches".

Currently I'm using zip -r9v folder.zip folder, and I've seen that the -u option updates changed files and adds new ones.

I'm worried that some file, or the zip archive itself, might end up corrupted if I terminate the process with Ctrl+C. From this answer I understand that cp can be terminated safely, and this other answer suggests that gzip is also safe.

Putting it all together: Is it safe to end the zip command prematurely? Is the -u option viable for zipping in several batches?


Solution

  • Is it safe to end the zip command prematurely?

    In my tests, canceling zip (Info-ZIP v3.0, 16 June 2008) with Ctrl+C did not create a zip archive at all, even when 2.5 GB of data had already been compressed. This fits with zip building the new archive in a temporary file and only putting it in place once the run has finished. Therefore, I would say Ctrl+C is "safe" (you won't end up with a corrupted file) but also pointless (you did all the work for nothing).
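
If you want to confirm that an interrupted run left nothing broken behind, you can test the archive afterwards. This is only a minimal sketch, assuming Info-ZIP's unzip is installed; `archive_ok` is a hypothetical helper name, not something zip provides.

```shell
#!/bin/bash
# archive_ok is a hypothetical helper: it succeeds only if the file exists
# and unzip's integrity test (-t) passes; -q keeps the output short.
archive_ok() {
  [[ -f "$1" ]] && unzip -tq "$1" >/dev/null 2>&1
}

if archive_ok folder.zip; then
  echo "folder.zip is intact"
else
  echo "folder.zip is missing or damaged"
fi
```

After a canceled first run you would expect the "missing or damaged" branch, because no archive was written at all.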

  • Is the -u option viable for zipping in several batches?

    Yes. Zip archives compress each file individually, so an archive built by adding files over several runs is as good as one built in a single run. Just remember that starting zip takes time too, so set the batch size as high as is acceptable.
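
Done by hand, adding files over several runs could look like the sketch below. `add_batch` is a hypothetical helper and the file names in the usage comment are placeholders; only the zip options -9, -q and -u are real.

```shell
#!/bin/bash
# add_batch is a hypothetical helper: the first call creates the archive,
# later calls pass -u so that only new or changed files are compressed.
add_batch() {
  local archive=$1
  shift
  if [[ -e $archive ]]; then
    zip -9qu "$archive" "$@"
  else
    zip -9q "$archive" "$@"
  fi
}

# Example (placeholder file names):
#   add_batch folder.zip folder/img_000*.png
#   add_batch folder.zip folder/img_001*.png
```

Each call is a complete zip run, so you can stop between calls without losing what earlier calls already stored.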

    Here is a script that adds all your files to the zip archive, but gives you a chance to stop the compression after every batch of 100 files.

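    #!/bin/bash
    # Usage: pass the index of the first file to process (default: 0).
    batchsize=100
    startfile=${1:-0}
    shopt -s globstar
    files=(folder/**)   # every file and directory below folder/
    echo "Press enter to stop compression after this batch."
    for ((; startfile<${#files[@]}; startfile+=batchsize)); do
      # Update the archive if it already exists, otherwise create it.
      [[ -e folder.zip ]] && u=u || u=
      # No -r here: the glob already lists every path, and -r would make
      # zip recurse into folder/ and add everything in the first batch.
      zip "-9v$u" folder.zip "${files[@]:startfile:batchsize}"
      if read -t 0; then
        next=$((startfile + batchsize))
        echo "Compression stopped before file $next."
        echo "Re-run this script with $next as its argument to continue."
        exit
      fi
    done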
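
To check how the file list will be cut up before starting a long run, you can do a dry run of the batching arithmetic. The sketch below calls no zip at all; `print_batches` is a hypothetical helper that prints the index range each `"${files[@]:startfile:batchsize}"` slice would cover.

```shell
#!/bin/bash
# print_batches is a hypothetical helper: given the number of files and the
# batch size, it prints the first and last array index of every batch.
print_batches() {
  local total=$1 size=$2 start last
  for ((start = 0; start < total; start += size)); do
    last=$((start + size - 1))
    if ((last >= total)); then
      last=$((total - 1))   # the final batch may be shorter
    fi
    echo "batch: files $start..$last"
  done
}

print_batches 250 100
# batch: files 0..99
# batch: files 100..199
# batch: files 200..249
```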
    

    For more speed you might want to look into alternative zip implementations.