Tags: python, python-3.x, optimization, zip, unzip

Single-file unzip optimization using Python


I have a large zip file that contains 1 file inside. I want to unzip that file to a given directory for further processing and used this code:

from zipfile import ZipFile, ZipInfo

def unzip(zip_source: ZipFile, member: ZipInfo, dest: str):
    zip_source.extract(member, dest)

This function is called using:

with ZipFile(file_path, "r") as zip_source:
    unzip(zip_source, zip_source.infolist()[0], extract_path)  # extract_path is correctly defined earlier in the code

Unzipping a large file (size > 500 MB) seems to take a long time, and I would like to optimize this solution.

All the optimizations I found were multiprocessing-based and aimed at extracting multiple files faster. My zip contains only a single file, so multiprocessing doesn't seem to be the answer.


Solution

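  • You cannot parallelize the decompression of a zip file that contains a single file, as long as the file is actually compressed with one of the usual dictionary-based algorithms (LZ77/LZW/LZSS; ZIP's default method, DEFLATE, combines LZ77 with Huffman coding). These algorithms are intrinsically sequential, because each decoded piece can refer back to data decoded earlier.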

    Moreover, these decompression methods are known to be slow (often much slower than reading the file from a storage device). This is mainly due to the algorithms themselves: their complexity and the fact that most mainstream processors cannot speed up the computation by a large margin.

    Thus, there is no way to decompress the file significantly faster, although you might find a slightly faster implementation by using another library.
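
To check how much time the extraction really takes, and to experiment with small tweaks, here is a minimal sketch that streams the single member to disk with a configurable copy buffer and reports the elapsed time. It uses only the standard library; `extract_single` and `chunk_size` are hypothetical names introduced for illustration, and the 1 MiB default is an assumption rather than a tuned value. It does not parallelize anything, so any gain over `ZipFile.extract` is expected to be small at best.

```python
import os
import shutil
import time
from zipfile import ZipFile


def extract_single(zip_path: str, dest_dir: str, chunk_size: int = 1024 * 1024) -> tuple[str, float]:
    """Stream the first member of zip_path into dest_dir.

    Returns the output path and the elapsed time in seconds.
    """
    start = time.perf_counter()
    os.makedirs(dest_dir, exist_ok=True)
    with ZipFile(zip_path, "r") as zip_source:
        member = zip_source.infolist()[0]
        # Keep only the base name so the output always lands inside dest_dir.
        out_path = os.path.join(dest_dir, os.path.basename(member.filename))
        with zip_source.open(member) as src, open(out_path, "wb") as dst:
            # Decompression still runs sequentially on one core; the larger
            # buffer only reduces per-chunk overhead of the copy loop.
            shutil.copyfileobj(src, dst, chunk_size)
    return out_path, time.perf_counter() - start
```

Calling `extract_single(file_path, extract_path)` and comparing the returned time with a timed run of the original `unzip` function on the same archive shows whether the buffer size matters on a given machine. If the two timings are close, the time is dominated by the decompression itself, as described above.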