python · multithreading · parallel-processing · multiprocessing · compression

Parallelization of un-bzipping millions of files


I have millions of compressed .bz2 files which I need to decompress.

Can decompression be parallelized? I have access to a server with many CPU cores for this purpose.

I used the following code, which is correct but extremely slow.

import glob
import bz2
import shutil

files = glob.glob("/data01/*.bz2")
for fi in files:
    fo = fi[:-4]  # strip the ".bz2" suffix
    with bz2.BZ2File(fi) as fr, open(fo, "wb") as fw:
        shutil.copyfileobj(fr, fw)

Solution

  • Multithreading is a good fit here: the work is largely I/O-bound, and CPython's bz2 module releases the GIL during decompression, so threads can genuinely run in parallel.

    from concurrent.futures import ThreadPoolExecutor
    import glob
    import bz2
    import shutil
    
    def process(filename):
        # Decompress one file; the output path drops the ".bz2" suffix.
        with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
            shutil.copyfileobj(fr, fw)
    
    def main():
        # Since Python 3.8, the default worker count is min(32, os.cpu_count() + 4).
        with ThreadPoolExecutor() as tpe:
            tpe.map(process, glob.glob('/data01/*.bz2'))
    
    if __name__ == '__main__':
        main()
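If profiling shows the threaded version is still CPU-bound on your hardware, a process-based variant sidesteps the GIL entirely at the cost of process startup overhead. This is a sketch, not the answerer's code: it keeps the same logic but swaps in `ProcessPoolExecutor` and makes the glob pattern a parameter (the `/data01/*.bz2` default is taken from the question).

```python
from concurrent.futures import ProcessPoolExecutor
import bz2
import glob
import shutil

def decompress(filename):
    # Decompress one .bz2 file; the output path drops the ".bz2" suffix.
    out = filename[:-4]
    with bz2.BZ2File(filename) as fr, open(out, "wb") as fw:
        shutil.copyfileobj(fr, fw)
    return out

def main(pattern="/data01/*.bz2"):
    # Worker functions must be importable at module level so they can
    # be pickled and sent to the child processes.
    with ProcessPoolExecutor() as ppe:
        # list() drains the iterator, so worker exceptions surface here.
        return list(ppe.map(decompress, glob.glob(pattern)))

if __name__ == "__main__":
    main()
```

Note that `Executor.map` also accepts a `chunksize` argument, which reduces inter-process communication overhead when submitting millions of small tasks to a process pool.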