I have millions of compressed .bz2 files which I need to uncompress.
Can decompression be parallelized? I have access to a server with many CPU cores for this purpose.
I have been using the following code, which is correct but extremely slow.
import glob, bz2, shutil

files = glob.glob("/data01/*.bz2")
for fi in files:
    fo = fi[:-4]  # strip the ".bz2" suffix
    with bz2.BZ2File(fi) as fr, open(fo, "wb") as fw:
        shutil.copyfileobj(fr, fw)
Multithreading would be ideal for this because it's primarily IO-bound.
from concurrent.futures import ThreadPoolExecutor
import glob
import bz2
import shutil

def process(filename):
    # Write the decompressed data next to the original, minus the ".bz2" suffix.
    with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

def main():
    with ThreadPoolExecutor() as tpe:
        # Consume the iterator so any exception raised in a worker surfaces here.
        list(tpe.map(process, glob.glob('/data01/*.bz2')))

if __name__ == '__main__':
    main()
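
If measurement shows the job is limited by CPU rather than by disk (bz2 decompression is computationally heavy), a process pool may make better use of many cores. The following is only a sketch under that assumption: the names decompress and decompress_all and the chunksize of 64 are illustrative choices, not tuned or recommended values.

```python
import bz2
import glob
import shutil
from concurrent.futures import ProcessPoolExecutor


def decompress(path):
    """Decompress one .bz2 file next to the original and return the output path."""
    out_path = path[:-4]  # strip the ".bz2" suffix
    with bz2.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def decompress_all(pattern, workers=None):
    """Decompress every file matching pattern using a pool of worker processes."""
    paths = glob.glob(pattern)
    # workers=None lets the executor pick a default based on the CPU count.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize (an assumed value) hands tasks to workers in batches, which
        # reduces inter-process overhead when there are very many small files.
        return list(pool.map(decompress, paths, chunksize=64))


if __name__ == "__main__":
    decompress_all("/data01/*.bz2")
```

Timing both versions on a sample of a few thousand files is a reasonable way to decide between threads and processes before running on the full set.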