
Fastest way to extract tar files using Python


I have to extract hundreds of tar.bz files, each about 5 GB in size. So I tried the following code:

import glob
import tarfile
from multiprocessing import Pool

files = glob.glob('D:\\*.tar.bz')  # All my files are in D
for f in files:
    tar = tarfile.open(f, 'r:bz2')
    pool = Pool(processes=5)
    pool.map(tar.extractall('E:\\'))  # I want to extract them in E
    tar.close()

But the code raises a type error: TypeError: map() takes at least 3 arguments (2 given)

How can I fix it? And are there any further ideas to speed up the extraction?


Solution

  • You need to change pool.map(tar.extractall('E:\\')) to something like pool.map(extract_function, list_of_all_files) — pass the function itself, not the result of calling it.

    Note that map() takes 2 arguments: the first is a function and the second is an iterable. It applies the function to every item of the iterable and returns a list of the results.

    Edit: you need to pass each archive's file name into the worker process:

    import glob
    import tarfile
    from multiprocessing import Pool


    def read_files(name):
        # Each worker opens and extracts one archive on its own.
        t = tarfile.open(name, 'r:bz2')
        t.extractall('E:\\')
        t.close()


    def test_multiproc():
        files = glob.glob('D:\\*.tar.bz')
        pool = Pool(processes=5)
        pool.map(read_files, files)


    if __name__ == '__main__':   # required for multiprocessing on Windows
        test_multiproc()
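As a minimal sketch of the two-argument map() signature described above (function first, iterable second; the `square` function here is just an illustrative stand-in for the extraction worker):

```python
from multiprocessing import Pool

def square(n):
    # Plain function that map() applies to each item of the iterable.
    return n * n

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Each item of the iterable is pickled and sent to a worker process, so the mapped function must be defined at module top level, not a bound method of an already-open TarFile.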