python multithreading tarfile

Python 2.7: Untar files in parallel (with threading)


I'm learning Python threading and at the same time trying to improve my old untarring script.

The main part of it looks like:

import tarfile, os, threading

def untar(fname, path):
    print "Untarring " + fname
    try:
        ut = tarfile.open(os.path.join(path,fname), "r:gz")
        ut.extractall(path)
        ut.close()
    except tarfile.ReadError as e:          # in case it's not gzipped
        print e
        ut = tarfile.open(os.path.join(path,fname), "r:*")
        ut.extractall(path)
        ut.close()

def untarFolder(path):
    if path == ".":
        path = os.getcwd()
    print "path", path
    ListTarFiles = serveMenu(path)         # function that scans the folder
                                           # for tar and tar.gz files
                                           # and returns a list of them
    print "ListTarFiles ", ListTarFiles 

    for filename in ListTarFiles:
        print "filename: ", filename
        t = threading.Thread(target=untar, args = (filename,path))
        t.daemon = True
        t.start()
        print "Thread:", t

So the goal is to untar all files in a given folder in parallel rather than one by one. Is that possible?

Output:

bogard@testlab:~/Toolz/untar$ python untar01.py -f .
path /home/bogard/Toolz/untar
ListTarFiles ['tar1.tgz', 'tar2.tgz', 'tar3.tgz']
filename:  tar1.tgz
Untarring tar1.tgz
 Thread: <Thread(Thread-1, started daemon 140042104731392)>
filename:  tar2.tgz
Untarring tar2.tgz
 Thread: <Thread(Thread-2, started daemon 140042096338688)>
filename:  tar3.tgz
Untarring tar3.tgz
 Thread: <Thread(Thread-3, started daemon 140042087945984)>

The output shows that the script creates the threads, but it doesn't untar any files. What's the catch?


Solution

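  • What is most likely happening is that your script returns before the threads actually complete. Because each thread is marked as a daemon (t.daemon = True), the interpreter does not wait for it when the main thread ends. You can wait for a thread to complete with Thread.join(). Maybe try something like this: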

    threads = []
    
    for filename in ListTarFiles:
        t = threading.Thread(target=untar, args = (filename,path))
        t.daemon = True
        threads.append(t)
        t.start()
    
    # Wait for each thread to complete
    for thread in threads:
        thread.join()
    

    Also, depending on the number of files you're untarring, you might want to limit the number of jobs running at once, so that you're not trying to untar 1000 files simultaneously. One way to do this is with a worker pool such as multiprocessing.Pool.
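
A minimal sketch of the pool idea, under a few assumptions. It uses multiprocessing.pool.ThreadPool (a thread-backed pool with the same interface) rather than the process-based multiprocessing.Pool, so it stays close to the threading code in the question. The names untar_all and workers are made up for this example, and the default of 4 workers is arbitrary. It also opens every archive with "r:*", which lets tarfile detect the compression itself, so the try/except fallback from the question is not needed here. Whether this is actually faster than extracting one by one depends on your disk and CPU.

```python
import os
import tarfile
from multiprocessing.pool import ThreadPool


def untar(archive_path, dest_dir):
    """Extract one archive into dest_dir and return the archive's path."""
    # "r:*" makes tarfile detect the compression (none, gzip, bz2) itself.
    tar = tarfile.open(archive_path, "r:*")
    try:
        tar.extractall(dest_dir)
    finally:
        tar.close()
    return archive_path


def untar_all(archive_names, folder, workers=4):
    """Extract every named archive found in folder, at most `workers` at a time.

    `untar_all` and `workers` are hypothetical names for this sketch.
    """
    pool = ThreadPool(workers)
    try:
        jobs = [pool.apply_async(untar, (os.path.join(folder, name), folder))
                for name in archive_names]
        # get() blocks until the job is done and re-raises any exception
        # that happened inside the worker, so failures are not silent.
        return [job.get() for job in jobs]
    finally:
        pool.close()
        pool.join()
```

In the question's untarFolder, the for loop could then be replaced by a single call such as untar_all(ListTarFiles, path). Because job.get() waits for each result, the function only returns once every archive has been extracted, so no separate join() loop is needed.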