
os.scandir and multiprocessing - ThreadPool works, but multi-process Pool doesn't


I have a task in a Python script that was previously mostly IO-bound, so I used ThreadPools and everything worked fine. Now my task is becoming more CPU-bound, so I wanted to switch to Pools with multiple processes.

I thought both interfaces behaved virtually identically, so I just switched the import and I should be good to go. However, suddenly my worker function isn't getting executed in the pool anymore.

After trying a couple of things, this seems to correspond to the fact that I'm passing a DirEntry from os.scandir() to my worker function. If I replace entry with a hardcoded string, my worker function is executed. Putting entry back, it stops working. Switching the import back to ThreadPool, it works again.

# This works.
from multiprocessing.pool import ThreadPool as Pool
import os

pool_size = 3

def worker(entry):
    print("Did some useful stuff!")

pool = Pool(pool_size)

for entry in os.scandir("Samples/"):
    if entry.is_file():
        pool.apply_async(worker, (entry,))

pool.close()
pool.join()

print("Finished multiprocessing task.")

Output:

Did some useful stuff! (~150x)
Finished multiprocessing task.

If I replace from multiprocessing.pool import ThreadPool as Pool with from multiprocessing import Pool, the only output I get is:

Finished multiprocessing task.

Now, if I pass a hardcoded string instead of the loop's entry into pool.apply_async(worker, (entry,)), e.g. pool.apply_async(worker, ("Why does this work?",)), the worker function runs and produces the same output as with the ThreadPool, but obviously with an argument I don't want to use in my actual script.

What's happening here?


Solution

  • The problem is that anything passed to a child process gets pickled, and that does not work for the DirEntry objects returned by os.scandir(). Unfortunately, apply_async does not show you the corresponding failure; a plain apply would, and that is how I tracked this down. Once you see what is going on, it actually does make sense:

    TypeError: can't pickle posix.DirEntry objects
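
    You can reproduce this directly, without a Pool, by pickling a DirEntry by hand — this is exactly what the Pool does to your arguments behind the scenes (a sketch; it creates a throwaway directory so os.scandir() has something to yield):

```python
import os
import pickle
import tempfile

# Throwaway directory so os.scandir() yields at least one DirEntry.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "sample.txt"), "w").close()
    entry = next(os.scandir(d))

    err = None
    try:
        pickle.dumps(entry)  # what Pool does with your arguments
    except TypeError as exc:
        err = exc
    print("pickling failed:", err)
```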
    

    Depending on what you need, you can pass entry.path or another picklable attribute of the DirEntry (so really also name), or the return values of its methods, into your worker, and the rest of your code should work as is.
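
    For instance (a sketch; the worker just echoes the path back, and a temporary directory with a couple of files stands in for Samples/):

```python
import os
import tempfile
from multiprocessing import Pool

def worker(path):
    # A plain string pickles fine, unlike a DirEntry.
    return "Did some useful stuff with " + path

if __name__ == "__main__":
    # Stand-in for the Samples/ directory from the question.
    with tempfile.TemporaryDirectory() as sample_dir:
        for name in ("a.txt", "b.txt"):
            open(os.path.join(sample_dir, name), "w").close()

        with Pool(3) as pool:
            results = [
                pool.apply_async(worker, (entry.path,))  # pass the str, not the DirEntry
                for entry in os.scandir(sample_dir)
                if entry.is_file()
            ]
            for r in results:
                print(r.get())
```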


    As for learning about such failures with apply_async, you could alternatively write a small callback such as:

    import sys, traceback

    def print_failed(caught):
        # No exception is active inside the callback, so print the one we were handed.
        traceback.print_exception(type(caught), caught, caught.__traceback__, file=sys.stderr)
    

    And register it with your apply_async call by adding error_callback=print_failed.
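
    Putting that together (a sketch; a generator is used as a stand-in unpicklable argument, since it fails to pickle just like a DirEntry does):

```python
import sys
from multiprocessing import Pool

failures = []

def worker(arg):
    return arg

def print_failed(caught):
    # Called in the parent process with the exception for the failed job.
    print("apply_async failed:", caught, file=sys.stderr)
    failures.append(caught)

if __name__ == "__main__":
    with Pool(2) as pool:
        # Generators, like DirEntry objects, cannot be pickled.
        pool.apply_async(worker, ((n for n in range(3)),),
                         error_callback=print_failed)
        pool.close()
        pool.join()
    print("failures seen:", len(failures))
```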