Tags: python, python-3.x, multiprocessing, python-multiprocessing, pool

Can I map a subprocess to the same multiprocessing.Pool where the main process is running?


I am relatively new to the multiprocessing world in Python 3, so I apologize if this question has been asked before. I have a script which, given a list of N elements, runs the entire analysis on each element, mapping each one onto a different process.

I am aware that this is suboptimal; in fact, I want to increase the multiprocessing efficiency. I use map() to dispatch each task to a Pool(), which can contain as many processes as the user specifies via command-line arguments.

Here is what the code looks like:

from multiprocessing import Pool

max_processes = 7
# it is actually passed via the command line, but that is not relevant here

def main_function( ... ):

    res_1 = sub_function_1( ... )
    res_2 = sub_function_2( ... )

if __name__ == '__main__':

    p = Pool(max_processes)
    Arguments = []

    for x in Paths.keys():
        # generation of the arguments
        ... 
        Arguments.append( Tup_of_arguments )

    p.map(main_function, Arguments)

    p.close()
    p.join()

As you can see, each process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions can itself be parallelized. Can I map tasks from those sub-functions onto the same pool in which the main function is running?


Solution

  • No, you can't.
    The pool is (pretty much) not available in the worker processes. The details depend a bit on the start method used for the pool.

    spawn
    A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
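    You can verify this with a small sketch (probe is a made-up name for illustration):

    from multiprocessing import get_context

    def probe(_):
        # Under spawn, the module is re-imported in the worker with
        # __name__ == '__mp_main__', so the guarded __main__ block never
        # runs there and no pool object ever exists in the worker.
        return __name__

    if __name__ == '__main__':
        with get_context('spawn').Pool(2) as p:
            print(p.map(probe, range(4)))  # ['__mp_main__', '__mp_main__', ...]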

    fork
    The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
    However, that pool is unusable. The workers are created during the execution of the pool's __init__, so the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results; threads don't survive a fork into child processes anyway.
    Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
    Anyway, that pool object is a copy of the one in the main process.
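    You can go hunting for that copy inside a worker yourself (a rough sketch; the fork start method is only available on POSIX systems):

    import gc
    import multiprocessing.pool
    from multiprocessing import get_context

    def find_pool(_):
        # The forked worker inherited a copy of the half-initialized pool,
        # but no name refers to it; it is only reachable through the
        # garbage collector's list of tracked objects.
        return [repr(o) for o in gc.get_objects()
                if isinstance(o, multiprocessing.pool.Pool)]

    if __name__ == '__main__':
        with get_context('fork').Pool(2) as p:
            print(p.map(find_pool, range(2)))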

    forkserver
    I could not test this start method.

    To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
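    Here is a rough sketch of such a queue-based workaround (sub_task and worker are made-up names; a Manager().Queue() is used because plain multiprocessing.Queue objects cannot be passed as arguments to pool workers):

    import threading
    from multiprocessing import Pool, Manager

    def sub_task(x):
        # A hypothetical parallelizable sub-step of main_function.
        return x * x

    def worker(args):
        queue, item = args
        # Instead of creating its own pool, the worker sends the sub-task
        # back to the main process through the shared queue.
        queue.put(item)
        return item

    def handler(queue, pool, async_results):
        # Runs as a thread in the main process: receives sub-tasks from
        # the workers and delegates them to the very same pool.
        while True:
            item = queue.get()
            if item is None:      # sentinel: all map tasks are done
                break
            async_results.append(pool.apply_async(sub_task, (item,)))

    if __name__ == '__main__':
        with Manager() as manager, Pool(4) as pool:
            queue = manager.Queue()
            async_results = []
            t = threading.Thread(target=handler, args=(queue, pool, async_results))
            t.start()
            pool.map(worker, [(queue, i) for i in range(5)])
            queue.put(None)       # all workers have returned; stop the handler
            t.join()
            print(sorted(r.get() for r in async_results))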
    You'll very probably end up with much more maintainable code if you make the effort to adapt your code for processing in a pool.

    As an aside: I am not sure whether allowing users to pass the number of workers via the command line is a good idea. I recommend giving that value an upper bound via os.cpu_count() at the very least.
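    For instance (a minimal sketch; the argument name is hypothetical):

    import argparse
    import os

    cpus = os.cpu_count() or 1  # os.cpu_count() can return None

    parser = argparse.ArgumentParser()
    parser.add_argument('--processes', type=int, default=cpus)
    args = parser.parse_args()
    # clamp the user-supplied value to the range [1, cpus]
    max_processes = max(1, min(args.processes, cpus))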