python, python-multiprocessing

Multiprocess Slower When Variable Defined Inside Function


I have the following code:

import numpy as np
from multiprocess import Pool

data = np.zeros((50,50))

def foo():
    # data = np.zeros((50,50)) # This slows the code.
    
    def bar():
        data.shape
        
    with Pool() as pool:
        async_results = [pool.apply_async(bar) for x in range(20000)]
        out = [async_result.get() for async_result in async_results]
  
foo()

As written, it takes 3 seconds to run. But when I uncomment the first line of foo(), the code takes 10 seconds.

Commenting out the initial, module-level definition of data doesn't fix the issue, so I don't think the bottleneck is initializing data itself. I suspect the problem is passing data to each of the processes, but I can't confirm this, and I don't know why defining data outside of foo would help.

Why is there a discrepancy in speeds?


Solution

  • The discrepancy exists because globals get copied to the workers "for free": essentially completely free when the workers are forked, and free from the parent's perspective otherwise, because each child process recreates the module-level state when it launches. Closure-scoped variables can't be handled that way. They're copied in the fork scenario but not under any other start method, and even when they are copied, there's no meaningful way to look them up by name in the child process, so they get serialized again for each task.

    To support serializing closures, dill (the extended version of pickle underlying multiprocess that allows it to dispatch closure functions at all) has to serialize the array, send it across the IPC mechanism along with the rest of the data for that task, and deserialize it in the worker, and it repeats all of that once for every task. It may also have to use a more complex representation for the function itself: there are optimizations that keep nested, but non-closure, functions cheap to serialize, and those break down for true closures. The sketch below makes the per-task payload visible.
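
    A minimal sketch (not from the original post; make_closure and closure_bar are illustrative names) showing that dill must embed the array's bytes in the serialized closure, so the payload it ships per task is at least as large as the array itself:

    import dill
    import numpy as np

    def make_closure():
        captured = np.zeros((50, 50))   # lives in a closure cell, not in module globals
        def bar():
            return captured.shape
        return bar

    closure_bar = make_closure()
    payload = dill.dumps(closure_bar)

    # The serialized function carries the array's bytes with it, so the payload
    # is at least as large as the array (50 * 50 * 8 = 20000 bytes of float64).
    print(len(payload), np.zeros((50, 50)).nbytes)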

    It's essentially the same problem described in Python multiprocessing - Why is using functools.partial slower than default arguments?, caused by dill solving the same problem that functools.partial had to solve to be picklable. While regular multiprocessing doesn't support pickling nested functions at all, dill's support for closures effectively performs the same work as pickling a partial, and the same costs get paid, as the comparison below shows.
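
    For comparison, the same effect shows up with plain pickle and functools.partial: a module-level function pickles as little more than a dotted name, while a partial that binds the array has to carry the array's bytes. A rough sketch (get_shape and use_global are illustrative names, not from the original answer):

    import pickle
    from functools import partial

    import numpy as np

    data = np.zeros((50, 50))

    def get_shape(arr):
        return arr.shape

    def use_global():
        return data.shape

    # Pickled by reference: little more than "module.use_global".
    print(len(pickle.dumps(use_global)))

    # Pickled by value: the bound array travels with the partial (~20 kB).
    print(len(pickle.dumps(partial(get_shape, data))))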

    TL;DR: At global scope, you don't have to package the array data with each task. At closure scope, you do, dramatically increasing the work done to dispatch each task.
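
    If the array really does need to be built inside foo, one common workaround (a sketch under the same assumptions, not part of the original answer) is to hand the array to each worker exactly once through the pool's initializer and have the task function read a worker-level global, so per-task dispatch stays cheap:

    import numpy as np
    from multiprocess import Pool

    data = None  # placeholder; each worker process overwrites this in init_worker

    def init_worker(arr):
        global data
        data = arr               # runs once per worker, not once per task

    def bar():
        return data.shape        # reads the worker's own copy

    def foo():
        local_data = np.zeros((50, 50))
        with Pool(initializer=init_worker, initargs=(local_data,)) as pool:
            async_results = [pool.apply_async(bar) for _ in range(20000)]
            return [r.get() for r in async_results]

    if __name__ == "__main__":
        foo()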