I have the following code:
```python
import numpy as np
from multiprocess import Pool

data = np.zeros((50, 50))

def foo():
    # data = np.zeros((50, 50))  # This slows the code.
    def bar():
        data.shape
    with Pool() as pool:
        async_results = [pool.apply_async(bar) for x in range(20000)]
        out = [async_result.get() for async_result in async_results]

foo()
```
As written, it takes 3 seconds to run. But when I uncomment the first line of `foo()`, the code takes 10 seconds.

Commenting out the initial definition of `data` doesn't fix the issue, so I think the bottleneck isn't where `data` is initialized. I suspect the problem is passing `data` to each of the processes, but I can't confirm this. And I don't know why defining `data` outside of `foo` would help.

Why is there a discrepancy in speeds?
The discrepancy arises because globals get copied to the workers "for free" (either essentially free thanks to `fork`ing, or free from the parent process's perspective because the child processes recreate them on launch), while closure-scoped variables can't be: they're copied only in the `fork` scenario, and even when they are copied, there's no meaningful way to look them up in the child process, so they get copied again for each task.
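You can see the asymmetry directly by asking `dill` how big each payload is. This is a minimal sketch with my own names (`global_bar`, `make_closure`), not the question's code, and it assumes `dill`'s default settings, under which a module-level function pickles as little more than a reference while a closure embeds whatever it captured:

```python
# Sketch: compare dill payload sizes for a module-level function
# versus a closure over a same-sized array.
import dill
import numpy as np

data = np.zeros((50, 50))

def global_bar():
    return data.shape       # `data` is looked up in module globals at call time

def make_closure():
    local = np.zeros((50, 50))
    def bar():
        return local.shape  # `local` is captured in a closure cell
    return bar

print(len(dill.dumps(global_bar)))      # small: code plus a globals reference
print(len(dill.dumps(make_closure())))  # large: the whole 50x50 array is embedded
```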
To support serializing closures, `dill` (the extended version of `pickle` underlying `multiprocess` that allows it to dispatch closure functions at all) has to serialize the array, send it across the IPC mechanism with the rest of the data for that task, and deserialize it in the worker, repeating this once for every task. It may also need to use a more complex format for the function itself (there are optimizations that can keep nested, but non-closure, functions cheap to serialize, which break down for true closures).
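Because that work repeats per task, you can approximate the overhead by round-tripping the closure through `dill` once per task. A rough, illustrative sketch (the 20000 mirrors the question's task count; the loop body is my stand-in for what the pool does on each dispatch):

```python
# Rough sketch: time 20000 dill round-trips of a closure, approximating
# the extra serialization work the pool must do for closure tasks.
import time
import dill
import numpy as np

def make_closure():
    data = np.zeros((50, 50))
    def bar():
        return data.shape
    return bar

bar = make_closure()
start = time.perf_counter()
for _ in range(20000):
    dill.loads(dill.dumps(bar))  # serialize + deserialize, once per "task"
print(f"{time.perf_counter() - start:.2f} s spent just (de)serializing")
```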
It's essentially the same problem described in Python multiprocessing - Why is using functools.partial slower than default arguments?, caused by `dill` trying to solve the same problem that `functools.partial` had to solve to make itself picklable. While regular `multiprocessing` doesn't support pickling nested functions at all, `dill`'s support for closures effectively performs the same work as pickling a `partial`, and the same costs get paid.
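The analogy is easy to reproduce with plain `pickle`: a module-level function pickles by reference (module plus name), while a `partial` binding a large argument must embed it. This sketch uses my own names (`use`, `big`):

```python
# Sketch: plain pickle shows the same asymmetry for functools.partial.
import pickle
from functools import partial
import numpy as np

big = np.zeros((50, 50))

def use(arr):
    return arr.shape

print(len(pickle.dumps(use)))                # small: pickled by reference
print(len(pickle.dumps(partial(use, big))))  # large: the bound array is embedded
```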
TL;DR: At global scope, you don't have to package the array data with each task. At closure scope, you do, dramatically increasing the work done to dispatch each task.
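If you need to build the array inside `foo` anyway, one workaround (a sketch, not the only option) is to ship it to each worker exactly once via the pool's `initializer`, so per-task dispatch stays cheap:

```python
# Sketch of a workaround: install the array as a worker-side global once,
# so each task ships only a cheap function reference, not the array.
import numpy as np
from multiprocess import Pool

def init_worker(arr):
    global data
    data = arr                 # runs once per worker process, not per task

def bar():
    return data.shape          # resolved in the worker's globals at call time

def foo():
    arr = np.zeros((50, 50))   # built inside foo, as in the question
    with Pool(initializer=init_worker, initargs=(arr,)) as pool:
        async_results = [pool.apply_async(bar) for _ in range(20000)]
        return [r.get() for r in async_results]

if __name__ == "__main__":
    foo()
```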