I have a Python script that concurrently processes numpy arrays and images in a random way. To get proper randomness inside the spawned processes, I pass a random seed from the main process to the workers so that each one can seed itself.
When I use maxtasksperchild for the Pool, my script hangs after running Pool.map a number of times.
The following is a minimal snippet that reproduces the problem:
# This code stops after multiprocessing.Pool workers are replaced one single time.
# They are replaced due to maxtasksperchild parameter to Pool
from multiprocessing import Pool
import numpy as np

def worker(n):
    # Removing np.random.seed solves the issue
    np.random.seed(1)  # any seed value
    return 1234  # trivial return value

# Removing maxtasksperchild solves the issue
ppool = Pool(20, maxtasksperchild=5)

i = 0
while True:
    i += 1
    # Removing np.random.randint(10) or taking it out of the loop solves the issue
    rand = np.random.randint(10)
    l = [3]  # trivial input to ppool.map
    result = ppool.map(worker, l)
    print i, result[0]
This is the output:

1 1234
2 1234
3 1234
.
.
.
99 1234
100 1234 # at this point workers should've reached maxtasksperchild tasks
101 1234
102 1234
103 1234
104 1234
105 1234
106 1234
107 1234
108 1234
109 1234
110 1234
It then hangs indefinitely.
I could potentially replace numpy.random with Python's random and work around the problem. However, in my actual application the worker executes user code (passed as an argument to the worker) that I have no control over, and I would like to allow that user code to use numpy.random functions. So I intentionally want to seed the global random generator (of each process independently).
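For concreteness, the seeding scheme I mean looks roughly like this (a sketch only; init_worker and the pid offset are illustrative, not my actual code):

import os
import numpy as np
from multiprocessing import Pool

def init_worker(base_seed):
    # Runs once in each freshly started worker: seed the global numpy
    # generator so every process gets an independent random stream.
    np.random.seed(base_seed + os.getpid())

ppool = Pool(20, initializer=init_worker, initargs=(42,), maxtasksperchild=5)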
This was tested with Python 2.7.10; numpy 1.11.0, 1.12.0, and 1.13.0; on Ubuntu and OS X.
It turns out this comes from a buggy interaction between Python's threading.Lock and multiprocessing.
np.random.seed and most np.random.* functions use a threading.Lock to ensure thread safety. An np.random.* function generates a random number and then updates the seed (shared across threads), which is why a lock is needed. See np.random.seed and cont0_array (used by np.random.random() and others).
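To see why the lock is needed, here is a toy version of that pattern (illustrative only, not numpy's actual implementation): every draw reads and overwrites the shared state, so two threads drawing at once would corrupt it without the lock.

import threading

class ToyRandomState(object):
    def __init__(self, seed):
        self.lock = threading.Lock()
        self.state = seed

    def random(self):
        # Draw a number and advance the shared state under the lock,
        # mirroring what np.random.* functions do internally.
        with self.lock:
            self.state = (1103515245 * self.state + 12345) % (2 ** 31)
            return self.state / float(2 ** 31)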
Now how does this cause a problem in the above snippet?
In a nutshell, the snippet hangs because the threading.Lock state is inherited when forking. So when a child is forked at the same moment the lock is held in the parent (acquired by np.random.randint(10)), the child deadlocks as soon as it tries to take the lock itself (in np.random.seed).
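You can reproduce the same deadlock without numpy at all. This sketch (assuming a POSIX system, where os.fork is available) hangs for exactly the same reason:

import os
import threading

lock = threading.Lock()
lock.acquire()  # the parent holds the lock, as np.random.randint does internally

pid = os.fork()
if pid == 0:
    # The child inherits the lock in its acquired state, but the thread
    # that held it does not exist here, so this blocks forever.
    lock.acquire()
    os._exit(0)
else:
    os.waitpid(pid, 0)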
@njsmith explains it in this GitHub issue: https://github.com/numpy/numpy/issues/9248#issuecomment-308054786
multiprocessing.Pool spawns a background thread to manage workers: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L170-L173
It loops in the background calling _maintain_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L366
If a worker exits, for example due to a maxtasksperchild limit, then _maintain_pool calls _repopulate_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L240
And then _repopulate_pool forks some new workers, still in this background thread: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L224
So what's happening is that eventually you get unlucky, and at the same moment that your main thread is calling some np.random function and holding the lock, multiprocessing decides to fork a child, which starts out with the np.random lock already held but the thread that was holding it is gone. Then the child tries to call into np.random, which requires taking the lock, and so the child deadlocks.
The simple workaround here is to not use fork with multiprocessing. If you use the spawn or forkserver start methods then this should go away.
For a proper fix.... ughhh. I guess we.. need to register a pthread_atfork pre-fork handler that takes the np.random lock before fork and then releases it afterwards? And really I guess we need to do this for every lock in numpy, which requires something like keeping a weakset of every RandomState object, and _FFTCache also appears to have a lock...
(On the plus side, this would also give us an opportunity to reinitialize the global random state in the child, which we really should be doing in cases where the user hasn't explicitly seeded it.)
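In code, the suggested workaround looks like this (a sketch; the spawn and forkserver start methods and multiprocessing.get_context require Python 3.4+, so they are not available on the Python 2.7 setup from the question):

import multiprocessing as mp
import numpy as np

def worker(n):
    np.random.seed(1)
    return 1234

if __name__ == '__main__':
    # 'spawn' (or 'forkserver') starts each worker fresh instead of
    # forking it, so no lock state is inherited from the parent.
    ctx = mp.get_context('spawn')
    ppool = ctx.Pool(20, maxtasksperchild=5)
    print(ppool.map(worker, [3]))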
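And the atfork idea from the last paragraph would look roughly like this in Python terms (a sketch assuming Python 3.7+, where os.register_at_fork exists; numpy itself would have to do the equivalent at the C level, for every lock it owns):

import os
import threading

rng_lock = threading.Lock()  # stand-in for numpy's internal RNG lock

os.register_at_fork(
    before=rng_lock.acquire,            # take the lock just before forking
    after_in_parent=rng_lock.release,   # release it again in the parent
    after_in_child=rng_lock.release,    # and in the child, which inherits it held
)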