I've used Python's `multiprocessing` to parallelise a function over a list of arguments, and the processes freeze partway through. When this happens, I run `top` followed by `1` on the Ubuntu machine (https://man7.org/linux/man-pages/man1/top.1.html) and see that the cores are all now mostly idle.

This is my code:
```python
from multiprocessing import Pool, Queue

class Parallelisable:
    # Adapted from: https://stackoverflow.com/questions/41992810/python-multiprocessing-across-different-files

    def _apply_function(self, function, args_queue):
        while not args_queue.empty():
            args = args_queue.get()
            function(*args)  # apply function to the arguments

    def parallelise(self, function, args_list):
        queue = Queue()
        for args in args_list:
            queue.put(args)
        # each worker runs _apply_function as its initializer and drains the queue
        pool = Pool(None, self._apply_function, (function, queue,))
        pool.close()  # signal that we won't submit any more tasks to the pool
        pool.join()   # wait until all processes are done


if __name__ == '__main__':
    # Define data_dir, output_dir, filenames and some_frozenset here
    # data_dir and output_dir are strings
    # filenames is a list of strings
    # some_frozenset is a frozenset of strings
    Parallelisable().parallelise(some_function, [(filename, data_dir, output_dir, some_frozenset) for filename in filenames])
```
I suspect that it is due to a deadlock.
I have come up with these possible explanations but they don't make much sense to me:
1. The `Parallelisable` object is a shared resource, with its lock acquired by one of the child processes at any one point in time, preventing a `self._apply_function()` call in the other child processes. I don't think this is the case, as I've had 2 child processes running at the same time. I'm guessing this could be solved by forcing child processes to call `execve`, using the multiprocessing `spawn` start method (see the sketch after this list).
2. `function` in `parallelise()` and `_apply_function()` is a shared resource, similar to point 1 above.
3. Child processes re-importing `__main__`, but I don't see that as a problem since I'm running on Ubuntu, not Windows.
4. `(filename, data_dir, output_dir, some_frozenset)` isn't threadsafe, which shouldn't be the case since the first 3 are immutable strings and the last is an immutable frozenset of immutable strings.

Is there anything I'm missing?
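For point 1, this is roughly how I'd force the `spawn` start method (an untested sketch, reusing the definitions from the code above):

```python
import multiprocessing as mp

if __name__ == '__main__':
    # "spawn" starts each worker as a fresh interpreter (fork + exec)
    # instead of fork's copy-on-write clone of the parent, so no locks
    # or other state are inherited from the parent process
    mp.set_start_method('spawn')
    Parallelisable().parallelise(some_function, [(filename, data_dir, output_dir, some_frozenset) for filename in filenames])
```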
By the way, I think that I can rewrite the code above like this:

```python
from multiprocessing import Pool

def parallelise_v2(function, args_list):
    with Pool(None) as pool:  # use the "spawn" start method if I want to call execve
        result = pool.starmap_async(function, args_list)
        result.get()  # wait inside the with block, since exiting it terminates the pool

if __name__ == '__main__':
    # Define data_dir, output_dir, filenames and some_frozenset here
    parallelise_v2(some_function, [(filename, data_dir, output_dir, some_frozenset) for filename in filenames])
```
It seemed to be because the child processes were memory-intensive and were silently killed by the Linux out-of-memory (OOM) killer; the parent process simply kept waiting for the dead children to return. I checked the reason for the killed processes with

```
dmesg -T | grep -E -i -m10 -B20 'killed process'
```

where `-m` specifies the maximum number of matches to return and `-B` specifies the number of lines to print before each "killed process" match.
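A child killed this way exits with a negative exit code (minus the signal number), which `Pool` does not surface (the parent just keeps waiting), but a bare `Process` exposes. A minimal sketch, with a stand-in worker:

```python
import multiprocessing as mp
import signal

def worker(num_bytes):
    buffer = bytearray(num_bytes)  # stand-in for a memory-intensive task

if __name__ == '__main__':
    child = mp.Process(target=worker, args=(10**9,))
    child.start()
    child.join()
    # exitcode is -N if the child died from signal N; the OOM killer sends SIGKILL,
    # so an exitcode of -9 suggests the child was OOM-killed
    if child.exitcode == -signal.SIGKILL:
        print('child was SIGKILLed, possibly by the OOM killer')
    else:
        print('child exited with code', child.exitcode)
```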
For more possible reasons and troubleshooting tips, look at the question description and the comments under the question. They include a few things to look out for, and the suggestion of `viztracer` to trace what happened. Quoting @minker:

> `viztracer` will output the file if the script is stuck. Just Ctrl-C out of it. Just remember you need `--log_multiprocess` for `multiprocessing` library when you use `viztracer`.
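For reference, the invocation would look something like this (`my_script.py` stands in for your entry script; if I recall correctly the trace is written to `result.json` by default):

```
viztracer --log_multiprocess my_script.py
# Ctrl-C once it hangs, then open the trace:
vizviewer result.json
```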