I am working on a problem that allows for some rather unproblematic parallelisation. I am having difficulties figuring out what suitable parallelising mechanisms are available in Python. I am working with Python 3.9 on macOS.
My pipeline is:

- get_common_input() acquires some data in a way not easily parallelisable. If that matters, its return value common_input_1 is a list of lists of integers.
- parallel_computation_1() gets common_input_1 and an individual input from a list individual_inputs. The common input is only read.
- common_input_2 is more or less the collected outputs from parallel_computation_1().
- parallel_computation_2() then again gets common_input_2 as read-only input, plus some individual input.

I could do the following:
import multiprocessing

common_input_1 = None
common_input_2 = None

def parallel_computation_1(individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def parallel_computation_2(individual_input):
    return individual_input in common_input_2

def main():
    multiprocessing.set_start_method('fork')

    global common_input_1
    global common_input_2
    common_input_1 = [1, 2, 3, 1, 1, 3, 1]
    individual_inputs_1 = [0, 1, 2, 3]
    individual_inputs_2 = [0, 1, 2, 3, 4]

    # first stage: workers read common_input_1 via the inherited global
    with multiprocessing.Pool() as pool:
        common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)

    # second stage: a fresh pool so the workers see the updated common_input_2
    with multiprocessing.Pool() as pool:
        common_output = pool.map(parallel_computation_2, individual_inputs_2)

    print(common_output)

if __name__ == '__main__':
    main()
As suggested in this answer, I use global variables to share the data. That works if I use set_start_method('fork') (which works for me, but seems to be problematic on macOS).
Note that if I remove the second with multiprocessing.Pool() to have just one Pool used for both parallel tasks, things won't work (the processes don't see the new value of common_input_2).
Apart from the fact that using global variables seems like bad coding style to me (is it? That's just my gut feeling), the need to start a new pool for each stage doesn't please me, as it introduces some probably unnecessary overhead.

What do you think about these concerns, especially the second one?
Are there good alternatives? I see that I could use multiprocessing.Array, but since my data are lists of lists, I would need to flatten them into a single list and use that in parallel_computation in some nontrivial way. If my shared input were even more complex, I would have to put quite some effort into wrapping it into multiprocessing.Value or multiprocessing.Array objects.
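To illustrate what I mean by "nontrivial", here is a rough sketch of the kind of flattening I would have to do (the helper name flatten_to_shared and the 'i' typecode are just placeholders for this example):

import multiprocessing

def flatten_to_shared(list_of_lists):
    # concatenate all sub-lists into one flat list
    flat = [x for sub in list_of_lists for x in sub]
    # offsets[k] marks where sub-list k starts in the flat list
    offsets = [0]
    for sub in list_of_lists:
        offsets.append(offsets[-1] + len(sub))
    # 'i' assumes the values fit into C ints; lock=False because the data is only read
    shared_flat = multiprocessing.Array('i', flat, lock=False)
    shared_offsets = multiprocessing.Array('i', offsets, lock=False)
    return shared_flat, shared_offsets

# inside a worker, sub-list k would then be shared_flat[shared_offsets[k]:shared_offsets[k + 1]]

Every lookup in the workers would have to go through this index arithmetic, which is exactly the kind of effort I would like to avoid.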
You can define and compute output_1 as a global variable before creating your process pool; that way each worker process will have access to the data. With the fork start method this won't result in any memory duplication, because you're not changing that data (copy-on-write).
from multiprocessing import Pool

# computed once in the parent process, before the pool is created
_output_1 = serial_computation()

def parallel_computation(input_2):
    # here you can access _output_1;
    # you must not modify it, as that would create a new copy in the child process
    ...

def main():
    input_2 = ...
    with Pool() as pool:
        output_2 = pool.map(parallel_computation, input_2)
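If you would rather not rely on fork (given the macOS caveat you mention), one common alternative is to hand the common data to each worker through the pool's initializer. Here is a minimal sketch of that idea applied to your first stage (the names _init_worker and _common are just illustrative):

from multiprocessing import Pool

_common = None

def _init_worker(common):
    # runs once in every worker process and stores the common data in a module-level global
    global _common
    _common = common

def parallel_computation_1(individual_input):
    return sum(1 for i in _common if i == individual_input)

def main():
    common_input_1 = [1, 2, 3, 1, 1, 3, 1]
    individual_inputs_1 = [0, 1, 2, 3]
    # initargs are pickled and sent to each worker, so this also works with the 'spawn' start method
    with Pool(initializer=_init_worker, initargs=(common_input_1,)) as pool:
        common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
    print(common_input_2)

if __name__ == '__main__':
    main()

Note that with spawn each worker receives its own copy of the data once, so you trade the copy-on-write sharing of fork for portability; for the second stage you would create a second pool initialized with common_input_2 in the same way.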