Tags: python, multiprocessing, python-multiprocessing

Providing shared read-only resources to parallel processes


I am working on a problem that allows for some rather unproblematic parallelisation. I am having difficulties figuring out what suitable parallelising mechanisms are available in Python. I am working with Python 3.9 on macOS.

My pipeline is:

  • get_common_input() acquires some data in a way that is not easily parallelisable. If that matters, its return value common_input_1 is a list of lists of integers.
  • parallel_computation_1() gets the common_input_1 and an individual input from a list individual_inputs. The common input is only read.
  • common_input_2 is more or less the collected outputs from parallel_computation_1().
  • parallel_computation_2() then again gets common_input_2 as read only input, plus some individual input.

I could do the following:

import multiprocessing
common_input_1 = None
common_input_2 = None

def parallel_computation_1(individual_input):
    # only reads common_input_1
    return sum(1 for i in common_input_1 if i == individual_input)

def parallel_computation_2(individual_input):
    # only reads common_input_2
    return individual_input in common_input_2

def main():
    multiprocessing.set_start_method('fork')  # forked workers inherit the globals
    global common_input_1
    global common_input_2
    common_input_1      = [1, 2, 3, 1, 1, 3, 1]
    individual_inputs_1 = [0,1,2,3]
    individual_inputs_2 = [0,1,2,3,4]
    with multiprocessing.Pool() as pool:
        common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
    # second pool: its workers are forked after common_input_2 has been set
    with multiprocessing.Pool() as pool:
        common_output = pool.map(parallel_computation_2, individual_inputs_2)
    print(common_output)

if __name__ == '__main__':
    main()

As suggested in this answer, I use global variables to share the data. That works if I use set_start_method('fork') (which works for me, but seems to be problematic on macOS).
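
The same fork-based inheritance can also be requested just for the pools via multiprocessing.get_context, without changing the process-wide start method; a minimal sketch (fork is only available on POSIX platforms, and the data here is illustrative):

import multiprocessing

common_input_1 = None  # read-only data, inherited by forked workers

def parallel_computation_1(individual_input):
    return sum(1 for i in common_input_1 if i == individual_input)

def main():
    global common_input_1
    common_input_1 = [1, 2, 3, 1, 1, 3, 1]
    ctx = multiprocessing.get_context('fork')  # affects only pools created from ctx
    with ctx.Pool() as pool:
        print(pool.map(parallel_computation_1, [0, 1, 2, 3]))

if __name__ == '__main__':
    main()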

Note that if I remove the second with multiprocessing.Pool() block to have just one Pool used for both parallel tasks, things won't work: the workers are forked when the pool is created, so they never see the new value of common_input_2.

Apart from the fact that using global variables seems like bad coding style to me (is it? that's just my gut feeling), the need to start a new pool bothers me, as it introduces some probably unnecessary overhead.

What do you think about these concerns, esp. the second one?

Are there good alternatives? I see that I could use multiprocessing.Array, but since my data is a list of lists, I would need to flatten it into a single list and use that in the parallel_computation functions in some nontrivial way. If my shared input were even more complex, I would have to put quite some effort into wrapping it in multiprocessing.Value or multiprocessing.Array objects.
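
For reference, a sketch of how such a flattening could look, with the flat data handed to each worker once through the pool's initializer/initargs arguments (flatten, _init_worker and sublist are illustrative names, not part of my actual code):

import multiprocessing

def flatten(list_of_lists):
    # flatten into one value list plus the start offset of each sublist
    flat, offsets = [], [0]
    for sub in list_of_lists:
        flat.extend(sub)
        offsets.append(len(flat))
    return flat, offsets

_flat = None     # shared flat values, set per worker by _init_worker
_offsets = None  # shared sublist offsets

def _init_worker(flat, offsets):
    global _flat, _offsets
    _flat, _offsets = flat, offsets

def sublist(i):
    # reconstruct the i-th original sublist inside a worker
    return _flat[_offsets[i]:_offsets[i + 1]]

def parallel_computation_1(individual_input):
    # example: count occurrences of individual_input across all values
    return sum(1 for v in _flat if v == individual_input)

def main():
    common_input_1 = [[1, 2], [3, 1], [1, 3, 1]]
    flat, offsets = flatten(common_input_1)
    shared_flat = multiprocessing.Array('i', flat, lock=False)
    shared_offsets = multiprocessing.Array('i', offsets, lock=False)
    # each worker receives the shared arrays once, at startup
    with multiprocessing.Pool(initializer=_init_worker,
                              initargs=(shared_flat, shared_offsets)) as pool:
        print(pool.map(parallel_computation_1, [0, 1, 2, 3]))

if __name__ == '__main__':
    main()

Passing the data via initializer/initargs sends it to each worker only once (rather than with every task), so it would also sidestep the dependence on fork.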


Solution

  • You can define and compute _output_1 as a global variable before creating your process pool; that way each process will have access to the data. This won't result in any memory duplication, because you're not changing that data (copy-on-write).

    from multiprocessing import Pool

    # computed once in the parent, before the pool is created
    _output_1 = serial_computation()


    def parallel_computation(input_2):
        # here you can access _output_1
        # you must not modify it, as that would create a new copy in the child process
        ...


    def main():
        input_2 = ...
        with Pool() as pool:
            output_2 = pool.map(parallel_computation, input_2)
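
    A runnable version of this pattern, assuming the fork start method is available (POSIX only); the body of serial_computation and the counting in parallel_computation are placeholders:

    from multiprocessing import Pool, set_start_method

    def serial_computation():
        # placeholder for the non-parallelisable step
        return [1, 2, 3, 1, 1, 3, 1]

    # computed at import time in the parent; forked workers inherit it
    _output_1 = serial_computation()

    def parallel_computation(input_2):
        # only reads _output_1; writing to it would trigger a copy in the child
        return sum(1 for i in _output_1 if i == input_2)

    def main():
        set_start_method('fork')  # the copy-on-write sharing relies on fork
        input_2 = [0, 1, 2, 3]
        with Pool() as pool:
            output_2 = pool.map(parallel_computation, input_2)
        print(output_2)

    if __name__ == '__main__':
        main()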