
Parallelizing a function with multiple list arguments using Python's multiprocessing


I hope this isn't a duplicate, but I couldn't find a fully satisfying answer for this specific problem.

Given a function that takes multiple list arguments and one iterable, for example with two lists:

def function(list1, list2, iterable):
    i1 = 2*iterable
    i2 = 2*iterable+1
    list1[i1] *= 2
    list2[i2] += 2
    return list1, list2

Each list is accessed at different entries, so the operations are independent and can be parallelized. What is the best way to do this with Python's multiprocessing?

One easy way to parallelize this would be to use the map function:

import multiprocessing as mp
from functools import partial

list1, list2 = [1,1,1,1,1], [2,2,2,2,2]
func = partial(function, list1, list2)
pool = mp.Pool()
pool.map(func, [0,1])

The problem is that, if I understand the map function correctly, this creates a copy of the lists for every process, and each process then works in parallel on different positions in its own copy. At the end (after both iterables [0, 1] have been processed) the result of pool.map is

[([3, 1, 1, 1, 1], [2, 4, 2, 2, 2]), ([1, 1, 3, 1, 1], [2, 2, 2, 4, 2])]

but I want

[([3, 1, 3, 1, 1], [2, 4, 2, 4, 2])].

How can I achieve this? Should one split the lists by the iterable beforehand, run the specific operations in parallel and then merge them again?

Thanks in advance, and please excuse me if I mix something up; I just started using the multiprocessing library.

EDIT: Operations on different parts of a list can be parallelized without synchronization; operations on the whole list cannot (without synchronization). Therefore a solution to my specific problem is to split the lists into parts and the function into the operations on those parts. Afterwards one merges the parts to get the whole lists back.
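
The split-and-merge idea from the EDIT could look roughly like the sketch below. The helper names split_lists, process_chunk and merge_parts are hypothetical (they are not from the original post), and the chunk size of 2 is an assumption chosen to match the 2*i / 2*i+1 indexing of the example function.

```python
import multiprocessing as mp


def split_lists(list1, list2, indices):
    # Hypothetical helper: cut out the two-element slice each index touches.
    return [(list1[2 * i:2 * i + 2], list2[2 * i:2 * i + 2]) for i in indices]


def process_chunk(chunk):
    # Works only on its own small slices, so no synchronization is needed.
    part1, part2 = chunk
    part1[0] *= 2  # corresponds to list1[2*i] *= 2
    part2[1] += 2  # corresponds to list2[2*i + 1] += 2
    return part1, part2


def merge_parts(list1, list2, indices, results):
    # Write the processed slices back into copies of the original lists.
    merged1, merged2 = list(list1), list(list2)
    for i, (part1, part2) in zip(indices, results):
        merged1[2 * i:2 * i + 2] = part1
        merged2[2 * i:2 * i + 2] = part2
    return merged1, merged2


if __name__ == "__main__":
    list1, list2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
    indices = [0, 1]
    chunks = split_lists(list1, list2, indices)
    with mp.Pool(processes=2) as pool:
        results = pool.map(process_chunk, chunks)
    print(merge_parts(list1, list2, indices, results))
    # ([2, 1, 2, 1, 1], [2, 4, 2, 4, 2])
```

Only the small slices are sent to the workers, so the full lists never have to be copied into each process.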


Solution

  • Processes do not share memory by default (technically, they can on fork-based systems, provided you don't change the objects or affect their reference counts, which rarely holds in real-world usage). Your options are either to use a shared structure (most of them are available through multiprocessing.Manager()), which handles the synchronization and updates for you, or to pass only the data needed for processing and then stitch the results back together.

    Your example is simple enough for both approaches to work without serious penalties so I'd just go with a manager:

    import multiprocessing
    import functools
    
    def your_function(list1, list2, iterable):
        i1 = 2 * iterable
        i2 = 2 * iterable + 1
        list1[i1] *= 2
        list2[i2] += 2
    
    if __name__ == "__main__":  # a multi-processing guard for cross-platform use
        manager = multiprocessing.Manager()
        l1 = manager.list([1, 1, 1, 1, 1])
        l2 = manager.list([2, 2, 2, 2, 2])
        func = functools.partial(your_function, l1, l2)
        with multiprocessing.Pool() as pool:  # shuts down the workers on exit
            pool.map(func, [0, 1])
        print(l1, l2)  # [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]
    

    Or if your use case is more favorable to stitching the data after processing:

    import multiprocessing
    import functools
    
    def your_function(list1, list2, iterable):
        i1 = 2 * iterable
        i2 = 2 * iterable + 1
        return (i1, list1[i1] * 2), (i2, list2[i2] + 2)  # return the changed index and value
    
    if __name__ == "__main__":  # a multi-processing guard for cross-platform use
        l1 = [1, 1, 1, 1, 1]
        l2 = [2, 2, 2, 2, 2]
        func = functools.partial(your_function, l1, l2)
        with multiprocessing.Pool() as pool:  # shuts down the workers on exit
            results = pool.map(func, [0, 1])
        for r1, r2 in results:  # stitch the results back into l1 and l2
            l1[r1[0]] = r1[1]
            l2[r2[0]] = r2[1]
        print(l1, l2)  # [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]
    

    That being said, the output differs from what you listed as expected (your expected list1 contains 3s), but it is what your function actually produces: 1 * 2 is 2, not 3.

    Also, if your case is this simple you might want to steer clear of multiprocessing altogether: the overhead that multiprocessing adds (plus the manager synchronization) is not worth it unless your_function() does some really CPU-intensive task.
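
For comparison, a minimal serial version of the same work is sketched below. It reuses your_function from the manager example and assumes the same inputs; no pool, manager or pickling is involved, so it is a reasonable baseline to measure against before reaching for multiprocessing.

```python
def your_function(list1, list2, iterable):
    i1 = 2 * iterable
    i2 = 2 * iterable + 1
    list1[i1] *= 2  # modifies the caller's list in place
    list2[i2] += 2


l1 = [1, 1, 1, 1, 1]
l2 = [2, 2, 2, 2, 2]
for i in [0, 1]:  # plain loop in a single process
    your_function(l1, l2, i)
print(l1, l2)  # [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]
```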