
When to and when not to use map() with multiprocessing.Pool in Python? The case of big input values


Is it efficient to calculate many results in parallel with multiprocessing.Pool.map() when each input value is large (say 500 MB), but the input values generally contain the same large object? I am afraid that multiprocessing works by sending a pickled copy of each input value to the worker processes in the pool. If no optimization is performed, this would mean sending a lot of data for each input value in map(). Is this the case? I had a quick look at the multiprocessing code but did not find anything obvious.

More generally, what simple parallelization strategy would you recommend for doing a map() on, say, 10,000 values, each of them a tuple (vector, very_large_matrix), where the vectors are always different but there are only, say, 5 different very large matrices?

PS: the big input matrices actually appear "progressively": 2,000 vectors are first sent along with the first matrix, then 2,000 vectors are sent with the second matrix, etc.
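
For concreteness, the pattern being asked about looks roughly like the sketch below (process_pair, big_matrices, and vectors are placeholder names, not taken from the question): every task passed to map() carries its own pickled copy of a large matrix, even though only a handful of distinct matrices exist.

    import numpy as np
    from multiprocessing import Pool

    def process_pair(args):
        # The whole matrix arrives pickled with every single task.
        vector, matrix = args
        return matrix @ vector  # placeholder computation

    if __name__ == "__main__":
        big_matrices = [np.random.rand(2000, 2000) for _ in range(5)]
        vectors = [np.random.rand(2000) for _ in range(20)]
        # Naive approach: each (vector, matrix) tuple ships the full matrix
        # through the pickle pipe to a worker.
        tasks = [(v, big_matrices[i % 5]) for i, v in enumerate(vectors)]
        with Pool(4) as pool:
            results = pool.map(process_pair, tasks)
        print(len(results), "results")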


Solution

  • I think that the obvious solution is to send a reference to the very_large_matrix instead of a clone of the object itself. If there are only five big matrices, create them in the main process. Then, when the multiprocessing.Pool is instantiated, it will create a number of child processes that clone the parent process' address space. That means that if there are six processes in the pool, there will be at most (1 + 6) * 5 matrices in memory simultaneously (with a copy-on-write fork, the physical pages can be shared for as long as the matrices are not modified).

    So in the main process create a lookup of all unique matrices:

    matrix_lookup = {1: matrix(...), 2: matrix(...), ...}
    

    Then pass the key of each matrix in matrix_lookup along with the vectors to the worker processes (a fuller runnable sketch follows after this snippet):

    from multiprocessing import Pool

    pool = Pool(6)
    pool.map(func, [(vector, 1), (vector, 2), (vector, 1), ...])
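
    Putting it together, here is a minimal end-to-end sketch. The names (process_item, the matrix sizes, the sample tasks) are illustrative rather than taken from the question, and it assumes a fork start method (the default on Linux), so workers inherit matrix_lookup from the parent; on spawn-based platforms each worker re-imports the module and rebuilds the lookup instead.

    import numpy as np
    from multiprocessing import Pool

    # Build the few large matrices once, in the parent process, keyed by a small integer.
    matrix_lookup = {i: np.random.rand(2000, 2000) for i in range(5)}

    def process_item(args):
        # Each task carries only a (vector, key) tuple; the large matrix is
        # looked up inside the worker, so it is never pickled per task.
        vector, matrix_key = args
        return matrix_lookup[matrix_key] @ vector  # placeholder computation

    if __name__ == "__main__":
        # 20 sample tasks cycling through the five matrices.
        tasks = [(np.random.rand(2000), i % 5) for i in range(20)]
        with Pool(6) as pool:  # forked workers inherit matrix_lookup
            results = pool.map(process_item, tasks)
        print(len(results), "results computed")

    Each task now pickles only a small vector plus an integer key, so the interprocess traffic stays small no matter how large the matrices are.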