When to use, and when not to use, map() with multiprocessing.Pool in Python: the case of big input values

Is it efficient to calculate many results in parallel with multiprocessing.Pool.map() in a situation where each input value is large (say 500 MB), but where the input values generally contain the same large object? I am afraid that multiprocessing works by sending a pickled version of each input value to each worker process in the pool. If no optimization is performed, this would mean sending a lot of data for each input value in map(). Is this the case? I had a quick look at the multiprocessing code but did not find anything obvious.

More generally, what simple parallelization strategy would you recommend so as to do a map() on say 10,000 values, each of them being a tuple (vector, very_large_matrix), where the vectors are always different, but where there are say only 5 different very large matrices?

PS: the big input matrices actually appear "progressively": 2,000 vectors are first sent along with the first matrix, then 2,000 vectors are sent with the second matrix, etc.


  • I think the obvious solution is to send a reference to the very_large_matrix instead of a copy of the object itself. If there are only five big matrices, create them in the main process. Then, when the multiprocessing.Pool is instantiated, it will create a number of child processes that clone the parent process's address space. That means that if there are six processes in the pool, there can be up to (1 + 6) * 5 matrix copies in memory simultaneously.

    So in the main process create a lookup of all unique matrices:

    matrix_lookup = {1 : matrix(...), 2 : matrix(...), ...}

    Then pass the index of each matrix in the matrix_lookup along with the vectors to the worker processes:

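    p = Pool(6)
    # func is a placeholder for your worker function; it receives (vector, index)
    # and looks the matrix up in matrix_lookup itself
    p.map(func, [(vector, 1), (vector, 2), (vector, 1), ...])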
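
The lookup idea above can be sketched as a small runnable script. This is only an illustration under stated assumptions: the names MATRIX_LOOKUP, multiply and run are hypothetical, tiny nested lists stand in for the very large matrices, and the "fork" start method (POSIX only) is requested explicitly so that the workers really do start from a copy of the parent's memory. Only the small (vector, key) tuples are pickled for each task.

```python
import multiprocessing as mp

# Hypothetical stand-ins for the very large matrices. They are built once in
# the parent, at module level, before any pool exists.
MATRIX_LOOKUP = {
    1: [[1, 0], [0, 1]],  # identity
    2: [[2, 0], [0, 2]],  # scaling by 2
}


def multiply(task):
    """Multiply the matrix selected by `key` with `vector`."""
    vector, key = task
    # The matrix is looked up inside the worker; it is not sent with the task.
    matrix = MATRIX_LOOKUP[key]
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def run(tasks, processes=2):
    """Map `multiply` over (vector, key) tasks in a pool of forked workers."""
    # Assumption: "fork" is available (Linux and other POSIX systems), so each
    # worker inherits MATRIX_LOOKUP from the parent process.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes) as pool:
        return pool.map(multiply, tasks)


if __name__ == "__main__":
    tasks = [([1, 2], 1), ([3, 4], 2), ([5, 6], 1)]
    print(run(tasks))  # [[1, 2], [6, 8], [5, 6]]
```

One consequence for the "progressive" case in the question: forked workers only see the matrices that existed when the pool was created, so a matrix added to the lookup afterwards is not visible to workers that are already running. Creating a fresh pool after each new matrix has been built is one way around that.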