python, performance, dictionary, multiprocessing, pool

Unexpected performance of multiprocessing.Pool()


I find that multiprocessing.Pool() doesn't behave as expected in my case below. Could anyone explain why it behaves this way and, if possible, how to improve the performance? Here is a simplified version of the code:

import numpy as np
import multiprocessing
from itertools import repeat

def group_data_by_runID(args):
    # Select the rows of `data` whose first column equals `runID`.
    data, runID = args
    return data[data[:, 0].astype(int) == runID, :]

# The block below ran in its own Jupyter cell, timed with the %%time magic:
%%time
DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])
runIDs = [0,1,2]*10000000
pool = multiprocessing.Pool(40)
list(pool.map(group_data_by_runID, zip(repeat(DATA), runIDs)))

As you can see in the code above, I intended to use 40 cores (the system has 56 cores and far more than enough memory) to run the code; it took 1min 31s. Then I used the plain serial version:

list(map(group_data_by_runID, zip(repeat(DATA), runIDs)))

It took 2min 33s. So using 40 cores gave less than a 2x speedup, which is very weird to me. I also noticed that even when I ask for 40 cores, it sometimes doesn't actually run on all 40, as can be seen in htop.

Where did I go wrong? And how can I improve the speed? Please note that the actual data is much larger.


Solution

  • Maybe there are still many people like me who are confused by the performance of multiprocessing in Python: sometimes you achieve a performance gain, and sometimes you even get worse performance. So I decided to answer this question myself, based on my own experience with multiprocessing.

    There can be significant overhead in using multiprocessing when the input data is large, because the data has to be pickled and sent to the worker processes, as juanpa commented above. In the code above, a copy of DATA is shipped with every one of the 30 million tasks. However, you can still get a huge performance gain by chopping the input into a small number of chunks and letting each process handle one chunk, as the first sketch below shows.

    Another scenario where a significant performance gain can be achieved is when there is little or no input data to send at all, for example when each worker reads its own data from one of tens or hundreds of files; the second sketch below illustrates this.

    Although multiprocessing can boost the speed, the majority of the effort should still be spent on the algorithm itself, which fundamentally determines the efficiency of the code.
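
As an illustration of the chunking point, here is a minimal sketch applied to the question's own example. The 40-way split and the group_chunk helper are my own illustrative choices, not part of the original code:

import numpy as np
import multiprocessing

DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],
                 [2,9],[2,10],[2,11],[2,12]])

def group_chunk(runID_chunk):
    # One task per chunk of runIDs; DATA is read at module level in the
    # workers, so it is not pickled and sent once per runID.
    return [DATA[DATA[:, 0].astype(int) == runID, :] for runID in runID_chunk]

if __name__ == '__main__':
    runIDs = [0, 1, 2] * 10_000_000
    n_workers = 40
    # 40 tasks instead of 30 million: far less pickling and queueing.
    chunks = np.array_split(np.array(runIDs), n_workers)
    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(group_chunk, chunks)

Note that pool.map also has a built-in chunksize argument, but it only batches the tasks; it would not stop DATA from being copied into every (DATA, runID) tuple, which is where most of the time goes in the original code.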
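
And here is a minimal sketch of the many-files scenario; the load_and_process helper and the file names are hypothetical, purely for illustration:

import numpy as np
import multiprocessing

def load_and_process(path):
    # Only the short path string crosses the process boundary; each worker
    # does its own file I/O and number crunching independently.
    data = np.loadtxt(path, delimiter=',')
    return data.mean(axis=0)

if __name__ == '__main__':
    paths = [f'run_{i}.csv' for i in range(200)]  # hypothetical file names
    with multiprocessing.Pool() as pool:
        means = pool.map(load_and_process, paths)

In cases like this, the work per task dwarfs the communication cost, so the speedup can approach the number of cores.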