Tags: python, list, mpi, bin-packing

Most efficient way to assign lists of irregular length to sub-processes for processing


I have a number of objects (roughly 530,000). These objects are randomly assigned to a set of lists (not actually random, but let's assume it is). These lists are indexed consecutively and assigned to a dictionary, called groups, according to their index. I know the total number of objects, but I do not know the length of each list ahead of time (which in this particular case happens to vary between 1 and 36,000).

Next I have to process each object contained in the lists. In order to speed up this operation, I am using MPI to send them to different processes. The naive way to do this is to simply assign each process len(groups)/size lists (where size is the number of processes used), distribute any remainder, have each process work through the objects it received, return the results and wait. This obviously means, however, that if one process gets, say, a lot of very short lists and another gets all the very long lists, the first process will sit idle most of the time and the performance gain will not be very large.
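A minimal sketch of that naive split, assuming groups is available as a list of lists (e.g. list(groups.values()) from the dictionary described above) and size is the number of MPI processes:

    # Naive split: each process gets len(groups) // size consecutive lists,
    # and the first `remainder` processes get one extra list each.
    quotient, remainder = divmod(len(groups), size)
    chunks, start = [], 0
    for rank in range(size):
        end = start + quotient + (1 if rank < remainder else 0)
        chunks.append(groups[start:end])
        start = end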

What would be the most efficient way to assign the lists? One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible. But I am not sure how to best implement this. Does anybody have any suggestions?


Solution

  • One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible.

    Assuming that processing time scales exactly with the sum of the list lengths and your processor capacity is homogeneous, this is in fact what you want. This is known as the multiprocessor scheduling problem, which is closely related to the bin packing problem, but with a fixed number of bins and the goal of minimizing the maximum load rather than the number of bins.

    Generally this is an NP-hard problem, so you will not get a perfect solution. The simplest reasonable approach is to greedily assign the largest remaining chunk of work to the processor that has the least work assigned to it so far (the longest-processing-time-first heuristic).

    It is trivial to implement this in Python (the example uses a list of lists):

    import numpy as np

    # Hand out the longest lists first, each time to the process that
    # currently has the smallest total number of objects assigned.
    greedy = [[] for _ in range(nprocs)]
    for group in sorted(groups, key=len, reverse=True):
        smallest_index = np.argmin([sum(map(len, assignment)) for assignment in greedy])
        greedy[smallest_index].append(group)
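
    To actually hand the chunks to the workers, a minimal mpi4py sketch could look like the following; it assumes the assignment is built on rank 0 and uses a hypothetical process(obj) function for the per-object work:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Rank 0 scatters one entry of `greedy` to every rank (including itself).
    my_groups = comm.scatter(greedy if rank == 0 else None, root=0)

    # Each rank processes only the objects in its own lists.
    # `process` is a placeholder for whatever per-object work needs to be done.
    results = [process(obj) for group in my_groups for obj in group]

    # Collect the per-rank results back on rank 0.
    all_results = comm.gather(results, root=0)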
    

    If you have a large number of processors, you may want to optimize the smallest_index computation by using a priority queue (see the heap-based sketch after the plot). This will produce significantly better results than the naive sorted split as recommended by Attersson:

    Plot: resulting imbalance of the different implementations
    (https://gist.github.com/Zulan/cef67fa436acd8edc5e5636482a239f8)
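
    A minimal sketch of that priority-queue variant, assuming the same groups and nprocs as above; using heapq keeps the cost of finding the least-loaded process at O(log nprocs) instead of rescanning every assignment:

    import heapq

    greedy = [[] for _ in range(nprocs)]
    # Heap entries are (total assigned length, process index), so the
    # least-loaded process is always at the top of the heap.
    heap = [(0, i) for i in range(nprocs)]
    heapq.heapify(heap)
    for group in sorted(groups, key=len, reverse=True):
        load, index = heapq.heappop(heap)
        greedy[index].append(group)
        heapq.heappush(heap, (load + len(group), index))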