Search code examples
fortranmpidistributed-computing

Distributing uneven workload using MPI


I've got an Array

A(1:n_max)

which I would like to scatter with MPI in order to evaluate some f(A(j)). However the evaluation of f(A(1)) takes 0.35s and the evaluation of f(A(n_max)) takes 15s. I have different ideas on how to tackle it, but I'm not sure which is the best:

  1. Some Master/Slave work load distributing.

    Along the lines of this: http://www.hpc.cam.ac.uk/using-clusters/compiling-and-development/parallel-programming-mpi-example

  2. I reorder A with Reshape(Transpose(Reshape(A)), which would turn

    array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98])

into

array([[ 0, 33, 66,  1, 34, 67,  2, 35, 68,  3, 36, 69,  4, 37, 70,  5, 38,
        71,  6, 39, 72,  7, 40, 73,  8, 41, 74,  9, 42, 75, 10, 43, 76, 11,
        44, 77, 12, 45, 78, 13, 46, 79, 14, 47, 80, 15, 48, 81, 16, 49, 82,
        17, 50, 83, 18, 51, 84, 19, 52, 85, 20, 53, 86, 21, 54, 87, 22, 55,
        88, 23, 56, 89, 24, 57, 90, 25, 58, 91, 26, 59, 92, 27, 60, 93, 28,
        61, 94, 29, 62, 95, 30, 63, 96, 31, 64, 97, 32, 65, 98]])

which I then could distribute using scatter and gather. The problem is, that one process would have to run f(A(0)), f(A(33)) and f(A(66)), while another has to run f(A(32)), f(A(65)) and f(A(98)).

Sadly the run times in between rise monotonic, but not linear.

  1. I'm hoping for some of your Ideas

What do you recommend?


Solution

  • If the order of execution is important (e.g. caches) you could split the array in contiguous groups of different sizes but almost equal workloads and distribute them with MPI_SCATTERV. If the end of the array is to heavy (in case of work load) you could also split the array and use the same approach twice.

    If the order of execution is not important and you have reordering doesn't take to much time I would prefer your second approach.

    If you have always an array like this, you should think about the first solution but only send the interval limits instead of all numbers inside the interval. This makes especially sense if you are bandwidth-bound in your communication.