Tags: c, performance, parallel-processing, mpi, hpc

How to send the last array element of each processor in MPI


I am struggling to write code that behaves like the following example, which is similar to the up-sweep (up phase) of a prefix scan, without using the function MPI_Scan:

WholeArray[16] = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

Processor 0 got [0, 1, 2, 3], Processor 1 got [4, 5, 6, 7]

Processor 2 got [8, 9, 10, 11], Processor 3 got [12, 13, 14, 15]

To send and sum the last array element over 2 strides:

(stride 1)

Processor 0 sends Array[3]; Processor 1 receives it from Processor 0 and adds it to its Array[3]

Processor 2 sends Array[3]; Processor 3 receives it from Processor 2 and adds it to its Array[3]

(stride 2)

Processor 1 sends Array[3]; Processor 3 receives it from Processor 1 and adds it to its Array[3]

Finally, I want to use MPI_Gather so that the result becomes:

WholeArray = [0, 1, 2, 3, 4, 5, 6, 10, 8, 9, 10, 11, 12, 13, 14, 36]

(Processor 1's last element becomes 7 + 3 = 10; Processor 3's becomes 15 + 11 = 26 after stride 1 and 26 + 10 = 36 after stride 2.)

I find it hard to write code that makes the program behave like the following 4-node example:

(1st stride) - Processor 0 sends to Processor 1, and Processor 1 receives from Processor 0
(1st stride) - Processor 2 sends to Processor 3, and Processor 3 receives from Processor 2

(2nd stride) - Processor 1 sends to Processor 3, and Processor 3 receives from Processor 1

Here is the code that I have written so far:

int Send_Receive(int* my_input, int size_per_process, int rank, int size)
{
    int key = 1;
    int temp = my_input[size_per_process-1];

    while(key <= size/2)
    {
        if((rank+1) % key == 0)
        {
            if(rank/key % 2 == 0)
            {
                MPI_Send(&temp, 1, MPI_INT, rank+key,0,MPI_COMM_WORLD);
            }
            else
            {
                MPI_Recv(&temp, 1, MPI_INT, rank-key,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
                my_input[size_per_process]+= temp;
            }
            key = 2 * key;
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    return (*my_input);
}

Solution

  • There are a few issues in your code, namely 1) it always sends the same temp variable across processes:

    MPI_Send(&temp, 1, MPI_INT, rank+key,0,MPI_COMM_WORLD);
    

    The temp variable is initialized before the loop:

     int temp = my_input[size_per_process-1];
     while(key <= size/2)
     { ...}
    

    but never updated inside the loop. This leads to wrong results, since after the first stride the last element of the my_input array differs between processes. Instead, you should do:

    temp = localdata[size_per_process-1];
    MPI_Send(&temp, 1, MPI_INT, rank+key, 0, MPI_COMM_WORLD);
                                
    

    Moreover, 2) the following statement

    my_input[size_per_process]+= temp;
    

    does not add temp to the last element of the array my_input; it actually writes one element past the end of the array. Instead, it should be:

    my_input[size_per_process-1]+= temp;
    

    Finally, 3) there are deadlock and infinite-loop issues. For starters, having a call to a collective communication routine such as MPI_Barrier inside a conditional that only some processes enter is typically a big red flag. Instead of:

    while(key <= size/2)
    {
       if((rank+1) % key == 0){
           ...
           MPI_Barrier(MPI_COMM_WORLD);
       }
    }
    

    you should have:

    while(key <= size/2)
    {
       if((rank+1) % key == 0){
           ...
       }
       MPI_Barrier(MPI_COMM_WORLD);
    }
    

    to ensure that every process calls MPI_Barrier.

    The infinite loop happens because the while condition depends on key being updated, but key is only updated when (rank+1) % key == 0 evaluates to true. Therefore, a process for which that condition evaluates to false never updates key and, consequently, gets stuck in an infinite loop.

    A running example with all the problems fixed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    
    int main(int argc, char **argv){
        int rank, mpisize, total_size = 16;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
        int *data = NULL;   
    
        if(rank == 0){
           data = malloc(total_size * sizeof(int));
           for(int i = 0; i < total_size; i++)
              data[i] = i;
        }
        int size_per_process = total_size / mpisize;
        int *localdata = malloc(size_per_process * sizeof(int));
        MPI_Scatter(data, size_per_process, MPI_INT, localdata, size_per_process, MPI_INT, 0, MPI_COMM_WORLD);
    
    int key = 1;
    int temp = 0;
    /* Up-sweep: at each stride, the "left" partner sends its last
       element and the "right" partner adds it to its own last element. */
    while(key <= mpisize/2){
      if((rank+1) % key == 0){
          if(rank/key % 2 == 0){
             /* sender: forward the current value of the last element */
             temp = localdata[size_per_process-1];
             MPI_Send(&temp, 1, MPI_INT, rank+key, 0, MPI_COMM_WORLD);
          }
          else {
             /* receiver: add the partner's last element to its own */
             MPI_Recv(&temp, 1, MPI_INT, rank-key, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             localdata[size_per_process-1]+= temp;
         }
      }
      key = 2 * key;                /* every rank updates the stride ... */
      MPI_Barrier(MPI_COMM_WORLD);  /* ... and every rank reaches the barrier */
    }
    
        MPI_Gather(localdata, size_per_process, MPI_INT, data, size_per_process, MPI_INT, 0, MPI_COMM_WORLD);
    
        if(rank == 0){
           for(int i = 0; i < total_size; i++)
                   printf("%d ", data[i]);
           printf("\n");
        }
        free(data);
        free(localdata);    
        MPI_Finalize();
        return 0;
    }
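
    The example can be compiled with an MPI compiler wrapper such as mpicc and launched on 4 processes with, e.g., mpirun -np 4 (or mpiexec -n 4), so that each process holds size_per_process = 4 elements.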
    

    Input:

    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
    

    Output:

    [0,1,2,3,4,5,6,10,8,9,10,11,12,13,14,36]
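
    As a side note, the stride logic can be checked without MPI at all. The following is a minimal, MPI-free sketch (an illustration only, with the number of processes hard-coded to 4 and assumed to be a power of two) that prints which rank sends its last element to which rank at each stride, using the same (rank+1) % key and rank/key % 2 tests as the loop above:

    #include <stdio.h>

    int main(void){
        const int mpisize = 4;                        /* assumed number of processes */
        for(int key = 1; key <= mpisize/2; key *= 2)  /* stride 1, 2, ... */
            for(int rank = 0; rank < mpisize; rank++)
                /* same participation and sender tests as in the MPI loop */
                if((rank+1) % key == 0 && (rank/key) % 2 == 0)
                    printf("stride %d: rank %d sends to rank %d\n", key, rank, rank+key);
        return 0;
    }

    For 4 processes this prints the pairs described in the question: 0 -> 1 and 2 -> 3 at stride 1, then 1 -> 3 at stride 2.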