Tags: c, performance, parallel-processing, mpi, hpc

How to send the last array element of each processor in MPI


I am struggling to write code that behaves like the following example, which is similar to the up-sweep (up phase) of a prefix scan, without using the function MPI_Scan:

WholeArray[16] = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

Processor 0 got [0, 1, 2, 3], Processor 1 got [4, 5, 6, 7]

Processor 2 got [8, 9, 10, 11], Processor 3 got [12, 13, 14, 15]

To send and sum the last array element over 2 strides:

(stride 1)

Processor 0 sends Array[3]; Processor 1 receives it from Processor 0 and adds it to its Array[3]

Processor 2 sends Array[3]; Processor 3 receives it from Processor 2 and adds it to its Array[3]

(stride 2)

Processor 1 sends Array[3]; Processor 3 receives it from Processor 1 and adds it to its Array[3]

Finally, I want to use MPI_Gather so that the result becomes:

WholeArray = [0, 1, 2, 3, 4, 5, 6, 10, 8, 9, 10, 11, 12, 13, 14, 36]

(Processor 1's last element becomes 7 + 3 = 10; Processor 3's becomes 15 + 11 = 26 after stride 1 and 26 + 10 = 36 after stride 2.)

I find it hard to write code that makes the program behave like the following 4-node example:

(1st stride) - Processor 0 sends to Processor 1, and Processor 1 receives from Processor 0
(1st stride) - Processor 2 sends to Processor 3, and Processor 3 receives from Processor 2

(2nd stride) - Processor 1 sends to Processor 3, and Processor 3 receives from Processor 1

Here is the code that I have written so far:

int Send_Receive(int* my_input, int size_per_process, int rank, int size)
{
    int key = 1;
    int temp = my_input[size_per_process-1];

    while(key <= size/2)
    {
        if((rank+1) % key == 0)
        {
            if(rank/key % 2 == 0)
            {
                MPI_Send(&temp, 1, MPI_INT, rank+key,0,MPI_COMM_WORLD);
            }
            else
            {
                MPI_Recv(&temp, 1, MPI_INT, rank-key,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
                my_input[size_per_process]+= temp;
            }
            key = 2 * key;
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    return (*my_input);
}

Solution

  • There are a few issues in your code, namely 1) it always sends the same temp variable across processes:

    MPI_Send(&temp, 1, MPI_INT, rank+key,0,MPI_COMM_WORLD);
    

    The temp variable is initialized before the loop:

     int temp = my_input[size_per_process-1];
     while(key <= size/2)
     { ...}
    

    but never updated inside the loop. This leads to wrong results, since after the first stride the last element of the my_input array differs between processes. Instead, you should do:

    temp = localdata[size_per_process-1];
    MPI_Send(&temp, 1, MPI_INT, rank+key, 0, MPI_COMM_WORLD);
                                
    

    Moreover, 2) the following statement

    my_input[size_per_process]+= temp;
    

    does not add temp to the last element of the array my_input; it actually writes one element past the end of the array. Instead, it should be:

    my_input[size_per_process-1]+= temp;
    

    Finally, 3) there are deadlock and infinite-loop issues. For starters, having a call to a collective communication routine such as MPI_Barrier inside a conditional that only some processes enter is typically a big red flag. Instead of:

    while(key <= size/2)
    {
       if((rank+1) % key == 0){
           ...
           MPI_Barrier(MPI_COMM_WORLD);
       }
    }
    

    you should have:

    while(key <= size/2)
    {
       if((rank+1) % key == 0){
           ...
       }
       MPI_Barrier(MPI_COMM_WORLD);
    }
    

    to ensure that every process calls MPI_Barrier.

    The infinite loop happens because the while condition depends on key being updated, but key is only updated when (rank+1) % key == 0 evaluates to true. Therefore, a process for which that condition evaluates to false never updates key and, consequently, gets stuck in an infinite loop.

    A running example with all the problems fixed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    
    int main(int argc, char **argv){
        int rank, mpisize, total_size = 16;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
        int *data = NULL;   
    
        if(rank == 0){
           data = malloc(total_size * sizeof(int));
           for(int i = 0; i < total_size; i++)
              data[i] = i;
        }
        int size_per_process = total_size / mpisize;
        int *localdata = malloc(size_per_process * sizeof(int));
        MPI_Scatter(data, size_per_process, MPI_INT, localdata, size_per_process, MPI_INT, 0, MPI_COMM_WORLD);
    
    int key = 1;
    int temp = 0;
    /* Up-sweep: at each stride, the "left" partner sends its last
       element and the "right" partner adds it to its own last element. */
    while(key <= mpisize/2){
      if((rank+1) % key == 0){
          if(rank/key % 2 == 0){
             /* sender: forward the current value of the last element */
             temp = localdata[size_per_process-1];
             MPI_Send(&temp, 1, MPI_INT, rank+key, 0, MPI_COMM_WORLD);
          }
          else {
             /* receiver: add the partner's last element to its own */
             MPI_Recv(&temp, 1, MPI_INT, rank-key, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             localdata[size_per_process-1]+= temp;
         }
      }
      key = 2 * key;                /* every rank updates the stride ... */
      MPI_Barrier(MPI_COMM_WORLD);  /* ... and every rank reaches the barrier */
    }
    
        MPI_Gather(localdata, size_per_process, MPI_INT, data, size_per_process, MPI_INT, 0, MPI_COMM_WORLD);
    
        if(rank == 0){
           for(int i = 0; i < total_size; i++)
                   printf("%d ", data[i]);
           printf("\n");
        }
        free(data);
        free(localdata);    
        MPI_Finalize();
        return 0;
    }
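
    The example can be compiled with an MPI compiler wrapper such as mpicc and launched on 4 processes with, e.g., mpirun -np 4 (or mpiexec -n 4), so that each process holds size_per_process = 4 elements.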
    

    Input:

    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
    

    Output:

    [0,1,2,3,4,5,6,10,8,9,10,11,12,13,14,36]
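
    As a side note, the stride logic can be checked without MPI at all. The following is a minimal, MPI-free sketch (an illustration only, with the number of processes hard-coded to 4 and assumed to be a power of two) that prints which rank sends its last element to which rank at each stride, using the same (rank+1) % key and rank/key % 2 tests as the loop above:

    #include <stdio.h>

    int main(void){
        const int mpisize = 4;                        /* assumed number of processes */
        for(int key = 1; key <= mpisize/2; key *= 2)  /* stride 1, 2, ... */
            for(int rank = 0; rank < mpisize; rank++)
                /* same participation and sender tests as in the MPI loop */
                if((rank+1) % key == 0 && (rank/key) % 2 == 0)
                    printf("stride %d: rank %d sends to rank %d\n", key, rank, rank+key);
        return 0;
    }

    For 4 processes this prints the pairs described in the question: 0 -> 1 and 2 -> 3 at stride 1, then 1 -> 3 at stride 2.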