Tags: c, performance, parallel-processing, mpi, hpc

Garbage values received while using MPI_Pack and MPI_Unpack along with MPI_Send and MPI_Recv


I am trying to send part of a matrix from one process to another. This is the given matrix:

[image: the 10 x 8 input matrix]

It has 10 rows and 8 columns. I am trying to send half of the columns (columns 4 to 7; the matrix is 0-indexed) from process 0 to process 1 with the help of MPI_Pack(). For that I am using the following code:

    double snd_buf[4][r];   //r is the number of rows
    double recv_buf[4][r];
    double buf[4][r];

    MPI_Request request[c];    //c is the number of columns
    MPI_Request request1[c];
    MPI_Status status[c];

    //packing and sending the data
    if(myrank==0)
    {
        //we will send half of the matrix to process 1
        for(int j=4;j<c;j++)
        {
            position=0; //reassigning position after each and every send

            for(int i=0;i<r;i++)
            {
                MPI_Pack(&mat[i][j], 1 , MPI_DOUBLE, snd_buf[j-4], 80, &position, MPI_COMM_WORLD);
            }
        }

        //sending all the buffers
        for(int j=4;j<c;j++)
        {
            MPI_Send (snd_buf[j-4], 10 , MPI_PACKED, 1 /*dest*/ , j /*tag*/ , MPI_COMM_WORLD);
        }
    }

And for receiving I am using the following code.

    if(myrank==1)
    {
        for(j=4;j<c;j++)
        {
            MPI_Recv(recv_buf[j-4], 10, MPI_PACKED, 0 /*src*/ , j /*tag*/, MPI_COMM_WORLD, &status[j]);
        }

        for(int j=4;j<c;j++)
        {
            position=0;
            for(int i=0;i<r;i++)
            {
                MPI_Unpack(recv_buf[j-4], 80, &position, &buf[j-4][i], 1 /*outcount*/, MPI_DOUBLE, MPI_COMM_WORLD);
            }
        }
    }

But when I print the contents of recv_buf, in some cases I get only the first element of each row followed by 0s, and in some cases some garbage values as well. Given below are the contents of recv_buf.

Example-1:

[image: contents of recv_buf, first run]

Example-2:

[image: contents of recv_buf, second run]

I have checked my snd_buf[] as well, and it packs all the values correctly.

I cannot figure out where I am going wrong and why I am getting these 0s and sometimes garbage values in recv_buf. Please help.


Solution

  • First

    double snd_buf[4][r];   //r is the number of rows
    double recv_buf[4][r];
    double buf[4][r];
    

    I think you meant:

    double snd_buf[r][4];   //r is the number of rows
    double recv_buf[r][4];
    double buf[r][4];
    

    From the documentation one can read:

    MPI_Pack - Packs data of a given datatype into contiguous memory.
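
    For reference, the MPI_Pack prototype is:

        int MPI_Pack(const void *inbuf, int incount, MPI_Datatype datatype,
                     void *outbuf, int outsize, int *position, MPI_Comm comm);

    The output buffer is just raw bytes, and position is advanced by every call, which is what lets you append several pieces of data into one contiguous buffer.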

    You are misusing the packing/unpacking feature. Instead of packing each element to be sent, you just need to pack, from each row, the columns that you want to send. Since the rows are allocated contiguously in memory, you can pack them in one go; there is no need to pack each column separately. Moreover, you are performing multiple calls to the send:

    for(int j=4;j<c;j++){
        MPI_Send (snd_buf[j-4], 10 , MPI_PACKED, 1 /*dest*/ , j /*tag*/ , MPI_COMM_WORLD);
    }
    

    The point of packing is to pack everything into a single buffer and send/recv it in one go (a sketch of that single-buffer approach is shown a bit further below). If you are going to perform multiple MPI_Send calls anyway, then there is not much benefit in packing; you are better off just sending/receiving the columns directly, without packing anything, as follows:

    if(myrank==0){
        for(int i=0;i<r;i++) // send columns 4 to 7 of each row
            MPI_Send(&mat[i][4], 4, MPI_DOUBLE, 1, i /*tag*/, MPI_COMM_WORLD);
    }
    ...
    if(myrank==1){
        for(int i=0;i<r;i++) // receive them into the corrected double buf[r][4]
            MPI_Recv(buf[i], 4, MPI_DOUBLE, 0, i /*tag*/, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    

    Those, among others, are the fundamental errors that you need to fix in your logic to make it work.
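
    If you do want to keep MPI_Pack, here is a minimal sketch of the intended usage (assuming the question's mat, r = 10 rows and myrank, with stdlib.h included for malloc): pack the 4-column slice of every row into one buffer with a single running position, send the packed bytes once, and unpack them on the receiving side.

    int position = 0, bufsize;
    MPI_Pack_size(r * 4, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize); // upper bound in bytes
    char *packbuf = malloc(bufsize);

    if (myrank == 0) {
        // Pack columns 4..7 of every row into ONE buffer; position keeps advancing.
        for (int i = 0; i < r; i++)
            MPI_Pack(&mat[i][4], 4, MPI_DOUBLE, packbuf, bufsize, &position, MPI_COMM_WORLD);

        // A single send of the packed bytes.
        MPI_Send(packbuf, position, MPI_PACKED, 1, 0 /*tag*/, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Status st;
        MPI_Recv(packbuf, bufsize, MPI_PACKED, 0, 0 /*tag*/, MPI_COMM_WORLD, &st);

        double half[10][4]; // columns 4..7 of each of the 10 rows
        for (int i = 0; i < r; i++) // position is still 0 on this rank
            MPI_Unpack(packbuf, bufsize, &position, half[i], 4, MPI_DOUBLE, MPI_COMM_WORLD);
    }
    free(packbuf);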


    That being said, it is much easier, and more efficient, to solve this problem by sending half of the rows instead of half of the columns.

    You can first allocate a contiguous 2D array (or simply represent the matrix as a 1D array) and just send/recv half of the rows with a single call.

    Here is a toy example illustrating the approach (it only works with two processes and it is not production-ready):

    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"
    
    #define ROWS 10
    #define COLS 8
    
    int main( int argc, char *argv[])
    {
         MPI_Status status;
         MPI_Init(&argc, &argv);    
         int myrank, size; // size holds the number of processes
         MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
             
         if(myrank == 0){
            int (*arr)[COLS] = malloc(sizeof *arr * ROWS);
            // Just faking some data
            for(int i = 0; i < ROWS; i++)
               for(int j = 0; j < COLS; j++)
                  arr[i][j] = i;
                
            MPI_Send(&arr[ROWS/2], ROWS/2 * COLS, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }else{    
            int (*arr)[COLS] = malloc(sizeof *arr * ROWS/2);
            MPI_Recv(arr, ROWS/2 * COLS, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            for(int i = 0; i < ROWS/2; i++){
               for(int j = 0; j < COLS; j++)
                   printf("%d ",arr[i][j]);
               printf("\n");
            }
        }
       MPI_Finalize();
       return 0;
    }
    

    Input:

    0 0 0 0 0 0 0 0 
    1 1 1 1 1 1 1 1 
    2 2 2 2 2 2 2 2 
    3 3 3 3 3 3 3 3 
    4 4 4 4 4 4 4 4 
    5 5 5 5 5 5 5 5 
    6 6 6 6 6 6 6 6 
    7 7 7 7 7 7 7 7 
    8 8 8 8 8 8 8 8 
    9 9 9 9 9 9 9 9 
    

    Output:

    5 5 5 5 5 5 5 5 
    6 6 6 6 6 6 6 6 
    7 7 7 7 7 7 7 7 
    8 8 8 8 8 8 8 8 
    9 9 9 9 9 9 9 9 
    

    To scale this approach to multiple processes, you should replace the point-to-point communication routines (i.e., MPI_Send and MPI_Recv) with the collective communication routines MPI_Scatterv:

    Scatters a buffer in parts to all processes in a communicator

    and MPI_Gatherv:

    Gathers into specified locations from all processes in a group
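
    As a rough sketch of what that could look like, reusing the ROWS, COLS, myrank and size variables from the toy example above (counts and displacements are given in numbers of elements; again, not production-ready):

    // Only the root owns the full matrix; the other ranks pass NULL (their sendbuf is ignored).
    int (*arr)[COLS] = NULL;
    if (myrank == 0) {
        arr = malloc(sizeof *arr * ROWS);
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                arr[i][j] = i;
    }

    // Split the ROWS rows as evenly as possible over the size processes.
    int counts[size], displs[size], offset = 0;
    for (int p = 0; p < size; p++) {
        int rows_p = ROWS / size + (p < ROWS % size ? 1 : 0); // first ranks get one extra row
        counts[p] = rows_p * COLS;                            // counts/displs are in elements
        displs[p] = offset;
        offset += counts[p];
    }

    int (*local)[COLS] = malloc(sizeof *local * (counts[myrank] / COLS));

    // Every rank, root included, receives its own block of rows.
    MPI_Scatterv(arr, counts, displs, MPI_INT,
                 local, counts[myrank], MPI_INT, 0, MPI_COMM_WORLD);

    /* ... work on the local rows ... */

    // Gather the (possibly modified) rows back into arr on the root.
    MPI_Gatherv(local, counts[myrank], MPI_INT,
                arr, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    free(local);
    free(arr); // free(NULL) is a no-op on the non-root ranks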