I am encountering a problem when using MPI_Send and MPI_Recv: when the count is <= 64 the program runs without any problem, while for count > 64 it hangs. Is there any solution to this? The buffers are in global memory on the two GPUs.

Here is the code I use. When I set n <= 64 it works; otherwise, it hangs.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    float *d_msg;                 /* device buffer allocated with cudaMalloc */
    int myrank, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    const int n = 65;             // <-- number of floats
    const int num_GPUs = 2;

    cudaMalloc((void**)&d_msg, n * sizeof(float));

    /* Each rank sends to the next rank, then receives from the previous one. */
    MPI_Send(d_msg, n, MPI_FLOAT, (myrank + 1) % num_GPUs, tag, MPI_COMM_WORLD);
    MPI_Recv(d_msg, n, MPI_FLOAT, (myrank - 1 + num_GPUs) % num_GPUs, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
MPI_Send is a blocking call. Your processes both sit in MPI_Send waiting for the other to call MPI_Recv. For small messages MPI_Send can return before a matching receive is posted (the eager protocol), which is why it works with <= 64 elements.

Possible solutions are:

- post MPI_Send and MPI_Recv in alternating order on the communicating ranks (e.g. even ranks send first, odd ranks receive first)
- use MPI_Sendrecv
- use non-blocking communication (MPI_Isend/MPI_Irecv); a sketch of this variant is shown after the MPI_Sendrecv example below

The easiest here is probably to just use MPI_Sendrecv and replace the MPI_Send and MPI_Recv calls with
MPI_Sendrecv(d_msg, n, MPI_FLOAT, (myrank + 1)%num_GPUs, tag,
             d_msg, n, MPI_FLOAT, (myrank - 1 + num_GPUs)%num_GPUs, tag,
             MPI_COMM_WORLD, &status);
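For completeness, here is a minimal sketch of the non-blocking variant from the list above, assuming a CUDA-aware MPI build (the original code already passes device pointers to MPI). The buffer names d_send and d_recv are illustrative, not from the original post; two buffers are used because MPI requires the send and receive buffers of a single operation to be disjoint.

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    float *d_send, *d_recv;       /* separate device buffers for send and receive */
    int myrank, tag = 99;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    const int n = 65;             /* number of floats; works for any count */
    const int num_GPUs = 2;

    cudaSetDevice(myrank % num_GPUs);  /* assumption: one GPU per rank on this node */
    cudaMalloc((void**)&d_send, n * sizeof(float));
    cudaMalloc((void**)&d_recv, n * sizeof(float));

    /* Post the receive and the send without blocking, then wait on both. */
    MPI_Irecv(d_recv, n, MPI_FLOAT, (myrank - 1 + num_GPUs) % num_GPUs, tag,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_send, n, MPI_FLOAT, (myrank + 1) % num_GPUs, tag,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}

Because both requests are posted before either rank waits, neither rank can block the other, so the message size no longer matters.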