I am encountering a problem when using MPI_Send and MPI_Recv: when the count is <= 64 the program runs without any problem, while for count > 64 it hangs. Is there any solution to this? The buffers are in global memory on the two GPUs.

Here is the code I use. When I set n <= 64 it works; otherwise, it hangs.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    float *d_msg;                 /* device buffer allocated with cudaMalloc */
    int myrank, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    const int n = 65;             // <-- number of floats
    const int num_GPUs = 2;

    cudaMalloc((void**)&d_msg, n * sizeof(float));

    /* Each rank sends to the next rank, then receives from the previous one. */
    MPI_Send(d_msg, n, MPI_FLOAT, (myrank + 1) % num_GPUs, tag, MPI_COMM_WORLD);
    MPI_Recv(d_msg, n, MPI_FLOAT, (myrank - 1 + num_GPUs) % num_GPUs, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
MPI_Send is a blocking call. Your processes both sit in MPI_Send waiting for the other to call MPI_Recv. For small messages MPI_Send can return before a matching receive is posted (the eager protocol), which is why it works with <= 64 elements.

Possible solutions are:

- post MPI_Send and MPI_Recv in alternating order on the communicating ranks (e.g. even ranks send first, odd ranks receive first)
- use MPI_Sendrecv
- use non-blocking communication (MPI_Isend/MPI_Irecv); a sketch of this variant is shown after the MPI_Sendrecv example below

The easiest here is probably to just use MPI_Sendrecv and replace the MPI_Send and MPI_Recv calls with
MPI_Sendrecv(d_msg, n, MPI_FLOAT, (myrank + 1)%num_GPUs, tag,
             d_msg, n, MPI_FLOAT, (myrank - 1 + num_GPUs)%num_GPUs, tag,
             MPI_COMM_WORLD, &status);
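For completeness, here is a minimal sketch of the non-blocking variant from the list above, assuming a CUDA-aware MPI build (the original code already passes device pointers to MPI). The buffer names d_send and d_recv are illustrative, not from the original post; two buffers are used because MPI requires the send and receive buffers of a single operation to be disjoint.

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    float *d_send, *d_recv;       /* separate device buffers for send and receive */
    int myrank, tag = 99;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    const int n = 65;             /* number of floats; works for any count */
    const int num_GPUs = 2;

    cudaSetDevice(myrank % num_GPUs);  /* assumption: one GPU per rank on this node */
    cudaMalloc((void**)&d_send, n * sizeof(float));
    cudaMalloc((void**)&d_recv, n * sizeof(float));

    /* Post the receive and the send without blocking, then wait on both. */
    MPI_Irecv(d_recv, n, MPI_FLOAT, (myrank - 1 + num_GPUs) % num_GPUs, tag,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_send, n, MPI_FLOAT, (myrank + 1) % num_GPUs, tag,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}

Because both requests are posted before either rank waits, neither rank can block the other, so the message size no longer matters.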