
Using MPI_Iallgatherv without knowing in advance the number of elements that must be received by each process


In the function MPI_Iallgatherv

int MPI_Iallgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, const int recvcounts[], const int displs[],
                MPI_Datatype recvtype, MPI_Comm comm, MPI_Request *request)

the argument recvcounts[] is an input argument, so, as far as I understand, the program must know in advance how many elements will be received from each process. It is possible to set values larger than the number of elements that are actually sent, as in the following example:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if(size != 3) {
    printf("This application must run with 3 MPI processes.\n");
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }
  
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int counts[3] = {1000, 1000, 1000}; // Max n. of elems that can be received
  int displs[3] = {0, 1000, 2000};  // Displacements
  int buffer[3000]; // Receiving buffer

  int send_values[1000]; // Buffer containing the data to send
  int send_count;        // Number of elements this rank actually sends

  switch(rank) {
  case 0:
    {
      send_count = 1;
      send_values[0] = 3;
      break;
    }
  case 1:
    {
      send_count = 2;
      send_values[0] = 1;
      send_values[1] = 4;
      break;
    }
  case 2:
    {
      send_count = 3;
      send_values[0] = 1;
      send_values[1] = 5;
      send_values[2] = 9;
      break;
    }
  }
    
  MPI_Request recv_mpi_request;
 
  MPI_Iallgatherv(send_values, send_count, MPI_INT, buffer, counts, displs,
          MPI_INT, MPI_COMM_WORLD, &recv_mpi_request);

  MPI_Status status;
  MPI_Wait(&recv_mpi_request, &status);
  
  printf("Values gathered on process %d:", rank);
  for(int i=0; i<3; i++) {
    for(int j=0; j<3; j++) {
      printf(" %d", buffer[i*1000+j]);
    }
    printf("\t");
  }
  printf("\n");
  
  MPI_Finalize();

  return EXIT_SUCCESS;
}

However, I cannot find an elegant way to retrieve the number of elements that are actually sent by each process, and I also wonder whether the communication time is affected by using predefined counts much larger than the actual values.

I know that in principle I could use two separate MPI communications: one to gather the counts, with one entry per process, and a second one to transfer the data with the previously obtained sizes. However, each MPI transfer has a time overhead that is independent of the size and that is not negligible for the application I am working on. Is there a reasonable way to get the actual counts with a single call to MPI_Iallgatherv? Can someone help me?


Solution

As Gilles noted in the comment, all forms of MPI Allgather require the counts.

The solution for the blocking case is to first do MPI_Allgather on the counts, then MPI_Allgatherv with the resulting count vector.
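
For example, a minimal sketch of this two-phase approach, reusing send_count and send_values from the program above (the surrounding setup is assumed):

int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

// Phase 1: gather the per-process counts (one int from each rank).
int counts[3];
MPI_Allgather(&send_count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

// Build packed displacements from the gathered counts.
int displs[3];
displs[0] = 0;
for (int i = 1; i < size; i++)
    displs[i] = displs[i - 1] + counts[i - 1];

// Phase 2: gather the data itself with the exact counts.
int buffer[3000];
MPI_Allgatherv(send_values, send_count, MPI_INT,
               buffer, counts, displs, MPI_INT, MPI_COMM_WORLD);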

I don't know of a fully nonblocking implementation that doesn't involve generalized requests and the requisite progress thread. If you are willing to spawn a Pthread, you can put an allgather + allgatherv implementation in the generalized-request callback.
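
A rough, untested skeleton of that idea (the struct and function names are illustrative; it assumes MPI was initialized with MPI_THREAD_MULTIPLE and that comm is a communicator dedicated to this operation):

#include <pthread.h>
#include <stdlib.h>

typedef struct {            // hypothetical per-operation state
    const int *sendbuf;
    int sendcount;
    int *recvbuf;
    MPI_Comm comm;          // ideally a duplicated communicator
    MPI_Request grequest;
} iallgatherv_state;

static int query_fn(void *extra, MPI_Status *st)
{ (void)extra; MPI_Status_set_cancelled(st, 0); return MPI_SUCCESS; }
static int free_fn(void *extra) { (void)extra; return MPI_SUCCESS; }
static int cancel_fn(void *extra, int complete)
{ (void)extra; (void)complete; return MPI_SUCCESS; }

static void *progress_thread(void *arg)
{
    iallgatherv_state *s = arg;
    int size;
    MPI_Comm_size(s->comm, &size);
    int *counts = malloc(size * sizeof *counts);
    int *displs = malloc(size * sizeof *displs);

    // The blocking two-phase exchange runs here, off the caller's path.
    MPI_Allgather(&s->sendcount, 1, MPI_INT, counts, 1, MPI_INT, s->comm);
    displs[0] = 0;
    for (int i = 1; i < size; i++)
        displs[i] = displs[i - 1] + counts[i - 1];
    MPI_Allgatherv(s->sendbuf, s->sendcount, MPI_INT,
                   s->recvbuf, counts, displs, MPI_INT, s->comm);

    free(counts);
    free(displs);
    MPI_Grequest_complete(s->grequest); // the caller's MPI_Wait can now return
    return NULL;
}

// Caller: MPI_Grequest_start(query_fn, free_fn, cancel_fn, s, &s->grequest);
//         pthread_create(&tid, NULL, progress_thread, s);
//         ... later: MPI_Wait(&s->grequest, MPI_STATUS_IGNORE);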

If you were interested in MPI_Igatherv without the input counts, you could have all the non-root processes do an Isend and then an Ibcast, and have the root loop over (M)Probe + (M)Recv, then Ibcast the results back to the sources. This won't be particularly efficient, and you are almost certainly better off doing the count gather first.
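
A sketch of the root-side loop for that last variant (tag 0 and the scratch layout are illustrative), using MPI_Probe and MPI_Get_count to size each message on arrival:

int counts[3] = {0};            // counts[root] handled separately
int slots[3][1000];             // per-rank scratch space, 1000 ints max each
for (int i = 1; i < size; i++) {
    MPI_Status st;
    MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);

    int n;
    MPI_Get_count(&st, MPI_INT, &n);   // actual number of MPI_INTs sent
    counts[st.MPI_SOURCE] = n;

    MPI_Recv(slots[st.MPI_SOURCE], n, MPI_INT,
             st.MPI_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
// ...then broadcast counts (and the gathered data) back to the senders.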