
MPI communication delay - does the size of the message matter?


I've been working on time measurement (benchmarking) of parallel algorithms, more specifically matrix multiplication. I'm using the following algorithm:

if(taskid==MASTER) {
  averow = NRA/numworkers;
  extra = NRA%numworkers;
  offset = 0;
  mtype = FROM_MASTER;
  /* Distribute the work: each worker gets its row offset, its row count,
     the corresponding rows of a, and the whole of b */
  for (dest=1; dest<=numworkers; dest++)
  {
     rows = (dest <= extra) ? averow+1 : averow;    
     MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
     MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
     MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,MPI_COMM_WORLD);
     MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
     offset = offset + rows;
  }
  mtype = FROM_WORKER;
  /* Collect the computed rows of c from each worker */
  for (i=1; i<=numworkers; i++)
  {
     source = i;
     MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
     MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
     MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, 
              MPI_COMM_WORLD, &status);
     printf("Resultados recebidos do processo %d\n",source);
  }
}

else {
  mtype = FROM_MASTER;
  MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

  /* Multiply the received rows of a by b to produce the corresponding rows of c */
  for (k=0; k<NCB; k++)
     for (i=0; i<rows; i++)
     {
        c[i][k] = 0.0;
        for (j=0; j<NCA; j++)
           c[i][k] = c[i][k] + a[i][j] * b[j][k];
     }
  mtype = FROM_WORKER;
  MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
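
A minimal, self-contained sketch of how one phase of such a run can be timed with MPI_Wtime (this is not taken from the program above; the dummy loop just stands in for the real communication or computation being measured):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);

   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   double t0 = MPI_Wtime();
   /* phase being measured: here only a dummy computation */
   volatile double s = 0.0;
   for (long i = 0; i < 10000000L; i++)
      s += (double)i;
   double elapsed = MPI_Wtime() - t0;

   printf("rank %d: phase took %.6f s (checksum %.1f)\n", rank, elapsed, s);

   MPI_Finalize();
   return 0;
}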

I noticed that, for square matrices, it took less time than for rectangular ones. For example, if I use 4 nodes (one as master) with A and B both 500x500, each node performs about 41.5 million iterations, whereas if A is 2400000x6 and B is 6x6, each node performs 28.8 million iterations. Although the second case requires fewer iterations, it took about 1.00 s, while the first took only about 0.46 s.

Logically, the second case should be faster, considering it has fewer iterations per node. Doing some math, I realized that MPI sends and receives about 83,000 elements per message in the first case and 4,800,000 elements per message in the second.
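
To make that arithmetic explicit, here is a small standalone sketch (the constants simply reproduce the two cases above; the helper report is only for illustration):

#include <stdio.h>

/* Per-worker work and message sizes for an NRAxNCA times NCAxNCB multiply
   split row-wise across numworkers workers (ignoring the leftover rows). */
static void report(long nra, long nca, long ncb, long numworkers)
{
   long rows    = nra / numworkers;   /* rows handled by one worker         */
   long iters   = rows * nca * ncb;   /* multiply-add iterations per worker */
   long a_chunk = rows * nca;         /* doubles sent in one message of a   */
   long b_full  = nca * ncb;          /* doubles sent for the whole of b    */
   printf("rows=%ld  iterations=%ld  a-chunk=%ld  b=%ld\n",
          rows, iters, a_chunk, b_full);
}

int main(void)
{
   report(500, 500, 500, 3);     /* case 1: ~41.5M iterations, 83,000 doubles per a-message    */
   report(2400000, 6, 6, 3);     /* case 2: 28.8M iterations, 4,800,000 doubles per a-message  */
   return 0;
}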

Does the size of the message justify the delay?


Solution

  • The size of messages sent over MPI will definitely affect the performance of your code. Take a look at these graphs posted on the webpage of one of the popular MPI implementations.

    As you can see in the first graph, the latency of communication increases with message size. This trend applies to any network, not just the InfiniBand network shown in that graph. You can also measure it on your own system with a simple ping-pong test, as sketched below.
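
    A minimal ping-pong sketch (not the benchmark behind the graphs linked above) that measures point-to-point transfer time as a function of message size; run it with exactly two ranks, e.g. mpirun -np 2 ./pingpong:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   int rank, reps = 100;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   for (int n = 1; n <= 4800000; n *= 8) {          /* message size in doubles */
      double *buf = calloc(n, sizeof(double));
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int r = 0; r < reps; r++) {
         if (rank == 0) {
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
         }
      }
      double t = (MPI_Wtime() - t0) / (2.0 * reps); /* average one-way time */
      if (rank == 0)
         printf("%9d doubles: %.6f s one-way\n", n, t);
      free(buf);
   }

   MPI_Finalize();
   return 0;
}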