Hybrid OpenMP+MPI: I need an explanation of this example


I found this example on the internet, but I can't understand exactly what is sent from the master node. If it's A[5], for example, what will be sent to the other slaves? The 5th row, all elements up to the 5th row, or all elements from the 5th row onward?

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

#define TAG 13

int main(int argc, char *argv[]) {
  double **A, **B, **C, *tmp;
  double startTime, endTime;
  int numElements, offset, stripSize, myrank, numnodes, N, i, j, k;
  int numThreads, chunkSize = 10;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &numnodes);

  N = atoi(argv[1]);
  numThreads = atoi(argv[2]);  // difference from MPI: how many threads/rank?

  omp_set_num_threads(numThreads);  // OpenMP call to set threads per rank

  if (myrank == 0) {
    tmp = (double *) malloc (sizeof(double ) * N * N);
    A = (double **) malloc (sizeof(double *) * N);
    for (i = 0; i < N; i++)
      A[i] = &tmp[i * N];
  }
  else {
    tmp = (double *) malloc (sizeof(double ) * N * N / numnodes);
    A = (double **) malloc (sizeof(double *) * N / numnodes);
    for (i = 0; i < N / numnodes; i++)
      A[i] = &tmp[i * N];
  }


  tmp = (double *) malloc (sizeof(double ) * N * N);
  B = (double **) malloc (sizeof(double *) * N);
  for (i = 0; i < N; i++)
    B[i] = &tmp[i * N];


  if (myrank == 0) {
    tmp = (double *) malloc (sizeof(double ) * N * N);
    C = (double **) malloc (sizeof(double *) * N);
    for (i = 0; i < N; i++)
      C[i] = &tmp[i * N];
  }
  else {
    tmp = (double *) malloc (sizeof(double ) * N * N / numnodes);
    C = (double **) malloc (sizeof(double *) * N / numnodes);
    for (i = 0; i < N / numnodes; i++)
      C[i] = &tmp[i * N];
  }

  if (myrank == 0) {
    // initialize A and B
    for (i=0; i<N; i++) {
      for (j=0; j<N; j++) {
        A[i][j] = 1.0;
        B[i][j] = 1.0;
      }
    }
  }

  // start timer
  if (myrank == 0) {
    startTime = MPI_Wtime();
  }

  stripSize = N/numnodes;

  // send each node its piece of A -- note could be done via MPI_Scatter
  if (myrank == 0) {
    offset = stripSize;
    numElements = stripSize * N;
    for (i=1; i<numnodes; i++) {

I can't understand the send on the line below:

      MPI_Send(A[offset], numElements, MPI_DOUBLE, i, TAG, MPI_COMM_WORLD);
      offset += stripSize;
    }
  }
  else {  // receive my part of A

And here also, with the matching receive:

    MPI_Recv(A[0], stripSize * N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

Same question for the broadcast: what exactly will be sent from B?

  // everyone gets B
  MPI_Bcast(B[0], N*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  // Let each process initialize C to zero 
  for (i=0; i<stripSize; i++) {
    for (j=0; j<N; j++) {
      C[i][j] = 0.0;
    }
  }

  // do the work---this is the primary difference from the pure MPI program
  #pragma omp parallel for shared(A,B,C,numThreads) private(i,j,k) schedule (static, chunkSize)
  for (i=0; i<stripSize; i++) {
    for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }

  // master receives from workers  -- note could be done via MPI_Gather
  if (myrank == 0) {
    offset = stripSize; 
    numElements = stripSize * N;
    for (i=1; i<numnodes; i++) {
      MPI_Recv(C[offset], numElements, MPI_DOUBLE, i, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      offset += stripSize;
    }
  }
  else { // send my contribution to C
    MPI_Send(C[0], stripSize * N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD);
  }
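
  // stop timer and report -- this step is missing from the posted code
  // (startTime/endTime are declared above but the clock is never read
  // at the end); a minimal plausible completion:
  if (myrank == 0) {
    endTime = MPI_Wtime();
    printf("Elapsed time: %f seconds\n", endTime - startTime);
  }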



  MPI_Finalize();
  return 0;
}

Solution

  • A, B, and C are dynamic 2D arrays built with the pointer-of-pointers approach: one contiguous block of N*N elements (tmp) plus an array of row pointers into it. The indexing is A[row][col]; omitting the last index gives you the address of the first element (column zero) of that row. This is useful because you can pass a single row by handing over that address together with the "width" of the matrix. This is how the 2D arrays are stored in memory:

    (figure: the row-pointer array A[0..N-1], each entry pointing at offset i*N inside the single contiguous tmp block)
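
    A minimal sketch of that layout (hypothetical names, not taken from the program above) shows that consecutive rows sit back-to-back inside the single tmp block:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int N = 4;                       /* small matrix just for illustration */

        /* one contiguous block of N*N doubles ... */
        double *tmp = malloc(sizeof(double) * N * N);
        /* ... plus an array of N row pointers into that block */
        double **A = malloc(sizeof(double *) * N);
        for (int i = 0; i < N; i++)
            A[i] = &tmp[i * N];

        /* A[i] is the address of row i; consecutive rows are N doubles apart */
        printf("A[0] = %p\n", (void *) A[0]);
        printf("A[1] = %p\n", (void *) A[1]);
        printf("A[1] - A[0] = %td doubles\n", A[1] - A[0]);  /* prints 4 */

        free(A);
        free(tmp);
        return 0;
    }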

    Send the 5th row from matrix A with a total of num_cols columns:

    MPI_Send(A[5], num_cols, MPI_DOUBLE, ...);
    

    Equivalently, you could write &A[5][0] to access that same address, but it's just messier.
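
    As a self-contained sketch of that send (run with two ranks; the sizes and names here are illustrative, not from the question's program), sending A[5] ships exactly the N contiguous doubles of row 5 and nothing else:

    #include <assert.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        const int N = 8;                 /* illustrative matrix width */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double *tmp = malloc(sizeof(double) * N * N);
            double **A = malloc(sizeof(double *) * N);
            for (int i = 0; i < N; i++) A[i] = &tmp[i * N];
            for (int j = 0; j < N; j++) A[5][j] = (double) j;

            /* A[5] and &A[5][0] are the same address: row 5, column 0 */
            assert(A[5] == &A[5][0]);

            /* send exactly one row: the N contiguous doubles starting at A[5] */
            MPI_Send(A[5], N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            free(A);
            free(tmp);
        } else if (rank == 1) {
            double row[8];               /* N doubles: row 5 lands here */
            MPI_Recv(row, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }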

    Furthermore, sending the full 2D matrix is just as easy, because the whole matrix was allocated as one contiguous N*N block (tmp), so the rows sit back-to-back in memory. Just pass the first row B[0] (which is also the address of element [0][0]) together with the linearized length N*N (assuming a square matrix).

    Send a full N*N square matrix B:

    MPI_Bcast(B[0], N*N, MPI_DOUBLE, ...);
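
    Finally, the comments in the question's code note that the hand-rolled send/receive loops "could be done via MPI_Scatter" and "via MPI_Gather". A hedged sketch of that drop-in replacement (not a standalone program), assuming, as the original does, that N is divisible by numnodes and that each worker's A and C hold stripSize rows:

    // distribute A: rank 0 keeps its first strip in place, workers receive theirs
    if (myrank == 0)
      MPI_Scatter(A[0], stripSize * N, MPI_DOUBLE,
                  MPI_IN_PLACE, stripSize * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    else
      MPI_Scatter(NULL, 0, MPI_DOUBLE,
                  A[0], stripSize * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // ... each rank computes its strip of C exactly as before ...

    // collect C: rank 0's strip is already in place, workers send theirs back
    if (myrank == 0)
      MPI_Gather(MPI_IN_PLACE, stripSize * N, MPI_DOUBLE,
                 C[0], stripSize * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    else
      MPI_Gather(C[0], stripSize * N, MPI_DOUBLE,
                 NULL, 0, MPI_DOUBLE, 0, MPI_COMM_WORLD);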