
How do I use MPI to send the correct number of derived-type objects?


I have some experience with MPI, but not with some of the more advanced aspects like derived types, which is what my question relates to.

The code I am working on has several arrays dimensioned (-1:nx+2,-1:ny+2,-1:nz+2). To be clear, each process has its own values of nx, ny, and nz. The arrays overlap between neighbouring processes: for instance, x(:,:,-1:2) on one proc holds the same information as x(:,:,nz-1:nz+2) on the proc just "below" it.
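For reference, a simplified sketch of the layout on each process (the real code has several such arrays):

real, allocatable :: x(:,:,:)
allocate(x(-1:nx+2, -1:ny+2, -1:nz+2))
! x(:,:,-1:2) duplicates x(:,:,nz-1:nz+2) of the proc just "below"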

A derived type, cell_zface, has been defined:

idir = 3
sizes = (/nx_glb, ny_glb, nz_glb/)   !These nums are the same for all procs.
subsizes = (/nx, ny, 2/)
mpitype = MPI_DATATYPE_NULL
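! starts(:) (the subarray offsets) is set elsewhere in the code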
CALL MPI_TYPE_CREATE_SUBARRAY(3, sizes, subsizes, starts, &
    MPI_ORDER_FORTRAN, mpireal, mpitype, errcode)
CALL MPI_TYPE_COMMIT(mpitype, errcode)
cell_zface = mpitype

Now, this derived type gets used, successfully, in several MPI_SENDRECV calls. For example

CALL MPI_SENDRECV( &
        x(-1,-1,   1), 1, cell_zface, proc_z_min, tag, &
        x(-1,-1,nz+1), 1, cell_zface, proc_z_max, tag, &
        comm, status, errcode)

As I understand it, this call is sending and receiving two "horizontal" slices (i.e. x-y slices) of the array between procs.

I want to do something a little different, namely sending four "horizontal" slices. So I try

call mpi_send(x(-1,-1,nz-1), 2, cell_zface,    &
                  proc_z_max, rank, comm, mpierr)

with an accompanying receive.

And finally, my problem: The code runs, but erroneously. AFAICT, this sends only two horizontal slices, even though I use "2" instead of "1" as the count argument. I can fix this by making two calls to mpi_send:

call mpi_send(x(-1,-1,nz-1), 1, cell_zface,    &
                  proc_z_max, rank, comm, mpierr)
call mpi_send(x(-1,-1,nz+1), 1, cell_zface,    &
                  proc_z_max, rank, comm, mpierr)

with accompanying receives, but this is certainly not pretty.

So, why does the mpi_send send only two horizontal slices, even though I set the count argument to "2"? And is there a clean way to do what I want to do here?


Solution

  • Every MPI datatype has two sizes, so to speak. One is the true size, i.e. the amount of memory it takes to store all the significant data referred to by the datatype. One can think of it as the amount of space that an element of that datatype takes in the actual message.

    Another size is the so-called extent. Each datatype in MPI is a collection of instructions of the type: "go to offset disp_i from the provided buffer location and read/write an element of basic type type_i". The set of all (type_i, disp_i) pairs is called the type map of the datatype. The minimum offset is called the lower bound, and the maximum offset plus the size of the basic type at that offset plus any padding needed is called the upper bound. The extent of a datatype is the difference between the upper bound and the lower bound, and it gives the size of the shortest contiguous memory region that includes all locations accessed by the datatype.

    As MPI mandates that no memory location is read from or written to more than once during any communication operation, the pairs in the type map have to refer to disjoint locations. Therefore, the true extent of a datatype is always greater than or equal to its size.
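    To make these two numbers concrete, here is a small self-contained toy (my own example, not taken from your code) that builds a strided vector type and queries both values; the numbers in the comments assume a 4-byte REAL:

    program extent_demo
      use mpi
      implicit none
      integer :: vtype, vsize, errcode
      integer(kind=MPI_ADDRESS_KIND) :: vlb, vextent

      call MPI_INIT(errcode)
      ! 3 blocks of 1 REAL each, with a stride of 4 REALs between block starts
      call MPI_TYPE_VECTOR(3, 1, 4, MPI_REAL, vtype, errcode)
      call MPI_TYPE_COMMIT(vtype, errcode)
      call MPI_TYPE_SIZE(vtype, vsize, errcode)              ! 3 REALs -> 12 bytes
      call MPI_TYPE_GET_EXTENT(vtype, vlb, vextent, errcode) ! lb = 0, extent = 36 bytes
      print *, 'size =', vsize, ' lb =', vlb, ' extent =', vextent
      call MPI_TYPE_FREE(vtype, errcode)
      call MPI_FINALIZE(errcode)
    end program extent_demo

    The significant data is only three REALs (12 bytes), yet consecutive elements of vtype are laid out 36 bytes apart.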

    MPI uses the extent of the datatype when accessing consecutive elements of that datatype. The following statement:

    MPI_SEND(buf, n, dtype, ...)
    

    results in:

    • MPI takes one element of type dtype from location buf, following the rules encoded in the type map of dtype;
    • MPI takes the next element starting from location buf + extent(dtype);
    • ...
    • MPI takes the n-th element starting from location buf + (n-1)*extent(dtype).

    Primitive datatypes such as MPI_INTEGER, MPI_REAL, etc. have their extent matching the size of the basic type (INTEGER, REAL, etc.) + any padding mandated by the architecture, which makes it possible to send arrays of the basic type by simply specifying the count of elements.
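    You can check this directly: for MPI_REAL the two queries below typically return the same number (4 bytes on most platforms, with a lower bound of 0), which is exactly why a plain count walks through a contiguous array one element at a time (a fragment, with declarations only for what it uses):

    integer :: rsize, ierr
    integer(kind=MPI_ADDRESS_KIND) :: rlb, rextent

    call MPI_TYPE_SIZE(MPI_REAL, rsize, ierr)
    call MPI_TYPE_GET_EXTENT(MPI_REAL, rlb, rextent, ierr)
    ! typically rsize == rextent == 4 and rlb == 0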

    Now, back to your case. You are creating a datatype that covers an nx x ny x 2 subarray from an nx_glb x ny_glb x nz_glb array. The size of that datatype is indeed nx * ny * 2 times the size of mpireal, but the extent is actually nx_glb * ny_glb * nz_glb times the extent of mpireal. In other words:

    MPI_SEND(buf, 2, cell_zface, ...)
    

    will not extract two consecutive nx x ny x 2 slabs from the big array at buf. Rather, it will extract one slab from each of two consecutive arrays of size nx_glb x ny_glb x nz_glb, starting from location (starts(1), starts(2), starts(3)) in each array. If your program doesn't segfault when run, consider yourself lucky.
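    You can observe this mismatch in your own code by querying cell_zface right after it is committed; a diagnostic sketch using the variable names from your question:

    integer :: tsize
    integer(kind=MPI_ADDRESS_KIND) :: tlb, textent

    call MPI_TYPE_SIZE(cell_zface, tsize, errcode)          ! nx*ny*2 elements worth of mpireal
    call MPI_TYPE_GET_EXTENT(cell_zface, tlb, textent, errcode)
    ! textent spans the whole nx_glb x ny_glb x nz_glb array, i.e. the stride
    ! MPI uses between the 1st and the 2nd element when you pass count = 2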

    Now comes the tricky part. MPI allows one to give each datatype a fake extent (that's why I called the extent defined above the "true" extent) by artificially setting the values of the lower and upper bounds. Doing so does not affect the size of the datatype or its type map (i.e. MPI still goes to the same offsets and manipulates elements of the same basic types), but it does affect the strides in memory that MPI makes while accessing consecutive elements of the given datatype. In the early days of MPI, setting the extent was done by "sandwiching" the datatype in a structure between elements of the special pseudotypes MPI_LB and MPI_UB. Since MPI-2, MPI_TYPE_CREATE_RESIZED is used to achieve the same.

    integer(kind=MPI_ADDRESS_KIND) :: lb, extent
    integer :: newtype
    
    ! First obtain the extent of the base type (mpireal) used to construct cell_zface
    call MPI_TYPE_GET_EXTENT(mpireal, lb, extent, errcode)
    ! Adjust the extent of cell_zface
    extent = (nx_glb * ny_glb * subsizes(3)) * extent
    call MPI_TYPE_CREATE_RESIZED(cell_zface, lb, extent, newtype, errcode)
    call MPI_TYPE_COMMIT(newtype, errcode)
    ! Get rid of the previous type
    call MPI_TYPE_FREE(cell_zface, errcode)
    cell_zface = newtype
    

    You can now use cell_zface to send several consecutive slabs.
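    For example, the single send from your question now transfers all four x-y slices in one go. The matching receive below is my guess, based on the overlap you describe (slices -1:2 on the proc "above" mirror slices nz-1:nz+2 on the sender):

    ! two consecutive nx x ny x 2 slabs = four x-y slices in one call
    call mpi_send(x(-1,-1,nz-1), 2, cell_zface,    &
                      proc_z_max, rank, comm, mpierr)
    ! ...and on the receiving neighbour, something like:
    call mpi_recv(x(-1,-1,-1), 2, cell_zface,      &
                      proc_z_min, proc_z_min, comm, status, mpierr)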

    An alternative and presumably simpler approach is to set the size of the 3rd dimension of the array equal to the size of the 3rd dimension of the subarray when calling MPI_TYPE_CREATE_SUBARRAY. The extent of the resulting datatype then spans exactly one nx_glb x ny_glb x 2 slab (the same value set explicitly above), so no resizing step is needed:

    idir = 3
    subsizes = (/nx, ny, 2/)
    sizes = (/nx_glb, ny_glb, subsizes(3)/)   !These nums are the same for all procs.
    mpitype = MPI_DATATYPE_NULL
    CALL MPI_TYPE_CREATE_SUBARRAY(3, sizes, subsizes, starts, &
        MPI_ORDER_FORTRAN, mpireal, mpitype, errcode)
    CALL MPI_TYPE_COMMIT(mpitype, errcode)
    cell_zface = mpitype
    

    In both cases I assume that starts(3) is equal to 0.