arrays mpi multidimensional-array mpi-io

MPI collective output 5 noncontiguous 3D arrays in special form

During the realization of the course work I have to write MPI program to solve PDE continuum mechanics. (FORTRAN)

In the sequence program file is written as follows:

do i=1,XX
    do j=1,YY
        do k=1,ZZ
            write(ifile) R(i,j,k)
            write(ifile) U(i,j,k)
            write(ifile) V(i,j,k)
            write(ifile) W(i,j,k)
            write(ifile) P(i,j,k)
        end do
    end do
end do

In the parallel program, I write the same as follows: / parallelization takes place only along the axis X /

call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ),INT(YY), INT(XX)], [5,ZZ,YY,PDB(iam).Xelements], [0, 0, 0, PDB(iam).Xoffset], MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)   

call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)

do i = 1,PDB(iam).Xelements
    do j = 1,YY
        do k = 1,ZZ
            dataTmp(1,k,j,i) = R(i,j,k)
            dataTmp(2,k,j,i) = U(i,j,k)
            dataTmp(3,k,j,i) = V(i,j,k)
            dataTmp(4,k,j,i) = W(i,j,k)
            dataTmp(5,k,j,i) = P(i,j,k)
        end do
    end do
end do

call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

It works well. But I'm not sure about using an array dataTmp. What solution will be faster and more correct? What about using 4D array like the dataTmp in the whole program? Or, maybe, I should create 5 special mpi_types with different displacemet.

Solution

If I/O speed is an issue and you have the option, I'd suggest changing the file format - or alternately, how the data is laid out in memory - to be more closely lined up: in the serial code, writing data in this transposed and interleaved way is going to be very slow:

program testoutput
implicit none

integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k

allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))

R = 1.; U = 2.; V = 3.; W = 4.; P = 5.

open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
    do j=1,YY
        do k=1,ZZ
            write(ifile) R(i,j,k)
            write(ifile) U(i,j,k)
            write(ifile) V(i,j,k)
            write(ifile) W(i,j,k)
            write(ifile) P(i,j,k)
        end do
    end do
end do
elapsed=tock(timer)
close(ifile)

print *,'Elapsed time for interleaved: ', elapsed

open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)

print *,'Elapsed time for noninterleaved: ', elapsed

deallocate(R,U,V,W,P)
contains

subroutine tick(t)
    integer, intent(OUT) :: t

    call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t 
real function tock(t)
    integer, intent(in) :: t
    integer :: now, clock_rate

    call system_clock(now,clock_rate)

    tock = real(now - t)/real(clock_rate)
end function tock

end program testoutput

Running gives

$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial 
 Elapsed time for interleaved:    225.755005    
 Elapsed time for noninterleaved:    4.01700020

As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.

You can definitely avoid the dataTmp array by writing one or five individual data types to do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more of a balance in between in terms of memory usage and performance. You won't be explicitly defining a big 3-D array, but the MPI-IO code will improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is being set aside to do the writing efficiently. The good news is that the balance will be tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.