During the realization of the course work I have to write MPI program to solve PDE continuum mechanics. (FORTRAN)
In the sequence program file is written as follows:
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
In the parallel program, I write the same as follows: / parallelization takes place only along the axis X /
call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ),INT(YY), INT(XX)], [5,ZZ,YY,PDB(iam).Xelements], [0, 0, 0, PDB(iam).Xoffset], MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)
do i = 1,PDB(iam).Xelements
do j = 1,YY
do k = 1,ZZ
dataTmp(1,k,j,i) = R(i,j,k)
dataTmp(2,k,j,i) = U(i,j,k)
dataTmp(3,k,j,i) = V(i,j,k)
dataTmp(4,k,j,i) = W(i,j,k)
dataTmp(5,k,j,i) = P(i,j,k)
end do
end do
end do
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
It works well. But I'm not sure about using an array dataTmp. What solution will be faster and more correct? What about using 4D array like the dataTmp in the whole program? Or, maybe, I should create 5 special mpi_types with different displacemet.
If I/O speed is an issue and you have the option, I'd suggest changing the file format - or alternately, how the data is laid out in memory - to be more closely lined up: in the serial code, writing data in this transposed and interleaved way is going to be very slow:
program testoutput
implicit none
integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k
allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))
R = 1.; U = 2.; V = 3.; W = 4.; P = 5.
open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for interleaved: ', elapsed
open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for noninterleaved: ', elapsed
deallocate(R,U,V,W,P)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program testoutput
Running gives
$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial
Elapsed time for interleaved: 225.755005
Elapsed time for noninterleaved: 4.01700020
As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.
You can definitely avoid the dataTmp array by writing one or five individual data types to do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more of a balance in between in terms of memory usage and performance. You won't be explicitly defining a big 3-D array, but the MPI-IO code will improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is being set aside to do the writing efficiently. The good news is that the balance will be tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.