
MPI collective output of 5 noncontiguous 3D arrays in special form


As part of my coursework I have to write an MPI program (in Fortran) that solves a PDE from continuum mechanics.

In the sequential program, the file is written as follows:

do i=1,XX
    do j=1,YY
        do k=1,ZZ
            write(ifile) R(i,j,k)
            write(ifile) U(i,j,k)
            write(ifile) V(i,j,k)
            write(ifile) W(i,j,k)
            write(ifile) P(i,j,k)
        end do
    end do
end do

In the parallel program I write the same data as follows (the domain is decomposed only along the X axis):

call MPI_TYPE_CREATE_SUBARRAY(4, [5, ZZ, YY, XX],                &
                              [5, ZZ, YY, PDB(iam)%Xelements],   &
                              [0, 0, 0, PDB(iam)%Xoffset],       &
                              MPI_ORDER_FORTRAN, MPI_REAL, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)   

call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)

do i = 1,PDB(iam)%Xelements
    do j = 1,YY
        do k = 1,ZZ
            dataTmp(1,k,j,i) = R(i,j,k)
            dataTmp(2,k,j,i) = U(i,j,k)
            dataTmp(3,k,j,i) = V(i,j,k)
            dataTmp(4,k,j,i) = W(i,j,k)
            dataTmp(5,k,j,i) = P(i,j,k)
        end do
    end do
end do

call MPI_FILE_SET_VIEW(ifile, offset, MPI_REAL, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam)%Xelements*YY*ZZ, MPI_REAL, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

It works well, but I'm not sure about the temporary array dataTmp. Which solution would be faster and more correct? Should I use a 4D array like dataTmp throughout the whole program? Or should I instead create five special MPI types with different displacements?


Solution

  • If I/O speed is an issue and you have the option, I'd suggest changing the file format - or, alternatively, how the data is laid out in memory - so that the two line up more closely: in the serial code, writing the data in this transposed and interleaved way is going to be very slow:

    program testoutput
    implicit none
    
    integer, parameter :: XX=512, YY=512, ZZ=512
    real, dimension(:,:,:), allocatable :: R, U, V, W, P
    integer :: timer
    integer :: ifile
    real :: elapsed
    integer :: i,j,k
    
    allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
    allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))
    
    R = 1.; U = 2.; V = 3.; W = 4.; P = 5.
    
    open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
    call tick(timer)
    do i=1,XX
        do j=1,YY
            do k=1,ZZ
                write(ifile) R(i,j,k)
                write(ifile) U(i,j,k)
                write(ifile) V(i,j,k)
                write(ifile) W(i,j,k)
                write(ifile) P(i,j,k)
            end do
        end do
    end do
    elapsed=tock(timer)
    close(ifile)
    
    print *,'Elapsed time for interleaved: ', elapsed
    
    open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
    call tick(timer)
    write(ifile) R
    write(ifile) U
    write(ifile) V
    write(ifile) W
    write(ifile) P
    elapsed=tock(timer)
    close(ifile)
    
    print *,'Elapsed time for noninterleaved: ', elapsed
    
    deallocate(R,U,V,W,P)
    contains
    
    subroutine tick(t)
        integer, intent(OUT) :: t
    
        call system_clock(t)
    end subroutine tick
    
    ! returns time in seconds from now to time described by t 
    real function tock(t)
        integer, intent(in) :: t
        integer :: now, clock_rate
    
        call system_clock(now,clock_rate)
    
        tock = real(now - t)/real(clock_rate)
    end function tock
    
    end program testoutput
    

    Running gives

    $ gfortran -Wall io-serial.f90 -o io-serial
    $ ./io-serial 
     Elapsed time for interleaved:    225.755005    
     Elapsed time for noninterleaved:    4.01700020 
    

    As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the I/O is going to get.

    You can definitely avoid the dataTmp array by creating one or five individual MPI datatypes that do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That gives you more of a middle ground between memory usage and performance: you won't be explicitly allocating a big 4-D array, but the MPI-IO code will still outperform looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is set aside to do the writing efficiently. The good news is that this balance is tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now. A sketch of the five-type variant follows below.
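
    As a rough illustration of the five-type approach - a minimal sketch under stated assumptions, not tested code from this answer - each variable gets its own filetype that pins the component index in the global (5, ZZ, YY, XX) file layout, while a single memory datatype walks R(i,j,k) in the file's traversal order (k fastest, then j, then i), so no dataTmp copy is needed. Here Xel and Xoff are hypothetical stand-ins for PDB(iam)%Xelements and PDB(iam)%Xoffset:

    integer, parameter :: nvars = 5
    integer :: filetype(nvars), rowtype, planetype, memtype, v, ierr
    integer(kind=MPI_ADDRESS_KIND) :: lb, sizeofreal
    integer(kind=MPI_OFFSET_KIND)  :: disp
    
    call MPI_TYPE_GET_EXTENT(MPI_REAL, lb, sizeofreal, ierr)
    
    ! File side: variable v occupies the slab of the global
    ! (nvars, ZZ, YY, XX) array with the first index fixed at v.
    do v = 1, nvars
        call MPI_TYPE_CREATE_SUBARRAY(4, [nvars, ZZ, YY, XX],  &
                                      [1, ZZ, YY, Xel],        &
                                      [v-1, 0, 0, Xoff],       &
                                      MPI_ORDER_FORTRAN, MPI_REAL, filetype(v), ierr)
        call MPI_TYPE_COMMIT(filetype(v), ierr)
    end do
    
    ! Memory side: the filetype above visits elements with k fastest,
    ! then j, then i, while R(i,j,k) stores i fastest - so build a
    ! datatype that walks R in (k, j, i) order instead.
    call MPI_TYPE_VECTOR(ZZ, 1, Xel*YY, MPI_REAL, rowtype, ierr)                   ! sweep over k
    call MPI_TYPE_CREATE_HVECTOR(YY, 1, Xel*sizeofreal, rowtype, planetype, ierr)  ! then j
    call MPI_TYPE_CREATE_HVECTOR(Xel, 1, sizeofreal, planetype, memtype, ierr)     ! then i
    call MPI_TYPE_COMMIT(memtype, ierr)
    
    ! One collective write per variable; repeat with filetype(2)..filetype(5)
    ! for U, V, W and P.
    disp = 0
    call MPI_FILE_SET_VIEW(ifile, disp, MPI_REAL, filetype(1), 'native', MPI_INFO_NULL, ierr)
    call MPI_FILE_WRITE_ALL(ifile, R, 1, memtype, wstatus, ierr)

    As for the hints: with ROMIO-based MPI-IO implementations (the hint names below are ROMIO-specific assumptions; other implementations may use different keys), the collective buffering can be tuned through the Info object passed to MPI_FILE_OPEN:

    integer :: info, ierr
    
    call MPI_INFO_CREATE(info, ierr)
    call MPI_INFO_SET(info, 'cb_buffer_size', '16777216', ierr)  ! 16 MB collective buffer
    call MPI_INFO_SET(info, 'romio_cb_write', 'enable', ierr)    ! force collective buffering on writes
    call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, &
                       IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, ifile, ierr)
    call MPI_INFO_FREE(info, ierr)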