Tags: segmentation-fault, fortran, mpi, openmpi, derived-types

Segmentation fault in mpi_gather using fortran derived datatypes


I've been trying to write a program that computes the many millions of dipole-dipole interaction tensors as well as their derivatives. Because these tensors are trivially parallelizable, and often degenerate, I've decided to construct a lookup table (LUT) and distribute the work. Ultimately, they will be combined into a large matrix and diagonalized (I'll use ScaLAPACK for this eventually; for now, the diagonalization fits on a single NERSC node). To keep track of all the indices in Fortran, I've constructed a couple of derived data types.

type dtlrdr_lut
    sequence
    integer p
    integer q
    integer s
    integer i
    integer ind
    real(dp), dimension(3,3) :: dtlrdr
end type dtlrdr_lut

type dtlrdh_lut
    sequence
    integer p
    integer q
    integer ind
    real(dp), dimension(3, 3) :: TLR
    real(dp), dimension(3, 3, 3, 3) :: dTLRdh
end type dtlrdh_lut

In my subroutine where I want to parallelize all of this, I have:

    type(dtlrdr_lut), dimension(:), allocatable :: my_dtlrdr, collected_dtlrdr
    type(dtlrdh_lut), dimension(:), allocatable :: my_dtlrdh, collected_dtlrdh
    integer :: dh_dtype, dr_dtype, dh_types(5), dr_types(6), dh_blocks(5), dr_blocks(6)
    INTEGER(KIND=MPI_ADDRESS_KIND) :: dh_offsets(5), dr_offsets(6)
    if(.not.allocated(my_dtlrdh))    allocate(my_dtlrdh(my_num_pairs))
    if(.not.allocated(my_dtlrdr))    allocate(my_dtlrdr(my_num_pairs*3*nat))

    if(me_image.eq.root_image) then
        if(.not.allocated(collected_dtlrdh))    allocate(collected_dtlrdh(num_pairs))
        if(.not.allocated(collected_dtlrdr))    allocate(collected_dtlrdr(num_pairs*3*nat))
    end if
    call mpi_get_address(my_dtlrdr(1)%p,      dr_offsets(1), ierr)
    call mpi_get_address(my_dtlrdr(1)%q,      dr_offsets(2), ierr)
    call mpi_get_address(my_dtlrdr(1)%s,      dr_offsets(3), ierr)
    call mpi_get_address(my_dtlrdr(1)%i,      dr_offsets(4), ierr)
    call mpi_get_address(my_dtlrdr(1)%ind,    dr_offsets(5), ierr)
    call mpi_get_address(my_dtlrdr(1)%dtlrdr, dr_offsets(6), ierr)
    do i = 2, size(dr_offsets)
      dr_offsets(i) = dr_offsets(i) - dr_offsets(1)
    end do
    dr_offsets(1) = 0
    dr_types = (/MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_DOUBLE_PRECISION/)
    dr_blocks = (/1, 1, 1, 1, 1, 3*3/)
    call mpi_type_struct(6, dr_blocks, dr_offsets, dr_types, dr_dtype, ierr)
    call mpi_type_commit(dr_dtype, ierr)

    call mpi_get_address(my_dtlrdh(1)%p,      dh_offsets(1), ierr)
    call mpi_get_address(my_dtlrdh(1)%q,      dh_offsets(2), ierr)
    call mpi_get_address(my_dtlrdh(1)%ind,    dh_offsets(3), ierr)
    call mpi_get_address(my_dtlrdh(1)%TLR,    dh_offsets(4), ierr)
    call mpi_get_address(my_dtlrdh(1)%dTLRdh, dh_offsets(5), ierr)
    do i = 2, size(dh_offsets)
      dh_offsets(i) = dh_offsets(i) - dh_offsets(1)
    end do
    dh_offsets(1) = 0
    dh_types = (/MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION/)
    dh_blocks = (/1, 1, 1, 3*3, 3*3*3*3/)
    call mpi_type_struct(5, dh_blocks, dh_offsets, dh_types, dh_dtype, ierr)
    call mpi_type_commit(dh_dtype, ierr)
    call mpi_gather(my_dtlrdh, my_num_pairs, dh_dtype, &
                     collected_dtlrdh, num_pairs, dh_dtype, &
                     root_image, intra_image_comm)
    call mp_barrier(intra_image_comm)

    call mpi_gather(my_dtlrdr, my_num_pairs*3*nat, dr_dtype, &
                     collected_dtlrdr, num_pairs*3*nat, dr_dtype, &
                     root_image, intra_image_comm)

What is the result of the code? Well, the root process gathered, got to the barrier, and then seg faulted:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
   Backtrace for this error:
    #0  0x10bac04f2
    #1  0x10bac0cae
    #2  0x7fff8d7c1f19

In this simulation on process 0:

size(my_dtlrdh) = 97
size(collected_dtlrdh) = 194
size(my_dtlrdr) = 873
size(collected_dtlrdr) = 1746

and on process 1

size(my_dtlrdh) = 97
size(collected_dtlrdh) = 3
size(my_dtlrdr) = 873
size(collected_dtlrdr) = 1650521

When I print the offsets, blocks, etc for process 0, I get:

printing dr vars           0
dr_blocks =  1  1  1  1  1  9
dr_offsets = 0  4  8  12 16 24
dr_types =   7  7  7  7  7  17
dr_dtype =   73

printing dh vars 0
dr_blocks =  1  1  1  9  81
dr_offsets = 0  4  8  16 88
dr_types =   7  7  7  17 17
dr_dtype =   74

and for process 1, I get:

printing dr vars  1
dr_blocks =   1  1  1  1  1  9
dr_offsets =  0  4  8 12 16 24
dr_types =    7  7  7  7  7 17
dr_dtype =    73

printing dh vars  1
dr_blocks =   1  1  1  9 81
dr_offsets =  0  4  8 16 88
dr_types =    7  7  7 17 17
dr_dtype =    74

The random size of collected_dtlrdr on process 1 shouldn't matter, though, because the non-root processes aren't actually receiving anything. I can't figure out what's going on or why process 1 can't get through the gather without an invalid memory reference. Any ideas? Please let me know if you need any more information from me.


Solution

  • You've forgotten the error status argument (i.e. the final argument, ierr) in the last three subroutine calls that you have shared.

    I'll wager that you have used the Fortran include file mpif.h rather than the mpi module. If you had done the latter, you would have had automatic checking of the number of arguments and would have received an error message along the lines of

    "There is no matching specific subroutine for this generic subroutine call."

    due to the incorrect number of arguments. A sketch of the corrected calls is shown below.
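
    As a minimal sketch (keeping your counts and variable names exactly as posted, and assuming mp_barrier is just a wrapper around MPI_Barrier on intra_image_comm), the end of the subroutine would look like this with the mpi module and the missing ierr arguments restored:

        ! at the top of the subroutine, instead of include 'mpif.h':
        use mpi
        integer :: ierr

        ! ... derived types, allocations and MPI type construction as in the question ...

        call mpi_gather(my_dtlrdh, my_num_pairs, dh_dtype, &
                        collected_dtlrdh, num_pairs, dh_dtype, &
                        root_image, intra_image_comm, ierr)      ! ierr added

        call mpi_barrier(intra_image_comm, ierr)                 ! if mp_barrier wraps MPI_Barrier

        call mpi_gather(my_dtlrdr, my_num_pairs*3*nat, dr_dtype, &
                        collected_dtlrdr, num_pairs*3*nat, dr_dtype, &
                        root_image, intra_image_comm, ierr)      ! ierr added

    This only restores the missing ierr arguments; everything else is left as you posted it.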