memory fortran mpi gfortran intel-fortran

malloc(): unaligned tcache chunk detected. Has anyone faced this before in MPI Fortran programs?


I have an MPI program that fails with the "malloc(): unaligned tcache chunk detected" error when I run it on one process, but not when I run it on 8 processes. The memory allocation looks like this:


  ALLOCATE(XPOINTS((Npx+1)))
  IF(MY_RANK .eq. 0) WRITE(*,*)  "TESTING"
  ALLOCATE(YPOINTS((Npy+1)))
  ALLOCATE(ZPOINTS((Npz+1)))
  ALLOCATE(x_GLBL((1-Ngl):(Nx_glbl+Ngl)))
  ALLOCATE(y_GLBL((1-Ngl):(Ny_glbl+Ngl)))
  ALLOCATE(z_GLBL((1-Ngl):(Nz_glbl+Ngl)))

Note that I have verified that all the allocation bounds are integers. This is the error I am seeing:

 TESTING
malloc(): unaligned tcache chunk detected
malloc(): unaligned tcache chunk detected

Program received signal SIGABRT: Process abort signal.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Backtrace for this error:
#0  0x7f2145348960 in ???
#1  0x7f2145347ac5 in ???
#2  0x7f214513e51f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x7f21451929fc in __pthread_kill_implementation
        at ./nptl/pthread_kill.c:44
#4  0x7f21451929fc in __pthread_kill_internal
        at ./nptl/pthread_kill.c:78
#5  0x7f21451929fc in __GI___pthread_kill
        at ./nptl/pthread_kill.c:89
#6  0x7f214513e475 in __GI_raise
        at ../sysdeps/posix/raise.c:26
#7  0x7f21451247f2 in __GI_abort
        at ./stdlib/abort.c:79
#8  0x7f2145185675 in __libc_message
        at ../sysdeps/posix/libc_fatal.c:155
#9  0x7f214519ccfb in malloc_printerr
        at ./malloc/malloc.c:5664
#10  0x7f21451a13db in tcache_get
        at ./malloc/malloc.c:3195
#11  0x7f21451a13db in __GI___libc_malloc
        at ./malloc/malloc.c:3313
#12  0x55ecaeda5ab3 in ???
#13  0x55ecaed90452 in ???
#14  0x55ecaed902ee in ???
#15  0x7f2145125d8f in __libc_start_call_main
        at ../sysdeps/nptl/libc_start_call_main.h:58
#16  0x7f2145125e3f in __libc_start_main_impl
        at ../csu/libc-start.c:392
#17  0x55ecaed90324 in ???
#18  0xffffffffffffffff in ???
#0  0x7efe26f48960 in ???
#1  0x7efe26f47ac5 in ???
#2  0x7efe26d3e51f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x7efe26d929fc in __pthread_kill_implementation
        at ./nptl/pthread_kill.c:44
#4  0x7efe26d929fc in __pthread_kill_internal
        at ./nptl/pthread_kill.c:78
#5  0x7efe26d929fc in __GI___pthread_kill
        at ./nptl/pthread_kill.c:89
#6  0x7efe26d3e475 in __GI_raise
        at ../sysdeps/posix/raise.c:26
#7  0x7efe26d247f2 in __GI_abort
        at ./stdlib/abort.c:79
#8  0x7efe26d85675 in __libc_message
        at ../sysdeps/posix/libc_fatal.c:155
#9  0x7efe26d9ccfb in malloc_printerr
        at ./malloc/malloc.c:5664
#10  0x7efe26da13db in tcache_get
        at ./malloc/malloc.c:3195
#11  0x7efe26da13db in __GI___libc_malloc
        at ./malloc/malloc.c:3313
#12  0x55fa223ddab3 in ???
#13  0x55fa223c8452 in ???
#14  0x55fa223c82ee in ???
#15  0x7efe26d25d8f in __libc_start_call_main
        at ../sysdeps/nptl/libc_start_call_main.h:58
#16  0x7efe26d25e3f in __libc_start_main_impl
        at ../csu/libc-start.c:392
#17  0x55fa223c8324 in ???
#18  0xffffffffffffffff in ???

Has anyone faced this before? I have tried everything and can't figure out why it doesn't work on fewer than 8 processors. I tried it with both the Intel and GNU Fortran compilers: it works with 8 processors but not with 1. Is this a problem specific to my laptop?

Edit: I was unable to reproduce this error in a simpler program, so I am attaching the GitHub repo: https://github.com/SahajSJain/MyPoisonX.git

Edit: I was finally able to recreate the error:

PROGRAM create_cart_coords
use, intrinsic :: iso_fortran_env
use mpi
implicit none
integer process_Rank, size_Of_Cluster, ierror, tag, Nx, Ny, Nz, Ngl
real(kind=real64), allocatable, dimension(:) :: array_x, array_y, array_z
real(kind=real64), allocatable, dimension(:,:,:) :: ALPHA, BETA, GAMMA

INTEGER, DIMENSION(3) :: NumProcArr,PeriodicArr, myCOORDS
INTEGER :: MaxDims, CommCart, reorder, myCartRank
INTEGER :: myWest, myEast, mySouth, myNorth, myBack, myFront
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)
NumProcArr=(/1,1,1/);
PeriodicArr=(/0,0,0/)
reorder=1;
MaxDims=3
print *, 'Hello World from process: ', process_Rank, 'of ', size_Of_Cluster
!! do cartesian mapping
call MPI_Cart_create(MPI_COMM_WORLD, MAXDIMS, NumProcArr,PeriodicArr, reorder, CommCart, ierror)
call MPI_Comm_rank(CommCart,myCartRank ,ierror);
call MPI_Cart_coords(CommCart, myCartRank, MAXDIMS, myCOORDS);
WRITE(*,*) "I am MPI Process" , myCartRank," out of ", size_Of_Cluster,", and I am located at",myCOORDS
CALL MPI_CART_SHIFT(CommCart, 0, 1, myWest, myEast, ierror)
CALL MPI_CART_SHIFT(CommCart, 1, 1, mySouth, myNorth, ierror)
CALL MPI_CART_SHIFT(CommCart, 2, 1, myBack, myFront, ierror)
CALL MPI_Barrier(CommCart,ierror)
Nx=50;
Ny=50;
Nz=50;
Ngl=2;
if(process_Rank .eq. 0) WRITE(*,*) "STARTING ALLOCATION"
ALLOCATE(array_x((1-Ngl):(Nx+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED ARRAY_X"
ALLOCATE(array_y((1-Ngl):(Ny+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED ARRAY_y"
ALLOCATE(array_z((1-Ngl):(Nz+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED ARRAY_Z"
ALLOCATE(ALPHA((1-Ngl):(Nx+Ngl),(1-Ngl):(Ny+Ngl),(1-Ngl):(Nz+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED ALPHA"
ALLOCATE(BETA((1-Ngl):(Nx+Ngl),(1-Ngl):(Ny+Ngl),(1-Ngl):(Nz+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED BETA"
ALLOCATE(GAMMA((1-Ngl):(Nx+Ngl),(1-Ngl):(Ny+Ngl),(1-Ngl):(Nz+Ngl)));
if(process_Rank .eq. 0) WRITE(*,*) "ALLOCATED GAMMA"

call MPI_FINALIZE(ierror)
END PROGRAM

Solution

  • The message malloc(): unaligned tcache chunk detected comes from the malloc implementation that allocate uses underneath (glibc, according to your backtrace). This malloc stores additional metadata about each allocation block next to the heap allocation itself. During an allocation, it detected that this metadata was corrupted, which is typically caused by an out-of-bounds write to another allocation.

    In general, AddressSanitizer and Valgrind are tools that detect such out-of-bounds accesses during execution. As commented by @IanBush, for Fortran code compiled with gfortran, adding -fcheck=all is sufficient to detect these errors.

    I tried to compile your code with gfortran and OpenMPI. The compiler complains that the calls to MPI_Cart_create and MPI_Cart_coords do not match their declared interfaces. For MPI_Cart_create, the PeriodicArr argument is expected to be LOGICAL(*). The call to MPI_Cart_coords is missing the ierror argument.

    After fixing these two errors so that the code compiles, I cannot reproduce your error on my system, but the missing ierror argument might already explain the issue.
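
    For reference, a minimal sketch of the two calls with the argument types the mpi module expects could look like the following. The variable names are made up for this example, REORDER is declared LOGICAL as well (the MPI standard specifies it that way), and using MPI_Dims_create to pick the grid is my own assumption so that the sketch runs with any number of processes; your code sets the grid by hand.

    ```fortran
    program cart_coords_fixed
      use mpi
      implicit none
      integer, parameter :: ndims = 3
      integer :: ierror, world_size, comm_cart, cart_rank
      integer :: dims(ndims), coords(ndims)
      logical :: periods(ndims), reorder

      call MPI_Init(ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierror)

      ! Assumption: let MPI factor the process count into a 3D grid.
      dims = 0
      call MPI_Dims_create(world_size, ndims, dims, ierror)

      periods = .false.   ! LOGICAL array, not INTEGER 0/1
      reorder = .true.    ! LOGICAL scalar, not INTEGER 1

      call MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, reorder, &
                           comm_cart, ierror)
      call MPI_Comm_rank(comm_cart, cart_rank, ierror)
      ! With "use mpi", the trailing ierror argument is mandatory.
      call MPI_Cart_coords(comm_cart, cart_rank, ndims, coords, ierror)

      write(*,*) 'rank', cart_rank, 'of', world_size, 'at', coords

      call MPI_Comm_free(comm_cart, ierror)
      call MPI_Finalize(ierror)
    end program cart_coords_fixed
    ```

    If the final ierror is left out and the compiler does not reject the call, the library would presumably still write a status value through an argument the caller never supplied, which is one plausible way to end up with corrupted memory.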

    Running the full code from your repository with mpirun -np 2 ./MyPoisonX reports:

    At line 199 of file CODE.SETUP_FIELD_VARIABLES.F90
    Fortran runtime error: Index '27' of dimension 1 of array 'dxinv' above upper bound of 26
    

    This message is produced by the bounds checking that -fcheck=all enables.

    To use AddressSanitizer, you would add -fsanitize=address to the CFLAGS and LFLAGS in your Makefile.

    Afterwards, execution with mpirun -np 2 env ASAN_OPTIONS="detect_leaks=0" ./MyPoisonX would report errors detected by ASan. I suggest disabling leak checks to avoid flooding the screen with tons of MPI-related memory leaks.

    For this specific code, ASan does not report out-of-bounds errors, while the Fortran-specific analysis does. The reason is that ASan cannot detect an index overflow if the resulting access still lands in valid array memory, e.g. at the beginning of the next column.
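
    To illustrate that last point, here is a small sketch of the column-major index arithmetic. The 26 x 4 shape and the helper flat_offset are hypothetical, chosen only to mirror the upper bound of 26 in the message above; they are not taken from your code.

    ```fortran
    module column_major_demo
      implicit none
    contains
      ! Zero-based storage offset of element (i, j) in a column-major
      ! array whose first dimension runs from 1 to n1.
      pure integer function flat_offset(i, j, n1) result(off)
        integer, intent(in) :: i, j, n1
        off = (i - 1) + (j - 1) * n1
      end function flat_offset
    end module column_major_demo

    program show_overflow
      use column_major_demo
      implicit none
      ! Assumed shape for illustration only.
      integer, parameter :: n1 = 26, n2 = 4

      ! Index 27 in dimension 1 is out of bounds for the array ...
      print *, 'offset of (27,1):', flat_offset(27, 1, n1)
      ! ... but it maps to the same storage location as element (1,2),
      print *, 'offset of (1,2): ', flat_offset(1, 2, n1)
      ! which is still inside the n1*n2 elements that were allocated.
      print *, 'inside allocation:', flat_offset(27, 1, n1) < n1 * n2
    end program show_overflow
    ```

    Both offsets come out as 26, so a tool that only checks whether the address belongs to the allocation sees nothing wrong, whereas -fcheck=all compares each index against its declared bounds.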