mpivalgrind

Using valgrind to spot an error in MPI code


I have a code that works perfectly in serial, but when run with mpirun -n 2 ./out it gives the following error:

*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***

I tried running valgrind as follows:

valgrind --leak-check=yes mpirun -n 2 ./out

I got the following output:

==3494== Memcheck, a memory error detector
==3494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3494== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3494== Command: mpirun -n 2 ./out
==3494== 
Grid_0/NACA0012.msh
Grid_0/NACA0012.msh
>>> Number of cells: 7734
>>> Number of cells: 7734
0.000000  0         1.470622e-02
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3497 RUNNING AT orhan
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==3494== 
==3494== HEAP SUMMARY:
==3494==     in use at exit: 131,120 bytes in 2 blocks
==3494==   total heap usage: 1,064 allocs, 1,062 frees, 231,859 bytes allocated
==3494== 
==3494== LEAK SUMMARY:
==3494==    definitely lost: 0 bytes in 0 blocks
==3494==    indirectly lost: 0 bytes in 0 blocks
==3494==      possibly lost: 0 bytes in 0 blocks
==3494==    still reachable: 131,120 bytes in 2 blocks
==3494==         suppressed: 0 bytes in 0 blocks
==3494== Reachable blocks (those to which a pointer was found) are not shown.
==3494== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3494== 
==3494== For counts of detected and suppressed errors, rerun with: -v
==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I am not experienced with valgrind, but as far as I can tell it found no problem. Are there better valgrind options for locating the source of this specific error?


Solution

  • Note that with the invocation above,

    valgrind --leak-check=yes mpirun -n 2 ./out
    

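    you are running valgrind on the program mpirun, which presumably has been extensively tested and works correctly, and not on the program ./out, which you know to have a problem. The report header confirms this: it lists "Command: mpirun -n 2 ./out", so the "0 errors" summary describes mpirun only.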

    To run valgrind on your test program instead, use:

    mpirun -n 2 valgrind --leak-check=yes ./out
    

    This uses mpirun to launch 2 processes, each of which runs valgrind --leak-check=yes ./out.
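
With two processes each running under valgrind, their reports may interleave on the terminal. As a minimal sketch, the wrapper below builds the corrected command and adds valgrind's --log-file option with the %p placeholder (expanded to the process ID) so that each process writes its own report. The function name run_vg, the DRY_RUN variable, and the log file name valgrind.%p.log are illustrative assumptions, not part of MPI or valgrind.

```shell
#!/bin/sh
# run_vg is a hypothetical helper name used only for this sketch.
# Usage: run_vg <nprocs> <program> [args...]
run_vg() {
    nprocs=$1
    shift
    # Put valgrind AFTER mpirun so that it wraps each rank, not the launcher.
    # --log-file with %p gives one report file per process.
    set -- mpirun -n "$nprocs" valgrind --leak-check=yes \
        --log-file=valgrind.%p.log "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"    # only print the command that would be run
    else
        "$@"         # launch the ranks under valgrind
    fi
}
```

Calling run_vg 2 ./out launches the job. Setting DRY_RUN=1 first prints the command line instead of running it. Memcheck reports invalid reads and writes by default, so an out-of-bounds heap write of the kind that usually causes this malloc() corruption message should, if present, appear in the per-process logs with a stack trace. Compiling ./out with -g makes those traces show source lines.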