Tags: c, mpi, mpi-io

What could make MPI_File_write_all fail with a floating point exception?


I have a call to MPI_File_write_all:

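double buf[100][100][100];      /* data to write */
int data_size = 100*100*100;    /* number of MPI_DOUBLE elements in buf */
MPI_Status stat_mpi;
MPI_File sgfh;                  /* MPI-IO file handle */

... 

/* Collective write of data_size doubles from buf */
MPI_File_write_all(sgfh, (void*)buf, data_size, MPI_DOUBLE, &stat_mpi);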

The size of buf can vary, 100^3 is just an example. Under certain circumstances that I still don't have a complete handle on, the call to MPI_File_write_all fails with a floating-point exception. Everything I can test -- the buf array, value of data_size -- checks out OK.

Any idea what could cause this? I get the same error with both the Cray and GNU compilers, regardless of optimization level.

Sorry, I don't have a small program that reproduces the problem. Even stripped down to the bare essentials, the code would still be too big to post here.


Solution

  • The floating point exception most likely comes from the two-phase collective buffering algorithm dividing by zero because of a bug (on Linux, an integer division by zero is also reported as a floating point exception). I have only seen that happen on Lustre, when the stripe count is somehow incorrect.

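    You can test this theory by disabling collective buffering for writes. If the exception goes away, the collective buffering path is the likely culprit. The easiest way with Cray MPI is to set the MPICH_MPIIO_HINTS environment variable: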

    export MPICH_MPIIO_HINTS='*:romio_cb_write=disable'
    aprun ... your_program
    

    Cray made the business decision to keep their MPI-IO modifications to ROMIO closed source. That choice is well within their rights, but it means I can only offer vague suggestions. For an actual bug fix you will have to go through your Cray support contact.