I have Fortran code, for which I would like to enable trapping of floating-point exceptions (as recommended by the compiler man pages). But when I do so, the binary will no longer run in a Slurm queue.
The problem is apparently located in libxml2 (a run-time dependency of Slurm, which provides a wrapper for the MPI runtime, called with MPI_Init). This triggers an FPE, so execution stops before my own code even has a chance to run.
Ideally, I would like to tell the compiler/linker to create binaries that trap FPEs raised by my own code but ignore FPE conditions arising in external libraries.
integer :: ierror
print*,"MPI_Init returned", ierror
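For reference, a complete program along the lines of the fragment above might look as follows. This is only a sketch: the program name mwe, the use of the mpi_f08 module (suggested by the mpi_init_f08_ frame in the backtrace further down) and the closing MPI_Finalize call are assumptions.

```fortran
program mwe
  use mpi_f08          ! assumption: Fortran 2008 MPI bindings
  implicit none
  integer :: ierror

  ! Under srun, the floating-point exception is raised inside this call.
  call MPI_Init(ierror)
  print*,"MPI_Init returned", ierror

  ! Assumption: clean shutdown, not part of the fragment shown above.
  call MPI_Finalize(ierror)
end program mwe
```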
Compiling with either gfortran,
mpif90 -ffpe-trap=invalid,zero,overflow mwe.F90
or ifort,
mpifort -fpe0 mwe.F90
leaves me with a binary that runs fine under mpirun -np 1 (or any higher number).
But when called from a Slurm batch script, or run directly via srun ./a.out, a floating-point exception is triggered from somewhere down the stack. The ifort backtrace tells me where:
$ srun ./a.out
forrtl: error (65): floating invalid
Image PC Routine Line Source
libc.so.6 000014609708A520 Unknown Unknown Unknown
libxml2.so.2.9.13 000014609407C723 Unknown Unknown Unknown
libxml2.so.2.9.13 0000146094054A4F xmlCheckVersion Unknown Unknown
hwloc_xml_libxml. 0000146094A100C6 Unknown Unknown Unknown
libhwloc.so.15.6. 00001460967BBF33 Unknown Unknown Unknown
libhwloc.so.15.6. 00001460967A83A9 Unknown Unknown Unknown
libopen-pal.so.40 0000146096D538FA opal_hwloc_base_g Unknown Unknown
libopen-rte.so.40 0000146096E4C630 orte_ess_base_pro Unknown Unknown
libopen-rte.so.40 0000146096E53705 Unknown Unknown Unknown
libopen-rte.so.40 0000146096EE05AD orte_init Unknown Unknown
libmpi.so.40.30.4 00001460975F2125 ompi_mpi_init Unknown Unknown
libmpi.so.40.30.4 0000146097414200 MPI_Init Unknown Unknown
libmpi_mpifh.so.4 00001460976F0D38 PMPI_Init_f08 Unknown Unknown
libmpi_usempif08. 0000146097734242 mpi_init_f08_ Unknown Unknown
ifort 000000000040B1F8 Unknown Unknown Unknown
ifort 000000000040B19D Unknown Unknown Unknown
libc.so.6 0000146097071D90 Unknown Unknown Unknown
libc.so.6 0000146097071E40 __libc_start_main Unknown Unknown
ifort 000000000040B0B5 Unknown Unknown Unknown
srun: error: localhost: task 0: Aborted
The culprit is libxml2 (I get the same pointer from gdb for the gfortran binary).
I would prefer to have control over the pedantry level of my software at compile and run time, rather than have some random dependencies collected along the way restrict what I can and cannot choose. But I guess there is no prospect of having the compiler/linker apply standards to my own code, but not to stuff that comes in from shared objects at run time?
This is particularly annoying since the alleged origin of the divide-by-zero error has been present in libxml2 since around version 2.9.11. As long as many major/stable platforms are still on the 2.9 releases (which will be the case for a VERY long time), floating-point traps will be essentially unusable for projects that are meant to run under Slurm.
Since libxml2 version 2.10 was released some two years ago, and its developers seem to have found a cleverer way to deal with that situation, this is not exactly an upstream bug (in an actively developed/maintained branch); it would be necessary to convince LTS/stable OS distributors to step in.
Platform: Ubuntu 22.04 LTS, x86_64, gcc 12.2.0, OpenMPI 4.1.4, libxml2 2.9.13
(This is something of a follow-up to "SIGFPE - erroneous arithmetic operation - in MPI_Init() in Fortran". FWIW, this question is more focused on the root cause, and avoids distractions like CMake or specific choices for the Fortran compiler.)
As francescalus and Ian Bush commented, the natural way to do that is the standard IEEE exception handling available in Fortran 2003 and later. The code in my recent question "Setting IEEE FPE halting mode for OpenMP threads" is almost exactly that; it was originally written precisely to disable the exception for a certain subroutine.
Put the following at the start of your program (with the intrinsic module ieee_exceptions in use) to enable exception trapping, i.e. halting on floating-point exceptions. Very likely you can use the compiler flags instead, but that is less portable.
call ieee_set_halting_mode(ieee_overflow, .true.)
call ieee_set_halting_mode(ieee_invalid, .true.)
call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
Before calling the subroutine that must run with halting on floating-point exceptions disabled, save the current halting modes and switch them off:
logical :: saved_fpe_mode(size(ieee_all))
call ieee_get_halting_mode(ieee_all, saved_fpe_mode)
call ieee_set_halting_mode(ieee_all, .false.)
After calling the subroutine, restore the saved halting modes:
call ieee_set_halting_mode(ieee_all, saved_fpe_mode)
As the linked question and the answer show, you may need to be careful if you use multiple threads as the halting mode can be local to each thread.
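Applied to the situation in the question, the pieces above could be combined around MPI_Init roughly as follows. This is a minimal sketch, not a tested recipe: the program name fpe_guarded_init and the use of mpi_f08 are assumptions, and clearing the exception flags before restoring the halting modes is a precaution I added so that flags raised inside the library are not left pending.

```fortran
program fpe_guarded_init
  use, intrinsic :: ieee_exceptions
  use mpi_f08                      ! assumption: Fortran 2008 MPI bindings
  implicit none

  logical :: saved_fpe_mode(size(ieee_all))
  integer :: ierror

  ! Remember the halting modes in effect (for example those set by
  ! -ffpe-trap or -fpe0), then switch halting off for the library call.
  call ieee_get_halting_mode(ieee_all, saved_fpe_mode)
  call ieee_set_halting_mode(ieee_all, .false.)

  call MPI_Init(ierror)

  ! Precaution: clear any flags raised inside the library, then
  ! restore the halting modes for the rest of the program.
  call ieee_set_flag(ieee_all, .false.)
  call ieee_set_halting_mode(ieee_all, saved_fpe_mode)

  print *, "MPI_Init returned", ierror

  call MPI_Finalize(ierror)
end program fpe_guarded_init
```

Exceptions raised by your own code after the restore should halt as before, while whatever happens inside MPI_Init on the calling thread is ignored. Other library calls that run third-party code later on (MPI_Finalize, for instance) may need the same treatment.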