
Use FP exception traps (-ffpe-trap/-fpe0) for code linked against SIGFPE-unsafe library (libxml2)


I have Fortran code, for which I would like to enable trapping of floating-point exceptions (as recommended by the compiler man pages). But when I do so, the binary will no longer run in a Slurm queue.

The problem is apparently located in libxml2 (a run-time dependency of Slurm, which provides a wrapper for the MPI runtime that is called from MPI_Init). libxml2 triggers an FPE there, so execution stops before my own code even has a chance to run.

Ideally, I would like to tell the compiler/linker to create binaries that trap FPEs raised by my own code but are insensitive to FPE conditions raised in external libraries.

Minimal example:

program mwe
  use mpi_f08
  integer :: ierror
  call MPI_Init(ierror)
  print*,"MPI_Init returned", ierror
end program

Compiling with

  • gfortran, invoked as mpif90 -ffpe-trap=invalid,zero,overflow mwe.F90
  • ifort, invoked as mpifort -fpe0 mwe.F90

leaves me with a binary that runs fine

  • when called directly: ./a.out
  • when called with the parallel wrapper like mpirun -np 1 (or any higher number).

But when called from a Slurm batch script, or run directly via srun ./a.out, a floating-point exception is triggered from somewhere down the stack. The ifort backtrace tells me where:

 $ srun ./a.out 
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
libc.so.6          000014609708A520  Unknown               Unknown  Unknown
libxml2.so.2.9.13  000014609407C723  Unknown               Unknown  Unknown
libxml2.so.2.9.13  0000146094054A4F  xmlCheckVersion       Unknown  Unknown
hwloc_xml_libxml.  0000146094A100C6  Unknown               Unknown  Unknown
libhwloc.so.15.6.  00001460967BBF33  Unknown               Unknown  Unknown
libhwloc.so.15.6.  00001460967A83A9  Unknown               Unknown  Unknown
libopen-pal.so.40  0000146096D538FA  opal_hwloc_base_g     Unknown  Unknown
libopen-rte.so.40  0000146096E4C630  orte_ess_base_pro     Unknown  Unknown
libopen-rte.so.40  0000146096E53705  Unknown               Unknown  Unknown
libopen-rte.so.40  0000146096EE05AD  orte_init             Unknown  Unknown
libmpi.so.40.30.4  00001460975F2125  ompi_mpi_init         Unknown  Unknown
libmpi.so.40.30.4  0000146097414200  MPI_Init              Unknown  Unknown
libmpi_mpifh.so.4  00001460976F0D38  PMPI_Init_f08         Unknown  Unknown
libmpi_usempif08.  0000146097734242  mpi_init_f08_         Unknown  Unknown
ifort              000000000040B1F8  Unknown               Unknown  Unknown
ifort              000000000040B19D  Unknown               Unknown  Unknown
libc.so.6          0000146097071D90  Unknown               Unknown  Unknown
libc.so.6          0000146097071E40  __libc_start_main     Unknown  Unknown
ifort              000000000040B0B5  Unknown               Unknown  Unknown
srun: error: localhost: task 0: Aborted

The culprit is xmlCheckVersion in libxml2 (gdb points me to the same location for the gfortran binary).


I would prefer to control the pedantry level of my software at compile and run time myself, rather than have random dependencies collected along the way restrict what I can and cannot choose. But I guess there is no prospect of having the compiler/linker apply these standards to my own code only, and not to whatever comes in from shared objects at run time?

This is particularly annoying because libxml2 has allegedly been the origin of the divide-by-zero error since around version 2.9.11. As long as many major/stable platforms are still on the 2.9 releases (which will be the case for a VERY long time), floating-point traps will be essentially unusable for projects that are meant to run under Slurm.

libxml2 version 2.10 was released some two years ago, and its developers seem to have found a cleverer way to deal with that situation. So this is not exactly an upstream bug (in an actively developed/maintained branch), and it would be necessary to convince LTS/stable OS distributors to step in.

Platform: Ubuntu 22.04 LTS, x86_64, gcc 12.2.0, OpenMPI 4.1.4, libxml2 2.9.13

(This is something of a follow-up to "SIGFPE - erroneous arithmetic operation - in MPI_Init() in Fortran". FWIW, this question is more focused on the root cause, and avoids distractions like CMake or specific choices for the Fortran compiler.)


Solution

  • As francescalus and Ian Bush commented, the natural way to do that is the standard IEEE exception handling from Fortran 2003 and later. The code in my recent question "Setting IEEE FPE halting mode for OpenMP threads" is actually almost exactly that. It was originally written precisely to disable the exception for a certain subroutine.

    At the start of your program, enable exception trapping (halting on floating-point exceptions). Very likely you can use the compiler flags instead, but that is less portable.

    use ieee_exceptions 
    
    call ieee_set_halting_mode(ieee_overflow, .true.)
    call ieee_set_halting_mode(ieee_invalid, .true.)
    call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
    

    Before calling the subroutine that must run with halting on floating-point exceptions disabled, save the current halting modes and switch them all off:

    logical :: saved_fpe_mode(size(ieee_all))
    
    call ieee_get_halting_mode(ieee_all, saved_fpe_mode)
    call ieee_set_halting_mode(ieee_all, .false.)
    

    After calling the subroutine, restore the saved halting modes:

    call ieee_set_halting_mode(ieee_all, saved_fpe_mode)
    

    As the linked question and its answer show, you may need to be careful if you use multiple threads, as the halting mode can be local to each thread.
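
Putting the pieces together, the minimal example from the question could be guarded roughly as follows. This is only a sketch, not tested against the Slurm/libxml2 setup described above: it assumes the compiler supports halting control for all flags in ieee_all, the program name mwe_guarded is made up for illustration, and the ieee_set_flag call is an extra precaution that is not part of the recipe above.

```fortran
program mwe_guarded
  use mpi_f08
  use, intrinsic :: ieee_exceptions
  implicit none
  integer :: ierror
  logical :: saved_fpe_mode(size(ieee_all))

  ! Trap the exceptions of interest in our own code
  ! (the compiler flags from the question would have a similar effect).
  call ieee_set_halting_mode(ieee_overflow, .true.)
  call ieee_set_halting_mode(ieee_invalid, .true.)
  call ieee_set_halting_mode(ieee_divide_by_zero, .true.)

  ! Save the current halting modes, then disable halting
  ! for the duration of the library call.
  call ieee_get_halting_mode(ieee_all, saved_fpe_mode)
  call ieee_set_halting_mode(ieee_all, .false.)

  call MPI_Init(ierror)

  ! Precaution (an assumption, not from the answer above): clear any
  ! flags raised inside the library so that a stale flag cannot be
  ! mistaken for one raised by our own code later on.
  call ieee_set_flag(ieee_all, .false.)

  ! Restore the halting modes that were active before the call.
  call ieee_set_halting_mode(ieee_all, saved_fpe_mode)

  print *, "MPI_Init returned", ierror
  call MPI_Finalize(ierror)
end program mwe_guarded
```

It would be compiled with the same MPI wrappers as in the question (mpif90 or mpifort). Because the halting modes are saved and restored rather than hard-coded, the same guard should also work when trapping was enabled through -ffpe-trap or -fpe0 instead of the ieee_set_halting_mode calls at the top.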