Open MPI/MPICH - What happens if a node terminates?


I would like to know what happens if a node of an Open MPI/MPICH2 cluster terminates. Is there some mechanism that tolerates this case and lets execution continue?

Thanks for your answers, Heinrich


Solution

  • Note that a feature that has existed since the MPI 1.x days is that you can set an error handler, e.g.:

    http://www.mpi-forum.org/docs/mpi-11-html/node148.html

    As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily be recovered (except through checkpointing, which most of us do anyway).

    But that need not be the case; you can have the MPI functions return error codes instead (by installing the MPI_ERRORS_RETURN handler) and try to recover as best you can.

    There are a few fault-tolerant MPI packages out there, such as FT-MPI (http://icl.cs.utk.edu/ftmpi/), which is fairly old and implements only MPI 1.2 functionality. More recently, http://osl.iu.edu/research/ft/cifts/ is one approach being put into Open MPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR, which may be of interest.

    The MPI-3 forum is discussing a standard fault-tolerance API in MPI, so the pace of such projects is accelerating.