fortran, signals, mpi, restart, signal-handling

Signal handling and checkpointing for mpif90


I have written code that traps the Ctrl+C signal (SIGINT) when compiled with gfortran, and it works.

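program trap
  external trap_term
  ! signal 2 is SIGINT (Ctrl+C); signal, sleep and exit are GNU extensions
  call signal(2, trap_term)
  call sleep(60)
end program trap

function trap_term()
  integer :: trap_term
  print *, 'done'
  ! define the function result before passing it on as the exit status
  trap_term = 0
  call exit(trap_term)
end function trap_term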

How would I write exactly the same thing for mpif90? Also, what is the best way to add checkpoints and restart the code (ideally automatically, from where it left off) when running on parallel processors?

I need this because I have allocated time on clusters. Jobs are killed after a fixed number of hours and then have to be resubmitted.


Solution

  • Writing your software to checkpoint on receipt of a kill signal from the operating system is likely to be far less useful than you probably hope. Even supposing that you can code your program to write a full checkpoint in the time available once it is told to stop, you are then left with restarting your program from the arbitrary point at which it was stopped. That is far from a trivial problem.

    Why not do what many of us used to do, and many of us still do, in this domain? Write your code to checkpoint every X iterations or at intervals of approximately Y minutes (you choose X and Y), and write routines to restart from one of those checkpoints in the event that a previous execution has been prematurely halted. This way you only ever have to restart from a single, well-defined state of execution.

    You should probably be writing these checkpoint and restart routines anyway to guard against hardware problems, which only become worse as the CPU count rises and the number of network connections multiplies.

    I suppose you could write your code to keep an eye on the wall clock, as it were: tell it on start-up that it has an allowance of N hours, and have it checkpoint at N-n hours, where n is long enough to complete the checkpoint with a small safety margin. But this approach won't help if a CPU fails mid-computation.
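
As a rough illustration of the approach described above, here is a minimal sketch that combines periodic checkpoints, restart on start-up, and a wall-clock budget. Everything specific in it is an assumption made for the example: the file name pattern ckpt_NNNNNN.bin, the array state, the interval ckpt_every (the X above), and budget (N-n, here 3.5 hours for a presumed 4-hour limit). Each rank writes its own file, and rank 0 decides when the time allowance is used up and broadcasts that decision so that all ranks stop at the same iteration.

```fortran
program checkpoint_demo
  use mpi
  implicit none
  ! All names and values here are illustrative assumptions.
  integer, parameter :: n = 1000             ! local problem size
  integer, parameter :: max_iter = 100000    ! total iterations to run
  integer, parameter :: ckpt_every = 1000    ! X: iterations between checkpoints
  double precision, parameter :: budget = 3.5d0 * 3600.0d0  ! N-n hours, in seconds
  double precision :: state(n), t_start
  integer :: ierr, rank, iter, first_iter, stop_now, u
  character(len=32) :: fname
  logical :: found

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  t_start = MPI_Wtime()
  write(fname, '(a,i6.6,a)') 'ckpt_', rank, '.bin'   ! one file per rank

  ! Restart: resume after the last saved iteration if a checkpoint exists.
  inquire(file=trim(fname), exist=found)
  if (found) then
     open(newunit=u, file=trim(fname), form='unformatted', &
          status='old', action='read')
     read(u) first_iter, state
     close(u)
     first_iter = first_iter + 1
  else
     first_iter = 1
     state = 0.0d0
  end if

  do iter = first_iter, max_iter
     state = state + 1.0d0                   ! stand-in for the real computation

     ! Rank 0 watches the wall clock; every rank receives the same decision.
     stop_now = 0
     if (rank == 0) then
        if (MPI_Wtime() - t_start > budget) stop_now = 1
     end if
     call MPI_Bcast(stop_now, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

     if (mod(iter, ckpt_every) == 0 .or. stop_now == 1) then
        call write_checkpoint(iter)
     end if
     if (stop_now == 1) exit
  end do

  call MPI_Finalize(ierr)

contains

  ! Overwrite this rank's checkpoint with the current iteration and state.
  subroutine write_checkpoint(it)
    integer, intent(in) :: it
    integer :: v
    open(newunit=v, file=trim(fname), form='unformatted', &
         status='replace', action='write')
    write(v) it, state
    close(v)
  end subroutine write_checkpoint

end program checkpoint_demo
```

This would typically be built with something like mpif90 checkpoint_demo.f90 and launched with the same number of ranks on every resubmission. It is only a sketch: a crash in the middle of a checkpoint write can leave the per-rank files inconsistent or truncated, so real code would more likely write to a temporary file and rename it once the write has completed, or keep the previous checkpoint as a fallback. Broadcasting the stop flag on every iteration is also wasteful if iterations are cheap; checking every few hundred iterations would probably do.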