How to handle job cancellation in Slurm?

I am using the Slurm job manager on an HPC cluster. Sometimes a job is canceled because it reaches its time limit, and I would like my program to finish gracefully.

As far as I understand, cancellation happens in two stages precisely so that a developer can shut the program down gracefully:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 18522559 ON ncm0317 CANCELLED AT 2020-12-14T19:42:43 DUE TO TIME LIMIT ***

You can see that I am given 62 seconds to finish the job the way I want it to finish (by saving some files, etc.).

Question: how to do this? I understand that some Unix signal is first sent to my job and that I need to respond to it correctly. However, I cannot find any information in the Slurm documentation on what this signal is. Besides, I do not know exactly how to handle it in Python; probably through exception handling.

Solution

In Slurm, you can decide which signal is sent at which moment before your job hits the time limit.

From the sbatch man page:

--signal=[[R][B]:]<sig_num>[@<sig_time>] When a job is within sig_time seconds of its end time, send it the signal sig_num.

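So set

#SBATCH --signal=B:TERM@300

to get Slurm to signal the job with SIGTERM 5 minutes before the allocation ends (sig_time is given in seconds). The B: prefix delivers the signal to the batch shell only; without it, the signal goes to the job steps instead. So, depending on how you start your program (directly in the batch script or through srun), you might need to remove the B: part.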

In your Python script, use the signal module. You need to define a "signal handler", a function that will be called when the signal is received, and "register" that function for a specific signal. As that function disrupts the normal flow of the program when it is called, you need to keep it short and simple to avoid unwanted side effects, especially with multithreaded code.
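As a minimal sketch of the registration mechanism on its own: the names `received` and `handler` are illustrative, and `os.kill` is used here only to simulate the signal that Slurm would send.

```python
import os
import signal

received = []  # signal numbers seen by the handler (illustrative name)

def handler(signum, frame):
    # Keep the handler tiny: just record that the signal arrived
    received.append(signum)

# Register the handler; signal.signal returns the previous handler
previous = signal.signal(signal.SIGTERM, handler)

# Simulate Slurm by sending SIGTERM to this very process
os.kill(os.getpid(), signal.SIGTERM)

# Restore the previous handler once the demonstration is over
signal.signal(signal.SIGTERM, previous)
```

The handler does nothing except record the event; the actual cleanup is left to the main program, which can check `received` at a convenient point.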

A typical scheme in a Slurm environment is to have a script skeleton like this:

#!/usr/bin/env python3

import signal
import sys

# Global Boolean variable that indicates that a signal has been received
interrupted = False

# Global Boolean variable that indicates the natural end of the computations
converged = False

# Definition of the signal handler. All it does is flip the 'interrupted' variable
def signal_handler(signum, frame):
    global interrupted
    interrupted = True

# Register the signal handler
signal.signal(signal.SIGTERM, signal_handler)

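try:
    # Try to recover the state saved by a previous run, if any
    with open('state', 'r') as file:
        state = file.read()
except FileNotFoundError:
    # No saved state: bootstrap (start from scratch).
    # init_computation() is a placeholder for your own setup code.
    state = init_computation()

while not interrupted and not converged:
    # Placeholder for one unit of work; it should return the updated
    # state and True once the computation has reached its natural end.
    state, converged = do_computation_iteration(state)

if interrupted:
    # Save the current state so that the next run can resume from it
    with open('state', 'w') as file:
        file.write(state)
    sys.exit(99)
sys.exit(0)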

This first tries to resume the computation from the state left by a previous run of the job, and otherwise bootstraps it. If it is interrupted, it lets the current loop iteration finish properly and then saves the needed variables to disk. It then exits with return code 99. If Slurm is configured for it (for example through the RequeueExit parameter in slurm.conf), this lets the job be requeued automatically for further iterations.
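The restart logic described above can also be sketched in a self-contained, runnable form. Everything specific here is an assumption made for illustration: the function name `run`, the JSON state file, the `stop_after` testing aid, and the toy computation (summing the integers up to `target`) all stand in for a real workload.

```python
import json
import os
import signal

interrupted = False

def signal_handler(signum, frame):
    # Only flip the flag; the main loop does the actual work
    global interrupted
    interrupted = True

def run(state_path, target=1000, stop_after=None):
    """Resume from state_path if it exists, otherwise start from scratch.

    stop_after is a hypothetical testing aid: after that many iterations
    the process sends itself SIGTERM to mimic Slurm's warning signal.
    Returns (exit_code, state), with exit code 99 meaning "interrupted".
    """
    global interrupted
    interrupted = False
    signal.signal(signal.SIGTERM, signal_handler)

    try:
        with open(state_path) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"i": 0, "total": 0}

    done_this_run = 0
    while not interrupted and state["i"] < target:
        # One iteration of the toy computation
        state["i"] += 1
        state["total"] += state["i"]
        done_this_run += 1
        if stop_after is not None and done_this_run == stop_after:
            os.kill(os.getpid(), signal.SIGTERM)

    if interrupted:
        # Write to a temporary file first, then rename it, so that a
        # kill during the write cannot leave a truncated state file
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
        return 99, state
    return 0, state
```

In a real job script, the last line would be something like `sys.exit(run("state.json")[0])`, so that the exit code reaches Slurm.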

If Slurm is not configured for it, you can do it manually in the submission script like this:

python myscript.py || scontrol requeue $SLURM_JOB_ID
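If a Python launcher is preferred over the shell one-liner, the same decision can be sketched as below. The helper names `requeue_command` and `main`, and the choice to requeue only on exit code 99 (the `||` form requeues on any non-zero exit code), are assumptions for illustration; `scontrol requeue` and `SLURM_JOB_ID` are the same command and variable as above.

```python
import os
import subprocess
import sys

REQUEUE_EXIT_CODE = 99  # exit code used by the script to ask for a requeue

def requeue_command(returncode, env):
    """Return the scontrol command to run, or None if no requeue is needed."""
    job_id = env.get("SLURM_JOB_ID")
    if returncode == REQUEUE_EXIT_CODE and job_id:
        return ["scontrol", "requeue", job_id]
    return None

def main(script="myscript.py"):
    # Run the computation with the same interpreter as this launcher
    result = subprocess.run([sys.executable, script])
    cmd = requeue_command(result.returncode, os.environ)
    if cmd is not None:
        # Only meaningful inside a Slurm job, where scontrol is available
        subprocess.run(cmd, check=True)
    return result.returncode
```

Calling `sys.exit(main())` at the end of the launcher passes the script's exit code on to Slurm.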