Slurm action at job termination or failure


I would like the Slurm workload manager to perform some action, such as touch stopped.txt, when a job terminates, whether because of a timeout or a failure. How can this be done?


Solution

  • Once the job has terminated, there is no way for regular users to perform further actions. (Admins can use strigger or set up epilog scripts.)

    For termination due to a timeout, the typical course of action is to set up a Bash "trap" to catch a signal, and to ask Slurm (with the --signal option) to send that signal a few minutes before the job is killed.

    For termination due to failure, you can test the return code of your main program inside the submission script and act accordingly.

    Another option, which could be seen as overkill but is easier to implement, is to submit a "monitoring" job that depends on the main job, and have that job create the stopped.txt file based on the main job's state in the accounting records.
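
The first two approaches (a trap for the timeout, an exit-code test for the failure) can be combined in one submission script. What follows is a minimal sketch, not a definitive implementation: the time limit, the 300-second lead time, the choice of USR1, and the names on_stop and main_program are all assumptions made for illustration, and main_program is a placeholder that deliberately fails so that the failure branch runs.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Ask Slurm to send SIGUSR1 to the batch shell only ("B:"),
# roughly 300 seconds before the time limit is reached.
#SBATCH --signal=B:USR1@300

# Hypothetical handler: leave a marker file, then stop the script.
on_stop() {
    touch stopped.txt
    exit 1
}
trap on_stop USR1

# Placeholder for the real workload (assumption: replace with your program).
# It returns a non-zero status here to exercise the failure branch.
main_program() {
    return 1
}

# Run the workload in the background and wait for it: Bash handles a
# trapped signal immediately while in "wait", but not while a foreground
# command is still running.
main_program &
wait $!
status=$?

# Termination due to failure: act on the exit code of the main program.
if [ "$status" -ne 0 ]; then
    touch stopped.txt
fi
```

Running the workload in the background is the important design choice: with a foreground command, Bash would only run the trap after that command had finished, which may be too late.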
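
The monitoring-job option could look roughly like the sketch below. It assumes that accounting is enabled on the cluster so that sacct can report the job state; WATCHED_JOB_ID and record_if_stopped are hypothetical names introduced here, and the file names job.sh and monitor.sh in the comments are placeholders.

```shell
#!/bin/bash
#SBATCH --time=00:05:00
#
# Possible submission, run from the login node:
#   jobid=$(sbatch --parsable job.sh)
#   sbatch --dependency=afterany:"$jobid" \
#          --export=ALL,WATCHED_JOB_ID="$jobid" monitor.sh

# Hypothetical helper: create stopped.txt unless the watched job
# ended in the COMPLETED state.
record_if_stopped() {
    local state="$1"
    case "$state" in
        COMPLETED*) ;;   # normal end: nothing to do
        "") echo "no accounting record found" >&2 ;;
        *) touch stopped.txt ;;   # TIMEOUT, FAILED, CANCELLED, ...
    esac
}

# WATCHED_JOB_ID is an assumed variable, passed in at submission time.
if [ -n "${WATCHED_JOB_ID:-}" ]; then
    # -X: job allocation only (no steps); -n: no header; -P: parsable output
    state=$(sacct -j "$WATCHED_JOB_ID" -X -n -P --format=State)
    record_if_stopped "$state"
fi
```

The afterany dependency makes the monitoring job start once the main job has ended, whatever its final state, so the check runs for timeouts and failures alike.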