Trap command in script works when called from CLI but not when used in a PBS job


I have the following simple bash script:

#!/bin/bash

set -o pipefail
set -o errtrace
set -o errexit

PROGNAME=$0

trap 'echo "${PROGNAME} received signal EXIT" | mailx -s "EXIT" "someone@anywhere.com"' EXIT
trap 'echo "${PROGNAME} received signal SIGHUP" | mailx -s "SIGHUP" "someone@anywhere.com"' SIGHUP
trap 'echo "${PROGNAME} received signal SIGINT" | mailx -s "SIGINT" "someone@anywhere.com"' SIGINT
trap 'echo "${PROGNAME} received signal SIGQUIT" | mailx -s "SIGQUIT" "someone@anywhere.com"' SIGQUIT
trap 'echo "${PROGNAME} received signal SIGTERM" | mailx -s "SIGTERM" "someone@anywhere.com"' SIGTERM

sleep 1000

When I run this script from the command line:

./test_script.sh

and then interrupt it with CTRL+C, I get two emails: one containing the message "received signal EXIT" and the other containing the message "received signal SIGINT".

However when I run this script as a PBS job:

qsub test_script.sh

and then wait a minute or two and run qdel on the submitted job, I receive only one email, containing "received signal EXIT". I also expected an email stating "received signal SIGTERM", because the qdel man page states:

A batch job being deleted by a server will be sent a SIGTERM signal following by a SIGKILL signal

Does anyone know why this is? Ideally, I would like to receive an email when something inside my script returns a non-zero exit code, and a different email when the script terminates earlier than expected, for instance because of a SIGINT or a SIGTERM.

Some additional information: when I change the line

trap 'echo "${PROGNAME} received signal EXIT" | mailx -s "EXIT" "someone@anywhere.com"' EXIT

to

trap 'echo "${PROGNAME} received signal EXIT, last command was ${BASH_COMMAND}" | mailx -s "EXIT" "someone@anywhere.com"' EXIT

I can see that the last command executed was mailx -s "SIGTERM" "someone@anywhere.com" and not sleep 1000. So the SIGTERM signal does seem to be caught, but the subsequent trap command does not work for PBS jobs...


Solution

  • This is rather confusing, but the problem is that the script traps the signal while the shell that runs the script does not. There are two ways to solve this:

    1. Enable the $exec_with_exec option in the MOM's config file. This makes pbs_mom launch the job slightly differently (using exec), which handles the issue for you. You'll need admin rights to change the config file, but this parameter is documented here.
    2. Configure the shell to also trap the signal (this can have unintended consequences).
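
Independently of how the job is launched, the question's goal of getting one message for a non-zero exit and a different one for a signal can be sketched roughly as follows. This is only a minimal sketch, not a fix for the PBS behavior itself: notify, on_signal, on_exit, run_job, caught_signal and child_pid are hypothetical names introduced here, and notify prints to stdout as a stand-in for the mailx call from the question. The long-running command is started in the background and waited on, because bash defers a trap until the current foreground command finishes, while the wait builtin returns as soon as a trapped signal arrives.

```bash
#!/bin/bash
# Sketch only: report signal-driven termination separately from ordinary failures.

PROGNAME=$0
caught_signal=""   # set by on_signal, read by on_exit
child_pid=""       # PID of the background command started by run_job

# Hypothetical helper. In a real job the body might instead be:
#   echo "$2" | mailx -s "$1" "someone@anywhere.com"
notify() {
    printf '%s: %s\n' "$1" "$2"
}

# Runs when one of the trapped signals arrives.
on_signal() {
    caught_signal=$1
    notify "$1" "${PROGNAME} received $1"
    if [[ -n $child_pid ]]; then
        # Stop the command we were waiting on; ignore "already gone" errors.
        kill "$child_pid" 2>/dev/null || true
    fi
    # Exit with the conventional 128 + signal number status.
    exit $(( 128 + $(kill -l "$1") ))
}

# Runs on every exit; reports only failures that no signal handler reported.
on_exit() {
    local status=$?
    if [[ -z $caught_signal && $status -ne 0 ]]; then
        notify "ERROR" "${PROGNAME} exited with status ${status}"
    fi
}

# Install the traps, then run the given command in the background and wait,
# so that a trapped signal interrupts the wait instead of being deferred.
run_job() {
    local sig
    trap on_exit EXIT
    for sig in HUP INT QUIT TERM; do
        trap "on_signal $sig" "$sig"
    done
    "$@" &
    child_pid=$!
    wait "$child_pid"
}
```

A job script built on this sketch would end with something like run_job sleep 1000. Under PBS, whether on_signal ever fires still depends on how pbs_mom starts the script, so this complements the two options above rather than replacing them.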