
Detect errors with torque and grid engine and prevent execution of dependent tasks


I have a shell script that queues multiple tasks for execution on an HPC cluster. The same job submission script works for either Torque or Grid Engine with some minor conditional logic. This is a pipeline where the output of earlier tasks is fed to later tasks for further processing. I'm using qsub to define job dependencies, so later tasks wait for earlier tasks to complete before starting execution. So far so good.

Sometimes, a task fails. When a failure happens, I don't want any of the dependent tasks to attempt processing the output of the failed task. However, the dependent tasks have already been queued for execution long before the failure occurred. What is a good way to prevent the unwanted processing?


Solution

  • Here is what I eventually implemented. The key to making this work is exiting with status 100 on error. Sun Grid Engine stops execution of subsequent jobs upon seeing exit status 100. Torque, with an afterok dependency, stops execution of subsequent jobs upon seeing any non-zero exit status.

    qsub starts a sequence of bash scripts. Each of those bash scripts has this code:

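    handleTrappedErrors()
    {
        # capture the failure details first, before any other command overwrites $?
        errorCode=$?
        bashCommand="$BASH_COMMAND"
        scriptName=$(basename "$0")
        lineNumber=${BASH_LINENO[0]}
        # log an error message to a log file here -- not shown
        exit 100
    }

    # the trap must name the handler function defined above
    trap handleTrappedErrors ERR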
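
To check the trap behaviour without a cluster, the following minimal sketch runs a simulated failing step in a child bash process and inspects its exit status. The `step_script` and `step_status` names are hypothetical, and `false` stands in for a real pipeline command that fails; the handler is a shortened version of the one above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one pipeline step. The trap mirrors the handler above.
step_script='
handleTrappedErrors() {
    echo "error $? in: $BASH_COMMAND" >&2
    exit 100
}
trap handleTrappedErrors ERR
false    # simulated failing pipeline command
echo "not reached"
'

# Run the step in its own bash process, as the scheduler would, and keep its status.
step_status=0
bash -c "$step_script" || step_status=$?
echo "step exit status: $step_status"
```

The child process is used on purpose: bash does not run an ERR trap for commands that are part of an `if` test or an `&&`/`||` list, so checking the handler inside such a construct in the same shell would hide the failure.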
    

    Torque (as Derek mentioned):

    qsub -W depend=afterok:<jobid> ...
    

    Sun Grid Engine:

    qsub -hold_jid <jobid> ...
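
The conditional logic that picks between the two forms could be sketched as below. This is only an illustration under assumptions: the `dependency_args` function, the `torque` and `sge` labels, the job ID `12345`, and the `step2.sh` script name are all hypothetical, and the commands are printed rather than submitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the qsub dependency arguments for the given scheduler.
dependency_args() {
    local scheduler="$1" parent_job_id="$2"
    case "$scheduler" in
        torque) printf '%s\n' "-W depend=afterok:${parent_job_id}" ;;
        sge)    printf '%s\n' "-hold_jid ${parent_job_id}" ;;
        *)      echo "unknown scheduler: ${scheduler}" >&2; return 1 ;;
    esac
}

# 12345 is a placeholder for the ID reported when the earlier job was submitted.
torque_args=$(dependency_args torque 12345)
sge_args=$(dependency_args sge 12345)

# Print the commands instead of running them, so the sketch works without a cluster.
echo "torque: qsub ${torque_args} step2.sh"
echo "sge:    qsub ${sge_args} step2.sh"
```

In a real submission script the placeholder would be replaced by the job ID that qsub prints for the earlier task. On Grid Engine, qsub's `-terse` option should reduce that output to the bare job ID, which makes it easier to capture.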