
How to get count of failed and completed jobs in an array job of SLURM


I am running multiple array jobs using Slurm. For a given array job ID, say 885881, I want to count how many of its tasks failed and how many completed. Something like this:

Input:

<some-command> -j 885881

Output (assuming the array has 200 tasks):

count | status
120   | failed
80    | completed

Secondly, it would be great to get a list of the distinct reasons why tasks failed, with a count for each.

Input:

`<some-command> -j 885881`

Output:

count | reason
80    | OUT_OF_MEMORY
40    | TIMED_OUT

I believe the sacct command can be used to get these results, but I am not sure how.


Solution

  • With a one-liner like this one, you can get both pieces of information at once:

    $ sacct -n -X -j 885881 -o state%20 | sort | uniq -c
         16            COMPLETED
         99               FAILED
         32        OUT_OF_MEMORY
          1              PENDING
    

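    The sacct command queries Slurm's accounting data. The -n option suppresses the header line, -X limits the output to one line per job allocation rather than one per job step, and -o state%20 requests only the State column, 20 characters wide. The output is then fed into sort and uniq -c, which do the counting. Because Slurm reports causes such as OUT_OF_MEMORY as job states in their own right, the same listing also shows why tasks did not complete.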

    If you really need two separate commands, the one-liner is easy to adapt, and you can wrap it in a script or a Bash function for convenience.

    For a more elaborate solution, have a look at smanage and atools.
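
As a rough sketch of the "two separate commands" idea, the one-liner can be split into a pair of Bash functions. All function names here are made up for illustration and are not part of Slurm. The sketch assumes that every state other than COMPLETED, PENDING and RUNNING should count as a failure, which may not match your definition. The counting helpers read states from stdin, so they can be tried on sample text without access to a cluster; only the last two wrappers call sacct.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical, not Slurm commands.

# Read one job state per line on stdin and print "count | state" rows.
# Only the first word is kept, so "CANCELLED by 1234" counts as CANCELLED.
count_states() {
    awk 'NF { n[$1]++ }
         END { for (s in n) printf "%-5d | %s\n", n[s], s }' |
        LC_ALL=C sort -t'|' -k2
}

# View 1: completed vs. failed.
# Assumption: anything other than COMPLETED, PENDING or RUNNING is a failure.
count_outcomes() {
    awk 'NF {
             if ($1 == "COMPLETED") { print "completed" }
             else if ($1 != "PENDING" && $1 != "RUNNING") { print "failed" }
         }' | count_states
}

# View 2: per-state breakdown of the failures (same assumption as above).
count_failure_reasons() {
    awk 'NF && $1 != "COMPLETED" && $1 != "PENDING" && $1 != "RUNNING"' |
        count_states
}

# Thin wrappers around the sacct call from the answer; these need a cluster.
array_outcomes() { sacct -n -X -j "$1" -o state%20 | count_outcomes; }
array_failure_reasons() { sacct -n -X -j "$1" -o state%20 | count_failure_reasons; }
```

With these loaded in a shell, `array_outcomes 885881` would print the completed/failed table and `array_failure_reasons 885881` the breakdown by state. Adjust the list of states excluded from "failed" if your jobs can be in other non-terminal states.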