I am running multiple array jobs using Slurm. For a given array job ID, say 885881, I want to count how many of its tasks failed and how many completed. Something like this:
Input:
`<some-command> -j 885881`
Output: Let's say we have 200 jobs in the array.
count | status
120 | failed
80 | completed
Secondly, it would be great if I could get the list of distinct reasons why tasks failed.
Input:
`<some-command> -j 885881`
Output:
count | reason
80 | OUT_OF_MEMORY
40 | TIMED_OUT
I believe the `sacct` command can be used to get these results, but I am not sure how.
With a one-liner like this one, you can get both pieces of information at the same time:
$ sacct -n -X -j 885881 -o state%20 | sort | uniq -c
16 COMPLETED
99 FAILED
32 OUT_OF_MEMORY
1 PENDING
The `sacct` command queries the accounting data. The `-n` option suppresses the header line and `-X` restricts the output to job allocations, leaving out the individual job steps. The `-o state%20` option requests only the State column, 20 characters wide so that long state names such as OUT_OF_MEMORY are not truncated. The output is then piped into `sort` and `uniq -c`, which do the counting.
If you really need two separate commands, you can easily adapt the above one-liner, and you can wrap it in a script or a Bash function for ease of use.
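As a rough sketch of what that could look like, here are two Bash functions built on the same pipeline. The function names `job_state_counts` and `job_failure_reasons` are made up for illustration, and the list of states treated as "not a failure" (COMPLETED, PENDING, RUNNING) is an assumption you may want to adjust for your site.

```shell
# Hypothetical helper: count the tasks of a job array per state.
# Usage: job_state_counts <jobid>
job_state_counts() {
    # awk keeps only the first word of each line, which strips the column
    # padding and turns e.g. "CANCELLED by 1234" into "CANCELLED".
    sacct -n -X -j "$1" -o state%20 | awk 'NF {print $1}' | sort | uniq -c
}

# Hypothetical helper: same count, but only for tasks that ended badly.
# Assumption: COMPLETED, PENDING and RUNNING are not failure reasons.
# Usage: job_failure_reasons <jobid>
job_failure_reasons() {
    sacct -n -X -j "$1" -o state%20 | awk 'NF {print $1}' \
        | grep -v -E '^(COMPLETED|PENDING|RUNNING)$' | sort | uniq -c
}
```

After sourcing these in your shell, `job_state_counts 885881` gives the overall breakdown and `job_failure_reasons 885881` gives only the non-successful states.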
If you would like a more elaborate solution, you can have a look at smanage and atools.