I have SLURM 21.08.8-2 installed on a cluster running Ubuntu 20.04. I have an existing workflow that submits a job array, and I need the counts of failed and successful subjobs for a given array job. I've been reviewing the documentation and haven't seen a SLURM variable that provides these counts. I know I could write some logic to count them myself, but I was hoping there's a built-in variable available.
Does anyone have a good solution to this?
One nice tool for tracking the progress and failures of jobs in an array is atools. It is a set of Python utilities that makes re-submitting failed jobs easier.
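
If you do end up counting by hand, one possible approach is to tally the states that `sacct` reports for the array. Below is a minimal sketch, not a definitive solution. It assumes job accounting is enabled on the cluster so that `sacct` returns records, and the helper names `count_states` and `array_job_counts` are made up for illustration. It also treats each output line as one subjob, which may undercount tasks that have not started yet, since those can show up collapsed into a single line.

```python
import subprocess
from collections import Counter


def count_states(sacct_output: str) -> Counter:
    """Tally job states from 'JobID|State' lines (sacct -n -P output)."""
    counts = Counter()
    for line in sacct_output.splitlines():
        _job_id, sep, state = line.strip().partition("|")
        words = state.split()
        if not sep or not words:
            continue  # skip blank or malformed lines
        # Keep only the first word, so "CANCELLED by 1000" counts as CANCELLED.
        counts[words[0]] += 1
    return counts


def array_job_counts(job_id: str) -> Counter:
    """Query sacct for one array job and return a per-state tally.

    -X limits output to the allocations (no .batch/.extern steps),
    -n drops the header, -P gives '|'-delimited output.
    """
    result = subprocess.run(
        ["sacct", "-j", job_id, "-X", "-n", "-P", "--format=JobID,State"],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_states(result.stdout)
```

With something like this, `array_job_counts("1234")["FAILED"]` and `array_job_counts("1234")["COMPLETED"]` would give the two numbers asked about, where `1234` stands in for the array's job ID.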