
SLURM Job Array | Statuses of all subjobs


I have SLURM 21.08.8-2 installed on a cluster running Ubuntu 20.04. I have an existing workflow that submits a job array. I need to get the counts of failed and successful subjobs for a given array job. I've been reviewing documentation and haven't seen a SLURM variable that would provide these counts. I know I could create some logic to count this but was hoping there's a built-in variable available.

Does anyone have a good solution to this?


Solution

  • One nice tool for tracking the progress and failures of jobs in an array is atools. It is a set of Python utilities that also makes re-submitting failed jobs easier.
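
If installing an extra tool is not an option, the counting logic mentioned in the question can be sketched around `sacct`. This is not part of the answer above and is only a minimal sketch: it assumes job accounting is enabled on the cluster so that `sacct` returns records, and the function names `count_states` and `array_job_states` are made up for illustration.

```python
import subprocess
import sys
from collections import Counter


def count_states(sacct_output: str) -> Counter:
    """Tally states from `sacct -X -n -P --format=JobID,State` output.

    Each input line is expected to look like "12345_7|COMPLETED".
    """
    counts = Counter()
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        _job_id, _, state = line.partition("|")
        # States can carry a suffix, e.g. "CANCELLED by 1000"; keep the first word.
        words = state.split()
        counts[words[0] if words else "UNKNOWN"] += 1
    return counts


def array_job_states(job_id: str) -> Counter:
    """Run sacct for one array job ID and count the state of each record."""
    result = subprocess.run(
        # -X: one record per allocation (skips .batch/.extern steps)
        # -n: no header line, -P: pipe-delimited output
        ["sacct", "-j", job_id, "-X", "-n", "-P", "--format=JobID,State"],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_states(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python array_counts.py <array_job_id>
    for state, n in sorted(array_job_states(sys.argv[1]).items()):
        print(f"{state}: {n}")
```

Tasks that have not started yet may be reported by `sacct` as a single grouped record (for example `12345_[5-10]`), so the PENDING count from this sketch can undercount; the COMPLETED and FAILED counts are the ones the question asks about.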