
Having a SLURM job check how long until it ends


I am wondering if I can get a SLURM job to check how much longer it can keep running before the time limit specified by #SBATCH --time is reached.

I thought of a solution, but it seems horrible to me: I know I can see how long the job has been running with squeue and its options. So I could have the job call squeue when I want the check to be done, store the output of the command in a variable (or a file) and read the amount of time since the job started. Something like this:

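 status=$(squeue -j $job_id)    # Alternatively: squeue -u my_username
 status_array=($status)         # split the output on whitespace
 time_since_start=${status_array[13]}   # TIME column of the first job row, assuming the default squeue format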

Then it would only be a matter of computing the time difference. The problem with the above approach is that the job needs to know its own job_id. Even if I use -u my_username, I still need the job_id if I have more than one job running simultaneously, which is my typical case. The only way I can see to get the job to know its ID is to instruct the script that launches it to write that ID to a file and then have the job read that file.

I am wondering whether there is a simpler/more elegant solution, maybe using SLURM commands (something like squeue -magic_option), but I could not find anything.


Solution

  • I am extending the answer by @damienfrancois because the format of the TimeLeft field printed by squeue depends on how much time the job has left. If the remaining time is under 1 hour, the output of TimeLeft has the format mm:ss, not hh:mm:ss. If the remaining time is over 24 hours, the format is dd-hh:mm:ss.

    We can account for the possible variation in the number of fields by adding a second delimiter, using the NF variable, and adding if-else statements:

    squeue -h -j "$SLURM_JOB_ID" -O TimeLeft | awk -F':|-' '{
        # The number of fields tells us which format squeue used
        if (NF == 1) print $NF;
        else if (NF == 2) print $1 * 60 + $2;
        else if (NF == 3) print $1 * 3600 + $2 * 60 + $3;
        else if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
    }'
    

    The output is in seconds.
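
As a rough sketch of how the conversion above might be used from inside a job script, it can be wrapped in a function and compared against a threshold. The function name timeleft_to_seconds and the 600-second threshold are illustrative assumptions, not part of SLURM. The sketch relies on SLURM setting the SLURM_JOB_ID environment variable inside a running job, and it returns a non-zero status for any value that is not purely numeric (for example UNLIMITED).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not a SLURM command): convert a TimeLeft string of
# the form ss, mm:ss, hh:mm:ss or dd-hh:mm:ss into a number of seconds.
timeleft_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        # Reject anything that is not digits separated by ":" or "-"
        !/^[ \t]*[0-9]+([-:][0-9]+)*[ \t]*$/ { exit 1 }
        NF == 1 { print $1 + 0 }
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
    '
}

# Only query squeue when actually running inside a SLURM job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    if remaining=$(timeleft_to_seconds "$(squeue -h -j "$SLURM_JOB_ID" -O TimeLeft)"); then
        # 600 seconds is an arbitrary example threshold
        if [ "$remaining" -lt 600 ]; then
            echo "Less than 10 minutes left (${remaining}s)"
        fi
    else
        echo "Could not parse the remaining time" >&2
    fi
fi
```

Keeping the parsing in a function separate from the squeue call makes it possible to check the conversion with fixed strings such as 05:30 without needing a running job.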