Tags: batch-processing, hpc, qsub

PBS job inter-dependency: when one job starts, cancel the others


I would like to submit a simulation to several queues on my cluster. As soon as it starts on one queue, it should be cancelled on the others. I understand this is potentially ill-defined, as several jobs could start at the same time on several queues.

It is likely that a bash script monitoring the queue could do that. Is it possible to do it directly with qsub when submitting the job?
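
As far as I can tell, the only dependency mechanism qsub exposes directly is the -W depend attribute (PBS Pro/Torque syntax), which chains one job to another job's completion or start; it does not offer a "cancel the others once one starts" behaviour, which is why a wrapper script seems necessary. A minimal sketch for illustration:

# Assumes a PBS Pro/Torque-style scheduler; run.pbs is the simulation script
JOB1=$(qsub -q queue1 run.pbs)
# The queue2 copy only becomes eligible after the queue1 copy completes successfully,
# i.e. this expresses ordering, not "first to start wins"
qsub -q queue2 -W depend=afterok:${JOB1} run.pbs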

EDIT: Below is a working example which uses a bash script. This is probably not optimal as it requires (slow) disk access.

#!/bin/bash -
#
# Exit in case of error
set -e
#
# Command-line argument is the name of the shared file
fid=$1
#
# Make sure the shared directory exists
mkdir -p "${HOME}/.dep_jobs"
if [ -f "${HOME}/.dep_jobs/${fid}" ]; then
  echo "Given name already used, abort."
  exit 1
else
  echo "Initialize case."
  touch "${HOME}/.dep_jobs/${fid}"
fi
#
# Submit master job and retrieve the ID
echo "Submitting master job"
MID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue1 run.pbs)
echo ${MID##* }
#
# Add the ID to the shared file
ln -s ${HOME}/.dep_jobs/${fid} ${HOME}/.dep_jobs/${MID##* }
echo "M ${MID##* }" >> ${HOME}/.dep_jobs/${fid}
#
# Submit slave job and retrieve the ID
echo "Submitting slave job"
SID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue2 run.pbs)
echo ${SID##* }
#
# Add the ID to the shared file
ln -s ${HOME}/.dep_jobs/${fid} ${HOME}/.dep_jobs/${SID##* }
echo "S ${SID##* }" >> ${HOME}/.dep_jobs/${fid}
#
# Terminus, finalize case
echo "Finalize case"
echo "OK" >> ${HOME}/.dep_jobs/${fid}

And the submitted PBS script should start as follows:

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Parallel
#
# Define shared file
shared_file=${HOME}/.dep_jobs/${PBS_JOBID}
#
# Read it until it finishes with "OK"
while [[ "$(more ${shared_file} | tail -n1)" != "OK" ]]; do
  sleep 1
done
#
# Read master and slave job id
while read -r line
do
  key=$(echo ${line} | awk '{print $1}')
  if [ "$key" = "M" ]; then
    MID=$(echo ${line} | awk '{print $2}')
  elif [ "$key" = "S" ]; then
    SID=$(echo ${line} | awk '{print $2}')
  fi
done < ${shared_file}
#
# Current job is master or slave?
if [ ${PBS_JOBID} = ${MID} ]; then
  key="M"
  other="${SID}"
else
  key="S"
  other="${MID}"
fi
#
# Check the status of the other job
status="$(qstat ${other} | tail -n1 | awk '{print $5}')"
#
# I am running, if the other is in queue, qdel it
if [ "${status}" = "Q" ]; then
  qdel ${other}
# If the other is running, we have race and only master survives
elif [ "${status}" = "R" ]; then
  if [ "${key}" = "M" ]; then
    qdel ${other}
  else
    exit
  fi
else
  echo "We should not be here"
  exit 1
fi
#
# The simulation goes here
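
Note that parsing the fifth column of the default qstat table is somewhat fragile, since the column layout can differ between PBS flavours. Assuming your qstat supports the full-format -f output, reading the job_state attribute directly may be more robust; a sketch:

# Alternative status check (sketch): `qstat -f` prints lines such as "job_state = R"
status="$(qstat -f ${other} | awk -F' = ' '/job_state/ {print $2}')"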

Solution

  • Here is a script that runs with the SGE scheduler. For a PBS scheduler you need to make some minimal changes, such as using #PBS instead of #$ and replacing $JOB_ID with $PBS_JOBID. For SGE the better approach would be to run qstat -u user_name -s p, which lists only pending jobs, but I could not find a similar option for the PBS scheduler; assuming it does not exist, one approach may be to use the following script for your simulation jobs (you do not need any master script):

    #!/bin/bash
    
    
    #$ -N myjobName
    #$ -q queueName
    # ... some other options if needed
    
    
    # get the list of all of your jobs (queued and running)
    myjobs="$(qstat -u username | cut -d " " -f1 | tail -n +3| tr '\n' ' ' )"
    
    # from the above list remove the current job (use PBS_JOBID for PBS scheduler)
    deljobs="$(echo "${myjobs/$JOB_ID/}")"
    
    echo "List of all jobs: $myjobs"
    echo "List of jobs to delete: $deljobs"
    
    #delete all other jobs
    qdel $deljobs
    
    #run the desired commands/programs
    date
    

    You will need to replace username in the qstat command of the above script with your own username. I would also recommend checking these commands one at a time to make sure they run correctly in your environment.
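
    With this approach there is no master script: you simply submit the same job script to each queue, and whichever copy starts first deletes the rest. For example (queue names taken from the question; the script name is a placeholder):

    qsub -q queue1 simulation_job.sh
    qsub -q queue2 simulation_job.sh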

    Here is some brief explanation of the commands I used in the script:

    qstat -u username  # list all of the user's jobs (queued and running)
    cut -d " " -f1     # extract the JOBID of each job from the previous output (first column)
    tail -n +3         # skip the first 2 (header) lines of the above output
    tr '\n' ' '        # replace each newline character with a space
    
    echo "${myjobs/$JOB_ID/}"  # from the string contained in $myjobs variable remove $JOB_ID