I would like to submit a simulation to several queues on my cluster. As soon as one queue starts it, the copies on the other queues should be cancelled. I understand this is potentially ill-defined, as several jobs could start at the same time on several queues.
A bash script monitoring the queues could probably do this. Is it possible to do it directly with qsub when submitting the job?
EDIT: Below is a working example that uses a bash script. It is probably not optimal, as it requires (slow) disk access.
#!/bin/bash -
#
# Exit in case of error
set -e
#
# Command-line argument is the name of the shared file
fid=$*
if [ -f ${HOME}/.dep_jobs/${fid} ]; then
echo "Given name already used, abort."
exit 1
else
echo "Initialize case."
touch ${HOME}/.dep_jobs/${fid}
fi
#
# Submit master job and retrieve its ID (keep only the last word of the qsub output)
echo "Submitting master job"
MID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue1 run.pbs)
MID=${MID##* }
echo "${MID}"
#
# Link the shared file under the job ID and record the ID in it
ln -s "${HOME}/.dep_jobs/${fid}" "${HOME}/.dep_jobs/${MID}"
echo "M ${MID}" >> "${HOME}/.dep_jobs/${fid}"
#
# Submit slave job and retrieve its ID
echo "Submitting slave job"
SID=$(qsub -l select=1:ncpus=1:mpiprocs=1 -q queue2 run.pbs)
SID=${SID##* }
echo "${SID}"
#
# Link the shared file under the job ID and record the ID in it
ln -s "${HOME}/.dep_jobs/${fid}" "${HOME}/.dep_jobs/${SID}"
echo "S ${SID}" >> "${HOME}/.dep_jobs/${fid}"
#
# Finalize the case: the "OK" marker tells the jobs the file is complete
echo "Finalize case"
echo "OK" >> "${HOME}/.dep_jobs/${fid}"
The submitted PBS script should then start as follows:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N Parallel
#
# Define shared file (the symlink created by the wrapper makes it reachable via the job ID)
shared_file=${HOME}/.dep_jobs/${PBS_JOBID}
#
# Wait until the wrapper has finished writing it (last line is "OK")
while [[ "$(tail -n 1 ${shared_file})" != "OK" ]]; do
    sleep 1
done
#
# Read master and slave job IDs
while read -r line; do
    key=$(echo ${line} | awk '{print $1}')
    if [ "$key" = "M" ]; then
        MID=$(echo ${line} | awk '{print $2}')
    elif [ "$key" = "S" ]; then
        SID=$(echo ${line} | awk '{print $2}')
    fi
done < ${shared_file}
#
# Is the current job the master or the slave?
if [ "${PBS_JOBID}" = "${MID}" ]; then
    key="M"
    other="${SID}"
else
    key="S"
    other="${MID}"
fi
#
# Check the status of the other job
status="$(qstat ${other} | tail -n1 | awk '{print $5}')"
#
# This job is running: if the other one is still queued, delete it
if [ "${status}" = "Q" ]; then
    qdel ${other}
# If the other one is also running, we have a race and only the master survives
elif [ "${status}" = "R" ]; then
    if [ "${key}" = "M" ]; then
        qdel ${other}
    else
        exit
    fi
else
    echo "We should not be here"
    exit 1
fi
#
# The simulation goes here
Here is a script that runs with the SGE scheduler. For the PBS scheduler you need to make some minimal changes, like using #PBS instead of #$ and $PBS_JOBID instead of $JOB_ID.
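For instance, the header of a PBS version of the script below would look roughly like this (a sketch, keeping the same job and queue names):
#!/bin/bash
#PBS -N myjobName
#PBS -q queueName
# ... and $PBS_JOBID replaces $JOB_ID in the body of the script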
Also, for the SGE scheduler a better approach would be to run qstat -u user_name -s p, which lists only pending jobs. I could not find a similar option for the PBS scheduler, so assuming it does not exist, one approach may be to use the following script for your simulation jobs (you do not need any master script):
#!/bin/bash
#$ -N myjobName
#$ -q queueName
#... some other options if needed
# get the list of all jobs belonging to the user (running and pending)
myjobs="$(qstat -u username | cut -d " " -f1 | tail -n +3 | tr '\n' ' ')"
# from the above list remove the current job (use $PBS_JOBID for the PBS scheduler)
deljobs="$(echo "${myjobs/$JOB_ID/}")"
echo "List of all jobs: $myjobs"
echo "List of jobs to delete: $deljobs"
# delete all other jobs
qdel $deljobs
# run the desired commands/programs
date
You will need to change username in the qstat command of the above script to your own username. I would also recommend checking those commands one at a time to make sure they run correctly in your environment.
Here is a brief explanation of the commands used in the script:
qstat -u username         # list all jobs belonging to username
cut -d " " -f1            # extract the job ID (first column) from each line of the previous output
tail -n +3                # skip the first two (header) lines of the above output
tr '\n' ' '               # replace newline characters with spaces
echo "${myjobs/$JOB_ID/}" # remove $JOB_ID from the string contained in the $myjobs variable