With PBSpro I can request resources to run my job. My parallel cluster job boils down to running the same file multiple times, each time with a different index / job ID. Each task spawns its own sub-processes and uses 4 CPUs in total. The job is embarrassingly parallel, with every task independent of the others, and thus a good fit for the GNU parallel tool. To get the best usage of the cluster and squeeze my tasks in wherever there is space, I place a resource request to PBS as follows:
#PBS -l select=60:ncpus=4:mpiprocs=1
The resulting $PBS_NODEFILE then contains a list of hosts assigned to the job.
The problem is that the PBSpro job manager can assign several of these 4-CPU tasks to the same node, or only one, and somehow this information has to be passed to GNU parallel. Passing --sshloginfile $PBS_NODEFILE does not carry over how many resources are available on each node (and it appears GNU parallel only uses the unique names from this list).
What goes wrong is that GNU parallel sees X cores (the total number of cores on the host / node) regardless of whether only one 4-CPU task was assigned to that host. Limiting the number of jobs per host then either leaves cores idle on hosts that were given several tasks, or runs more tasks on a host than it has resources for, oversubscribing the cores.
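To make the mismatch concrete, here is a small sketch using a made-up node file. The hostnames node01 to node03 are hypothetical, and it assumes the node file holds one line per assigned chunk, which is what mpiprocs=1 implies. Counting the repeated lines recovers how many 4-CPU tasks each host was given, which is exactly the information a list of unique host names loses.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for $PBS_NODEFILE: one line per assigned chunk.
nodefile=$(mktemp)
printf '%s\n' node01 node01 node01 node02 node03 node03 > "$nodefile"

# Sort so identical hosts are adjacent, then count the lines per host.
chunks_per_host=$(sort "$nodefile" | uniq -c | awk '{print $2 "=" $1}' | paste -sd' ' -)
echo "$chunks_per_host"   # node01=3 node02=1 node03=2

rm -f "$nodefile"
```

Here node01 should run three tasks at once, node02 only one, even though both may have the same number of physical cores.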
The problem boils down to:
Use the -S flag to specify the servers, and its x/$SERVERNAME variant to limit the number of CPUs (x) used on that server.
The first step is to use bash to generate the input for the -S flag:
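NCPU=4
# Count the lines per host (one line per assigned chunk) and multiply by the CPUs per chunk.
# awk cannot see the shell's $NCPU inside single quotes, so it is passed in with -v.
HOSTS=$(sort "$PBS_NODEFILE" | uniq -c | awk -v ncpu="$NCPU" 'BEGIN{OFS=""}{print $1*ncpu,"/",$2}' | tr '\n' ',' | sed 's/,$//')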
(credit to Hiu)
This bash command outputs a comma-separated list of servers, each prefixed with its number of available CPU cores.
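As an illustration of what that list looks like, the following sketch runs the same counting idea on a made-up node file. The helper name build_sshlogins and the hostnames are hypothetical and only serve the example; it again assumes one node-file line per 4-CPU chunk.

```shell
#!/usr/bin/env bash
# build_sshlogins NODEFILE NCPU
# Hypothetical helper: prints "cores/host,cores/host,..." for use with -S.
build_sshlogins() {
    sort "$1" | uniq -c |
        awk -v n="$2" '{printf "%s%d/%s", sep, $1 * n, $2; sep = ","}'
}

# Made-up node file standing in for $PBS_NODEFILE.
demo=$(mktemp)
printf '%s\n' node01 node01 node02 > "$demo"

hosts=$(build_sshlogins "$demo" 4)
echo "$hosts"   # 8/node01,4/node02

rm -f "$demo"
```

Two chunks on node01 become 8 usable cores, one chunk on node02 becomes 4, regardless of how many cores the machines physically have.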
Thereafter run parallel as follows:
PERC=$((100/$NCPU))
seq 0 999 | parallel -j $PERC% -N1 -u -S $HOSTS "cd $PBS_O_WORKDIR; python3 $WORKING_PATH$INPUT_FILENAME {}"
Where:
seq 0 999: runs 1000 tasks with IDs ranging from 0 up to and including 999
-j $PERC%: equals -j 25% (100% / 4 for 4 CPUs)
-N1: sends only 1 argument to each task
-u: prints output immediately (and has some speed advantages)
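The effect of the percentage can be checked with a little arithmetic. The sketch below uses a made-up host list and assumes that the percentage is applied to the core count given before the slash; the numbers here divide evenly, so rounding does not come into play.

```shell
#!/usr/bin/env bash
NCPU=4
PERC=$((100 / NCPU))        # 25
HOSTS="8/node01,4/node02"   # hypothetical list in the x/host format

slots=""
old_ifs=$IFS
IFS=','
for entry in $HOSTS; do
    cores=${entry%%/*}      # number before the slash
    host=${entry#*/}        # host name after the slash
    # Simultaneous tasks on this host: PERC percent of its advertised cores.
    slots="${slots:+$slots }${host}:$((cores * PERC / 100))"
done
IFS=$old_ifs

echo "$slots"   # node01:2 node02:1
```

Each host ends up with one running task per assigned 4-CPU chunk, which matches what PBS reserved.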