Search code examples
parallel-processingpbstorque

PBS torque: how to solve cores waste problem in parallel tasks that spend very different time from each other?


I'm running parallel MATLAB or python tasks in a cluster that is managed by PBS torque. The embarrassing situation now is that PBS think I'm using 56 cores but that's in the first and eventually I have only 7 hardest tasks running. 49 cores are wasted now.

My parallel tasks took very different time because they did searches in different model parameters, I didn't know which task will spend how much time before I have tried. In the start all cores were used but soon only the hardest tasks ran. Since the whole task was not finished yet PBS torque still thought I was using full 56 cores and prevent new tasks run but actually most cores were idle. I want PBS to detect this and use the idle cores to run new tasks.

So my question is that are there some settings in PBS torque that can automatically detect real cores used in the task, and allocate the really idle cores to new tasks?

#PBS -S /bin/sh
#PBS -N alps_task
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=56
#PBS -q batch
#PBS -l walltime=1000:00:00
#HPC -x local
cd /tmp/$PBS_O_WORKDIR
alpspython spin_half_correlation.py 2>&1 > tasklog.log

Solution

  • A short answer to your question is No: PBS has no way to reclaim unused resources allocated to a job.

    Since your computation is essentially a bunch of independent tasks, what you could and probably should do is try to split your job into 56 independent jobs each running an individual combination of model parameters and when all the jobs are finished you could run an additional job to collect and summarize the results. This is a well supported way of doing things. PBS provides has some useful features for this type of jobs such as array jobs and job dependencies.