My program uses MPI+pthreads: n-1 MPI processes run pure MPI code, while a single MPI process uses pthreads. That process contains only two threads (the main thread and one pthread). Suppose the HPC cluster I want to run this program on consists of compute nodes with 12 cores each. How should I write my batch script to maximise utilisation of the hardware?
Below is the batch script I wrote. I use export OMP_NUM_THREADS=2 because the last MPI process has two threads, and I have to assume that the others have two threads each as well.
I then allocate 6 MPI processes per node, so each node can run 6 × OMP_NUM_THREADS = 12 threads (the number of cores per node), despite the fact that all MPI processes but one have a single thread.
#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -m xxx
#BSUB -R "span[ptile=6]"
#BSUB -x
export OMP_NUM_THREADS=2
How can I write a better script for this?
The following should work if you'd like the last rank to be the hybrid one:
#BSUB -n 20
#BSUB -R "span[ptile=12]"
#BSUB -x
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
$FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program
If you'd like rank 0 to be the hybrid one, simply switch the two lines:
$MPIEXEC $FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program : \
$FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program
This utilises the ability of Open MPI to launch MIMD programs.
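Putting the pieces together, a complete batch script along these lines might look like the sketch below. It reuses the #BSUB directives from the question, with ptile=12 so each node is filled by 12 single-threaded ranks (the hybrid rank's node then runs 11 ranks plus the two-thread rank). $MPIEXEC and $FLAGS_MPI_BATCH are assumed to be site-provided variables, as discussed further down:

```shell
#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -R "span[ptile=12]"
#BSUB -x

# 19 single-threaded ranks, then one hybrid rank with 2 threads
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
    $FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program
```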
You mention that your hybrid rank uses POSIX threads and yet you are setting an OpenMP-related environment variable. If you are not really using OpenMP, you don't have to set OMP_NUM_THREADS at all, and this simple mpiexec command should suffice:
$MPIEXEC $FLAGS_MPI_BATCH ./program
(In case my guess about the educational institution where you study or work turns out to be wrong, remove $FLAGS_MPI_BATCH and replace $MPIEXEC with mpiexec.)