I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue be run in such a way that on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8, and the number of MPI tasks is 5 (so a total of 40 cores requested). And in this cluster, there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course none of these:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
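The rule behind these examples can be checked mechanically. Here is a minimal sketch in shell; the function name allocation_ok is made up for illustration, and its arguments are M followed by the number of cores allocated on each node:

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE): succeed only if every per-node
# allocation is a positive integer multiple of M.
# Usage: allocation_ok M cores_on_node1 cores_on_node2 ...
allocation_ok() {
    local m="$1" cores
    shift
    for cores in "$@"; do
        if [ "$cores" -le 0 ] || [ $((cores % m)) -ne 0 ]; then
            return 1
        fi
    done
    return 0
}

# The combinations from above with M=8, written as cores allocated per node:
allocation_ok 8 8 8 16 8   && echo "8 8 16 8: OK"
allocation_ok 8 4 4 8 8 16 || echo "4 4 8 8 16: not OK"
allocation_ok 8 12 12 16   || echo "12 12 16: not OK"
allocation_ok 8 3 5 16 16  || echo "3 5 16 16: not OK"
```

All four combinations add up to 40 cores; only the first one keeps every node at a multiple of 8.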
PS: There is a similar question (MPI & pthreads: nodes with different numbers of cores), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kinds of nodes, provided that each node has integer*M cores allocated to the job.
The allocation policy in SGE is specified on a per parallel environment (PE) basis. Each PE can be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter, and SGE then tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, if you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. You then submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
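On the job side, this could look roughly like the sketch below. It assumes the job was submitted with -pe mpi8ppn 40, that SGE exports the host file path in $PE_HOSTFILE (one line per host, with the host name and slot count in the first two columns), and that Open MPI's mpirun is used. The function name hybrid_launch_cmd and the program name ./my_hybrid_app are placeholders for illustration.

```shell
#!/bin/bash
# Hypothetical helper: check that every host in the SGE host file holds
# exactly M slots, then print an Open MPI launch line that starts one
# process per host with M OpenMP threads each.
# Usage: hybrid_launch_cmd <pe_hostfile> <threads_per_process> <program>
hybrid_launch_cmd() {
    local hostfile="$1" m="$2" prog="$3"
    local nhosts=0 host slots rest
    while read -r host slots rest; do
        [ -z "$host" ] && continue          # skip empty lines
        if [ "$slots" != "$m" ]; then
            echo "host $host has $slots slots, expected $m" >&2
            return 1
        fi
        nhosts=$((nhosts + 1))
    done < "$hostfile"
    # -x exports the variable to the remote processes (Open MPI option).
    echo "mpirun -npernode 1 -np $nhosts -x OMP_NUM_THREADS=$m $prog"
}

# Inside an SGE job $PE_HOSTFILE is set; outside of one this part is skipped.
if [ -n "${PE_HOSTFILE:-}" ]; then
    cmd=$(hybrid_launch_cmd "$PE_HOSTFILE" 8 ./my_hybrid_app) || exit 1
    echo "would run: $cmd"
fi
```

The sketch only prints the launch line; in a real job script you would execute it instead. The slot check is redundant when allocation_rule is 8, but it catches the case where the job ends up in a differently configured PE.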
If the above is unlikely to happen, your other (unreliable) option is to request a very large amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request ensures that SGE cannot fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you simply ask SGE for 5 slots and 23.5G of vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores, and it only works if all nodes are configured with h_vmem of less than 47 GiB (twice the 23.5G requested per slot). h_vmem serves just as an example here; any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | grep -E '(^[^ ])|(hc:)'
The method works best on clusters where node_mem = k * #cores, with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
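The arithmetic behind this can be sketched as follows. The helper name slots_that_fit is made up for illustration, and the sketch assumes that the memory consumable is the only limit on the number of slots per host (it ignores the queue's own slot count and anything else the scheduler takes into account):

```shell
#!/bin/bash
# Hypothetical helper: how many slots of a given per-slot h_vmem request
# fit into a node's h_vmem capacity (both given in the same unit).
# Usage: slots_that_fit <node_h_vmem> <per_slot_request>
slots_that_fit() {
    awk -v node="$1" -v req="$2" 'BEGIN { print int(node / req) }'
}

# A 24 GiB node and a 48 GiB node, each asked for 23.5G per slot:
for node_mem in 24 48; do
    echo "h_vmem=${node_mem}G: $(slots_that_fit "$node_mem" 23.5) slot(s) at 23.5G each"
done
```

With M=8, the 48 GiB case is only acceptable if that node really has at least 16 cores, which is why the memory-to-core ratio needs to be the same on all nodes.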
I don't claim to fully understand SGE, and my knowledge dates back to the SGE 6.2u5 era, so simpler solutions might exist nowadays.