I am running many repeats of the same job using NumPy on a cluster that uses Sun Grid Engine (SGE) to distribute jobs (StarCluster). Each of my nodes has 2 cores (c3.large on AWS). So say I have 5 nodes, each with 2 cores.
The matrix operations in NumPy are able to use more than one core at a time. What I'm finding is that SGE sends out 10 jobs to run at once, one per core. This makes each job take longer to run. Looking at htop, it looks like the two jobs on each node are fighting over resources.
How can I tell qsub to distribute 1 job per node, so that when I submit my jobs only 5 run at once, not 10?
Step 1: Add a complex value to your cluster. Run
qconf -mc
Add a line like
exclusive excl INT <= YES YES 0 0
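For reference, here is the same line laid out under the column headers that the qconf -mc editor shows, which makes it easier to see which field is which. This is only a formatting aid; the values are the ones from the line above.

```
#name       shortcut  type  relop  requestable  consumable  default  urgency
exclusive   excl      INT   <=     YES          YES         0        0
```

The "consumable" field set to YES is what makes each running job use up part of the value a host offers.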
Step 2: For each of your nodes, define a value for that complex value.
qconf -rattr exechost complex_values exclusive=1 <nodename>
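With several nodes, the Step 2 command can be repeated in a loop. The following is a minimal sketch, assuming that `qconf -sel` (which lists the execution hosts) returns exactly the nodes you want to configure. The helper name `exclusive_cmd` and the `APPLY` variable are made up for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Print the Step 2 command for one host (hypothetical helper name).
# $1 = host name, $2 = value to assign to "exclusive" (defaults to 1)
exclusive_cmd() {
    printf 'qconf -rattr exechost complex_values exclusive=%s %s\n' "${2:-1}" "$1"
}

# Dry run by default; set APPLY=1 to actually execute the commands.
# The loop is skipped entirely on machines where qconf is not installed.
if command -v qconf >/dev/null 2>&1; then
    for host in $(qconf -sel); do
        if [ "${APPLY:-0}" = "1" ]; then
            sh -c "$(exclusive_cmd "$host")"
        else
            exclusive_cmd "$host"
        fi
    done
fi
```

Reviewing the printed commands before running with APPLY=1 is a cheap way to check the host list first.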
Here we set exclusive to 1. Then, when you launch a job, request 1 unit of that resource with -l (qsub accepts the same option as qrsh), e.g.:
qrsh -l exclusive=1 <myjob>
If you were willing to have 2 jobs per node, you could set that value to 2 in step 2.
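Since the question is about qsub, here is a sketch of submitting the repeats as batch jobs with the same resource request. It assumes a job script called run_job.sh and 10 repeats, both of which are placeholders. The helper name `submit_cmd` and the `SUBMIT` and `N` variables are also made up for this example. By default it only prints the qsub commands.

```shell
#!/bin/sh
# Build the qsub command for one repeat of the job.
# run_job.sh is a hypothetical job script; replace it with your own.
# $1 = repeat index, used only to give each job a distinct name via -N
submit_cmd() {
    printf 'qsub -l exclusive=1 -N repeat_%s run_job.sh\n' "$1"
}

# Print the commands for N repeats; set SUBMIT=1 to submit them for real.
N="${N:-10}"
i=1
while [ "$i" -le "$N" ]; do
    if [ "${SUBMIT:-0}" = "1" ]; then
        sh -c "$(submit_cmd "$i")"
    else
        submit_cmd "$i"
    fi
    i=$((i + 1))
done
```

With 5 nodes each offering exclusive=1, at most 5 of these jobs should run at once and the rest wait in the queue until a node frees up.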
EDIT: This is how to configure it per node. You could instead do it for the entire cluster in step 1 by setting the "default" column to 1.