python numpy parallel-processing sungridengine starcluster

Sun Grid Engine, force one job per node

I am running many repeats of the same job using numpy on a cluster that uses Sun Grid Engine (SGE) to distribute jobs (StarCluster). Each of my nodes has 2 cores (c3.large on AWS). So say I have 5 nodes, each with 2 cores.

The matrix operations in numpy are able to use more than one core at a time. What I'm finding is that SGE sends out 10 jobs to run at once, one per core. This is causing longer runtimes for the jobs. Looking at htop, it looks like the two jobs on each node are fighting over resources.

How can I tell qsub to distribute one job per node, so that when I submit my jobs only 5 run at once instead of 10?

Solution

Step 1: Add a complex value (a consumable resource) to your cluster. Run

qconf -mc

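Add a line like the one below (the first line is the column header that the editor already shows, included here for reference):

#name            shortcut  type        relop requestable consumable default  urgency
exclusive        excl      INT         <=    YES         YES        0        0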

Step 2: For each of your nodes, set a value for that complex.

qconf -rattr exechost complex_values exclusive=1 <nodename>
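With several nodes, step 2 has to be repeated once per execution host. A minimal sketch of scripting that, assuming placeholder hostnames (node001, node002) that you would replace with your own (for example from the output of qconf -sel); the helper names are made up for illustration, and by default it only prints the commands rather than running them:

```python
import subprocess


def exclusive_attr_commands(hosts, jobs_per_node=1):
    """Build one `qconf -rattr` command per execution host.

    `hosts` is a list of hostnames (placeholders here, not real nodes).
    """
    return [
        ["qconf", "-rattr", "exechost", "complex_values",
         f"exclusive={jobs_per_node}", host]
        for host in hosts
    ]


def apply_exclusive(hosts, jobs_per_node=1, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    commands = exclusive_attr_commands(hosts, jobs_per_node)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Must be run on a host with SGE admin rights (e.g. the master).
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical hostnames; replace with your own node names.
    apply_exclusive(["node001", "node002"])
```

Keeping dry_run=True lets you inspect the generated commands before changing the cluster configuration.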

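Here we set exclusive to 1. Because the complex is consumable, each running job that requests it uses up one unit, so a node with exclusive=1 can run only one such job at a time. Then, when you launch jobs, request "1" of that resource, e.g.: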

qrsh -l exclusive=1 <myjob>
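The question asks about qsub, and the same -l request works there. A minimal sketch of submitting many repeats this way, assuming a hypothetical job script named run_numpy_job.sh (not part of the original post); the helper names are made up for illustration, and by default it only prints the commands:

```python
import subprocess


def qsub_command(script, exclusive=1):
    """Build a qsub command that requests `exclusive` units of the complex."""
    return ["qsub", "-l", f"exclusive={exclusive}", script]


def submit_repeats(script, repeats, dry_run=True):
    """Submit the same script `repeats` times, each requesting exclusive=1."""
    commands = [qsub_command(script) for _ in range(repeats)]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Requires qsub on PATH, i.e. run this on a submit host.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # run_numpy_job.sh is a placeholder for your own job script.
    submit_repeats("run_numpy_job.sh", repeats=10)
```

With 5 nodes each set to exclusive=1, all 10 submissions are accepted, but the scheduler should start only 5 at a time and leave the rest queued until a node frees up.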

If you were willing to have 2 jobs per node, you could set that value to 2 in step 2.

EDIT: This is how to configure it per node. You could also have handled it for the entire cluster in step 1 by setting the "default" column to 1, which makes every job consume one unit of the resource even when it does not pass -l exclusive=1.