Tags: python, numpy, parallel-processing, sungridengine, starcluster

Sun Grid Engine, force one job per node


I am running many repeats of the same job using NumPy on a cluster that uses Sun Grid Engine (SGE) to distribute jobs (set up with StarCluster). Each of my nodes has 2 cores (c3.large on AWS), so with 5 nodes I have 10 cores in total.

The matrix operations in NumPy can use more than one core at a time. What I'm finding is that SGE sends out 10 jobs to run at once, one per core, which makes each job take longer. Looking at htop, it looks like the two jobs on each node are fighting over resources.

How can I tell qsub to distribute 1 job per node, so that when I submit my jobs only 5 run at once, not 10?


Solution

  • Step 1: Add a complex value to your cluster. Run

    qconf -mc
    

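    Add a line like the following (the columns are name, shortcut, type, relop, requestable, consumable, default, urgency):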

    exclusive        excl      INT         <=    YES         YES        0        0
    

    Step 2: For each of your nodes, set a value for that complex.

    qconf -rattr exechost complex_values exclusive=1 <nodename>
    

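    Here we set exclusive to 1. Because the complex is consumable, each node can then hold only one job that requests it at a time. When you launch jobs (with qrsh or qsub), request "1" of that resource. E.g.: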

    qrsh -l exclusive=1 <myjob>
    

    If you were willing to have 2 jobs per node, you could set that value to 2 in step 2.

    EDIT: This is how to configure it per node. You could instead have done it for the entire cluster in step 1 by setting the value in the "default" column to 1.
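
Steps 2 and 3 above could be scripted roughly as follows when there are several nodes. This is only a sketch: the node names (node001, ...) and the job script name (myjob.sh) are hypothetical placeholders, the `run`, `set_exclusive` and `submit_exclusive` helpers are made up for illustration, and it assumes the `exclusive` complex from step 1 already exists. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch only: node names and the job script below are hypothetical.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 2 for a list of nodes: set each host's "exclusive" value.
# Usage: set_exclusive <value> <node> [<node> ...]
set_exclusive() {
    value="$1"
    shift
    for node in "$@"; do
        run qconf -rattr exechost complex_values "exclusive=$value" "$node"
    done
}

# Submit a job that consumes one unit of "exclusive".
# Usage: submit_exclusive <job script> [<args> ...]
submit_exclusive() {
    run qsub -l exclusive=1 "$@"
}

set_exclusive 1 node001 node002 node003
submit_exclusive myjob.sh
```

Passing 2 instead of 1 to `set_exclusive` would correspond to allowing two such jobs per node, as mentioned above.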