I have a Sun Grid Engine cluster on AWS EC2 that I set up using Starcluster. Each node has 4 processors and 16G RAM. I would like to submit a task array that will dispatch 2 jobs at a time each using up a full node (all 4 processors and 16G RAM). However, I don't want to create a parallel environment with flags like -pe smp 4 because empirically that reduces performance substantially. Is there a flag for qsub that says something like "submit job to a node that has 16G of memory that hasn't been allocated to any other job"? The flags I'm aware of are
-l mem_free=16g: submit the job to a node only if it has 16G of memory free at that moment
-l h_vmem=16g: kill the job if its memory usage goes above 16G
Neither of these solves my problem. With mem_free=16g, the jobs initially consume memory slowly, so qsub schedules all of the tasks onto the 2 nodes, and then they all run out of memory at the same time.
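For concreteness, a task-array submission with these flags would look roughly like this (run_task.sh and the array bounds are placeholders, not from my actual setup):

```shell
# Node only needs 16G free at scheduling time; later growth is not reserved
qsub -t 1-100 -l mem_free=16g run_task.sh

# Hard limit: SGE kills any task whose memory usage exceeds 16G
qsub -t 1-100 -l h_vmem=16g run_task.sh
```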
I solved this with a manually managed consumable variable. Here is the StarCluster code for it.
So basically it creates a consumable variable "da_mem_gb". Each machine starts with an initial value equal to its total RAM. Jobs then request the amount of RAM they need through that variable. If a job requests all of a machine's RAM, only one such job is assigned to that machine at a time.
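A sketch of the SGE side of this setup, run on the cluster master. The complex name da_mem_gb matches the description above; the shortcut name, host names, and the 16g value are assumptions for illustration:

```shell
# 1. Define "da_mem_gb" as a consumable complex.
#    Columns: name  shortcut  type  relop  requestable  consumable  default  urgency
qconf -sc > complexes.txt
echo "da_mem_gb dmg MEMORY <= YES YES 0 0" >> complexes.txt
qconf -Mc complexes.txt

# 2. Give each execution host an initial value equal to its physical RAM.
#    SGE decrements this per host as jobs requesting the resource are scheduled.
for host in node001 node002; do
    qconf -mattr exechost complex_values da_mem_gb=16g "$host"
done

# 3. Jobs request RAM through the consumable. A job asking for all 16g
#    blocks further da_mem_gb jobs on that host until it finishes.
qsub -t 1-100 -l da_mem_gb=16g run_task.sh
```

In a StarCluster deployment these commands would typically go in a plugin that runs on the master after the cluster comes up, so every freshly started cluster gets the consumable configured automatically.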