
How to notify SGE of a job with several processes


We build our software on an SGE cluster running on CentOS slaves, and it works well. My question is: how do I tell SGE that a job takes up several cores on a given slave because it starts several processes?

Explanation:
The process involves training models, which entails many consecutive small changes to (relatively) big data files, with more than 10k jobs in total. Most of the tools we use support reading from stdin and writing to stdout. This would allow us to pipe the data from one tool to the next (tests indicate this also works well).

The problem is that when a job starts two or more processes connected by a pipe the slave will get overloaded. How can I tell the SGE the number of processes in order to avoid that? This is only needed for the SGE and the nodes to work properly, not for any form of accounting.

Example:
2 compute nodes, NodeA & NodeB, each with 10 slots, configured to be assigned jobs in 'fill-up' mode.
Job1 "tool1 -a A -b B | tool2 -c C | tool3 -d D"

When I start 'Job1' and it is assigned to NodeA, three processes run there ('tool1', 'tool2', 'tool3'). But SGE knows about only one job and still thinks it can assign 9 more jobs to NodeA instead of 7, which can overload the node.

I did look at 'pe_range', but it seems to refer to multiple jobs, not to a single job with multiple processes.

Thank you.


Solution

  • Your SGE cluster must be configured with a "parallel environment". Talk with your system admin to confirm that one exists and to find out what it is called. Then submit your job with qsub, specifying the name of the parallel environment and the number of CPU cores (slots) you need on the node. For example, if your parallel environment is called "foo" and you need 8 CPU cores, add these options to the qsub command line:

    -pe foo 8
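
Applied to the three-process pipeline from the question, the submission might look like the sketch below. It assumes the parallel environment really is named "foo" (the placeholder from above; `qconf -spl` lists the ones actually defined) and that it is configured to place all of a job's slots on one host, which in SGE is the `$pe_slots` allocation rule. The file name `job1.sh` is made up for illustration. The script only prints the qsub command so it can be checked before running it on a cluster.

```shell
#!/bin/sh
# Sketch only: "foo" is a placeholder PE name and job1.sh is a hypothetical file name.
PE_NAME="foo"
SLOTS=3   # one slot per process in the pipeline: tool1, tool2, tool3

# Write the job script containing the pipeline from the question.
cat > job1.sh <<'EOF'
#!/bin/sh
#$ -cwd
tool1 -a A -b B | tool2 -c C | tool3 -d D
EOF

# Build the submit command; it is printed here rather than executed.
SUBMIT_CMD="qsub -pe $PE_NAME $SLOTS job1.sh"
echo "$SUBMIT_CMD"
```

With 3 slots reserved, the scheduler should count NodeA as having 7 free slots rather than 9 while Job1 runs.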