I am using Open MPI 1.6 on a cluster with 8 nodes, each with 8 cores. I am using this command to run my application:
/path/to/mpirun --mca btl self,sm,tcp --hostfile $PBS_NODEFILE -np $num_core /path/to/application
I ran some experiments and got the following data:
num nodes | cores per node | total cores | execution time
1         | 2              | 2           | 8.5 sec
1         | 4              | 4           | 5.3 sec
1         | 8              | 8           | 7.1 sec
2         | 1              | 2           | 11 sec
2         | 2              | 4           | 9.5 sec
2         | 4              | 8           | 44 sec   <- this is too slow
As you can see, the execution time in the last row (2 nodes, 4 cores each, 8 cores in total) is much slower than the others. I expected some overhead from using more than one node, but I didn't expect such a drastic degradation.
So my question is: are there any Open MPI performance parameters I am missing when running jobs on a cluster using more than one node? I assumed that the --mca btl self,sm,tcp parameter automatically uses shared memory for communication within a node and falls back to TCP for communication that goes outside a node. Do I understand it correctly?
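If it helps, I could also rerun with higher BTL verbosity to see which components actually get picked, something like this (assuming the btl_base_verbose MCA parameter works this way in 1.6):

/path/to/mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 --hostfile $PBS_NODEFILE -np $num_core /path/to/application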
I know it is hard to tell without knowing the application, but I am asking about general parameter tuning that should be independent of the application.
A side note first: if your Open MPI build had tight integration with PBS/Torque (via the tm module), you would not need to specify either the hostfile or the number of MPI processes; mpirun would obtain both from the batch environment.

Given all that, your code doesn't seem to scale past 4 MPI processes: it runs slower with 8 processes than with 4 even on a single machine. This could be due to memory bandwidth saturation or due to a high communication-to-computation ratio (which usually means that your problem size is too small). It is hard to tell which one is the culprit without seeing more code, but my wild guess is that it is the latter. TCP has very high latency, especially when coupled with slow networks like Ethernet.
In time one learns to anticipate this kind of ill behaviour on certain types of networks based on the structure of the algorithm, but until then I would suggest that you use an MPI profiling or tracing tool to investigate the behaviour of your program. See this question for a list of such tools.
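Even before setting up a full tracing tool, a hand-rolled ping-pong between two ranks will show you the raw latency gap between shared memory and TCP: run it once with both ranks on one node and once with the ranks on different nodes. A minimal sketch, assuming your code base is C (the iteration count and message size are just illustrative):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NITER    1000
#define MSG_SIZE 1024          /* bytes per message; illustrative size only */

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[MSG_SIZE];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2) {
        if (rank == 0)
            fprintf(stderr, "run this with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    memset(buf, 0, MSG_SIZE);

    /* one untimed warm-up round trip so connection setup is excluded */
    if (rank == 0) {
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    /* timed ping-pong loop: rank 0 sends, rank 1 echoes back */
    for (i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g microseconds\n",
               (t1 - t0) / NITER * 1e6);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and run it with -np 2, once restricted to a single node and once spanning two nodes; the difference between the two reported round-trip times is roughly the extra cost your application pays for every inter-node message.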