parallel-processing mpi performance-testing hpc microbenchmark

Estimating strong scaling efficiency when single-node run is not possible

I have implemented an OpenMP/MPI hybrid parallel algorithm, and would like to measure its strong-scaling parallel efficiency. For this, I would have to calculate speed-up S=t(1)/t(N), and then the efficiency E=S/N.

Background: Having done some analysis, I was able to show that the peak efficiency of the algorithm could be expected at a problem size, at which the single node of my benchmark cluster cannot house the data required.

Possible solutions: I can either:

calculate speed-up using the smallest node-count, at which the data can be housed e.g. at 4 nodes => S=t(4)/t(N), or,
calculate the theoretical single-node time-to-solution t(1) by extrapolation, and then use that value as reference.

Questions:

Which approach is better and why?
If I use the first approach, can I, strictly speaking, even refer to it as strong-scaling parallel efficiency, seeing as it doesn't conform to the definition provided above?

Bonus question: When we measure t(1), should we run the algorithm with simulated communication calls (i.e. by calling mpirun -n 1 ./my_benchmark_program), or should we rather call a version of the program which performs no communication at all (i.e. ./my_openmp_only_benchmark_program)?

I hope this post is clear, please ask for clarification if it isn't. Any help will be greatly appreciated. Thanks in advance.

Solution

There are various problems with the classical definition of speedup if you are using MPI. The single processor case involves no communication, while the two-processor one does, so there is overhead in the t(2) case and it will always be less than twice as fast. This is even worse if you have a multicore/multinode setup, where up to 16 (or so) processes will run on a single node, so t(17) will suddenly be much slower because it starts involving a second node.

This means you can not simply apply the textbook formulas. You need to explain how you are doing your scalability study. For instance: one process per node until the number of processes == number of nodes, then start putting multple processes on each node, et cetera.

The fact that the single-process case does not fit in memory is then a minor hiccup: you start with a base case of multiple processes, and document that fact, plus your reasoning for the base case that you actually used.