Search code examples
parallel-processingmpiperformance-testinghpcmicrobenchmark

Estimating strong scaling efficiency when single-node run is not possible


I have implemented an OpenMP/MPI hybrid parallel algorithm, and would like to measure its strong-scaling parallel efficiency. For this, I would have to calculate speed-up S=t(1)/t(N), and then the efficiency E=S/N.

Background: Having done some analysis, I was able to show that the peak efficiency of the algorithm could be expected at a problem size, at which the single node of my benchmark cluster cannot house the data required.

Possible solutions: I can either:

  1. calculate speed-up using the smallest node-count, at which the data can be housed e.g. at 4 nodes => S=t(4)/t(N), or,
  2. calculate the theoretical single-node time-to-solution t(1) by extrapolation, and then use that value as reference.

Questions:

  1. Which approach is better and why?
  2. If I use the first approach, can I, strictly speaking, even refer to it as strong-scaling parallel efficiency, seeing as it doesn't conform to the definition provided above?

Bonus question: When we measure t(1), should we run the algorithm with simulated communication calls (i.e. by calling mpirun -n 1 ./my_benchmark_program), or should we rather call a version of the program which performs no communication at all (i.e. ./my_openmp_only_benchmark_program)?

I hope this post is clear, please ask for clarification if it isn't. Any help will be greatly appreciated. Thanks in advance.


Solution

  • There are various problems with the classical definition of speedup if you are using MPI. The single processor case involves no communication, while the two-processor one does, so there is overhead in the t(2) case and it will always be less than twice as fast. This is even worse if you have a multicore/multinode setup, where up to 16 (or so) processes will run on a single node, so t(17) will suddenly be much slower because it starts involving a second node.

    This means you can not simply apply the textbook formulas. You need to explain how you are doing your scalability study. For instance: one process per node until the number of processes == number of nodes, then start putting multple processes on each node, et cetera.

    The fact that the single-process case does not fit in memory is then a minor hiccup: you start with a base case of multiple processes, and document that fact, plus your reasoning for the base case that you actually used.