There are many articles and books on problems in HPC, but I feel like I am missing how to diagnose scaling and efficiency issues. For example, I am reading a book called "Introduction to High Performance Computing for Scientists and Engineers" by Georg Hager and Gerhard Wellein, where the authors discuss a wide variety of problems and solutions such as,
But if I were handed a piece of code that was even remotely complex (i.e., more than nested for-loops), I would have a very hard time discovering what the bottleneck was, or proving that the code had reached the limits of a given piece of hardware.
By analogy with medicine, I can currently list a bunch of possible diseases that make people "less efficient", but that is hardly useful. I need to figure out how to diagnose my "patients" and then prescribe a "cure".
Could I please be referred to literature that teaches how to diagnose HPC problems (efficiency, scalability, etc.)? Ideally almost a step-by-step guide: put stethoscope to chest, then listen, ...
This question is really two questions: one is how to find bottlenecks, the other is how to know the limits of your hardware and whether you have reached them.
For the first: run the code inside a profiler. Any profiler with a "top down" view of your code by time spent will show you the bottlenecks.
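To make that concrete, here is a toy C++ program with a deliberate hotspot (the function names are invented for illustration); any sampling profiler run on it should put `slow_kernel` at the top of its time breakdown, which is exactly the signal you are looking for in real code:

```
// hotspot.cpp -- a toy program with a deliberate bottleneck (names invented
// for illustration). A "top down" profile should rank slow_kernel first.
#include <cmath>
#include <cstdio>
#include <vector>

// Deliberately expensive: this is where almost all the time goes.
double slow_kernel(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v)
        s += std::sqrt(std::fabs(x)) * std::sin(x);
    return s;
}

// Cheap by comparison: should barely register in the profile.
void cheap_setup(std::vector<double>& v) {
    double x = 0.0;
    for (double& e : v) { e = x; x += 0.001; }
}

int main() {
    std::vector<double> v(1 << 20);
    cheap_setup(v);

    double total = 0.0;
    for (int iter = 0; iter < 200; ++iter)  // hot loop
        total += slow_kernel(v);

    std::printf("%f\n", total);  // keep the result live so nothing is optimized away
    return 0;
}
```

On Linux, for example, building with `g++ -O2 -g` and running under `perf record` / `perf report` (or building with `-pg` and using gprof) will show the time concentrated in `slow_kernel`; the same workflow applied to your real code points you at its bottleneck.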
Try the profilers suggested here (the answer applies to C++ and Fortran): Good profiler for Fortran and MPI - both Allinea MAP and HPC Toolkit have the sort of presentation you need. (NB: I work for Allinea.)
The second question is the more open-ended part; that one needs your book or an optimization guide. A good start, however, is to see how much vectorization you have (some of the profilers above can show this), as this is where most of the compute power is to be found.
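A cheap way to check vectorization, besides the profiler, is to ask the compiler for its vectorization report. As a rough illustration (the file name is made up, and exact diagnostics vary between compilers), consider two loops, one with independent iterations and one with a loop-carried dependence:

```
// vec_check.cpp -- two loops: one the compiler can vectorize, one it usually cannot.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    // Independent iterations: a good vectorization candidate.
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + 1.0;

    // Loop-carried dependence: each iteration needs the previous result,
    // so the compiler will normally refuse to vectorize it.
    for (int i = 1; i < n; ++i)
        c[i] = c[i - 1] + a[i];

    std::printf("%f\n", c[n - 1]);
    return 0;
}
```

Building with GCC's `-O3 -fopt-info-vec-optimized -fopt-info-vec-missed` (or Clang's `-O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize`) reports which loops were vectorized and which were missed, and why; that tells you whether there is vector performance still on the table.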
The bigger question is what the theoretical limit of your problem is, e.g. some problems are not amenable to vectorization, some have memory access patterns that can never be cache friendly, and some have communication needs that are simple whereas others require costly regular global updates.
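To make the "theoretical limit" idea concrete, here is a back-of-the-envelope roofline-style estimate for a STREAM-triad-like kernel. The peak flop rate and memory bandwidth below are invented placeholder numbers, so substitute the figures for your own machine:

```
// roofline_estimate.cpp -- back-of-the-envelope check of whether a kernel is
// compute-bound or memory-bound. Peak numbers are placeholders, not measurements.
#include <algorithm>
#include <cstdio>

int main() {
    // Machine characteristics (illustrative values, not any particular CPU):
    const double peak_flops = 500e9;   // 500 GFlop/s
    const double peak_bw    = 50e9;    // 50 GB/s memory bandwidth

    // Kernel: a(i) = b(i) + s*c(i)  (STREAM triad)
    // Per iteration: 2 flops; 2 loads + 1 store of doubles = 24 bytes
    // (ignoring write-allocate traffic for simplicity).
    const double flops_per_iter = 2.0;
    const double bytes_per_iter = 24.0;
    const double intensity = flops_per_iter / bytes_per_iter;  // flop/byte

    // Roofline: attainable performance is the lower of the two ceilings.
    const double attainable = std::min(peak_flops, intensity * peak_bw);
    std::printf("arithmetic intensity = %.3f flop/byte\n", intensity);
    std::printf("attainable           = %.1f GFlop/s (peak %.1f)\n",
                attainable / 1e9, peak_flops / 1e9);
    return 0;
}
```

With those placeholder numbers the attainable rate is only a few GFlop/s against a 500 GFlop/s peak, i.e. the kernel is memory-bound: once you are streaming at full memory bandwidth you have reached the limit of the hardware for that kernel, and more vectorization cannot help. That is the kind of reasoning that tells you whether a low fraction of peak is a bug in your code or a property of your problem.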