Tags: fortran, gfortran, blas, gprof

Why doesn't gprof count matmul?


I am profiling my Fortran code using gprof. I have two main subroutines, subroutine A and subroutine B. I run each routine 10 times and then average the runtimes. Both routines make use of matmul, but subroutine B more so.

When I print out the runtimes after building with -fexternal-blas, I get:

Average time taken for routine A ....    0.41080 seconds
Average time taken for routine B ....    2.28760 seconds

When I print out the runtimes without using -fexternal-blas, I get:

Average time taken for routine A ....    0.41930 seconds
Average time taken for routine B ....    7.40090 seconds

so I know that matrix multiplication is a very large cause of the runtime.

When I profile with gprof, however, it tells me that subroutine A takes 42.4% of the time. It gives me 41.12% if I don't link to BLAS, which is not much different.

I have split subroutine A into various smaller routines so that I can benchmark them and find which one takes the longest. I don't think I am getting a correct analysis, though, since I know that matmul is not being included. I would like to include matmul because there are many places where I have to use transpose and reshape, and others where I rely on matrix multiplication when I could do something else. If I find that the matmuls in a certain routine are heavyweight, I may be able to tweak things.


Solution

  • It's because you are not instrumenting matmul

    When you compile with the appropriate flag for gprof (-pg for the GNU compilers), the compiler adds extra instructions to the object file that collect the profiling data you are interested in. This is termed "instrumentation". However, matmul comes from the library of routines that ships with the compiler: it is already compiled and so only comes in at the link stage. Thus no instrumentation is added to it, and nothing for matmul gets reported by gprof.

    If you want to include matmul in the profile, you will have to do something like find the source for the compiler library and the external BLAS, compile them with instrumentation, and link against those rather than the regular libraries.
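
    If rebuilding the libraries is more work than you want, one workaround (a different technique from gprof, not something the profiler does for you) is to time the matmul calls by hand. The following is only a minimal sketch under assumptions: the names timed_matmul_mod, timed_matmul, matmul_seconds and matmul_calls are hypothetical, the matrices are assumed to be double precision and rank 2, and the 500 x 500 size in the demo is arbitrary.

    ```fortran
    module timed_matmul_mod
      use, intrinsic :: iso_fortran_env, only: int64, real64
      implicit none
      private
      public :: timed_matmul, matmul_seconds, matmul_calls

      ! Hypothetical accumulators: total wall-clock time spent in matmul
      ! and the number of wrapped calls.
      real(real64) :: matmul_seconds = 0.0_real64
      integer      :: matmul_calls   = 0

    contains

      ! Drop-in replacement for matmul(a, b) that records how long the call took.
      function timed_matmul(a, b) result(c)
        real(real64), intent(in) :: a(:,:), b(:,:)
        real(real64) :: c(size(a, 1), size(b, 2))
        integer(int64) :: t0, t1, rate

        call system_clock(t0, rate)   ! start tick and ticks per second
        c = matmul(a, b)
        call system_clock(t1)         ! end tick

        matmul_seconds = matmul_seconds + real(t1 - t0, real64) / real(rate, real64)
        matmul_calls   = matmul_calls + 1
      end function timed_matmul

    end module timed_matmul_mod

    program demo
      use, intrinsic :: iso_fortran_env, only: real64
      use timed_matmul_mod
      implicit none
      integer, parameter :: n = 500   ! arbitrary size for the demo
      real(real64), allocatable :: a(:,:), b(:,:), c(:,:)
      integer :: i

      allocate(a(n, n), b(n, n))
      call random_number(a)
      call random_number(b)

      ! Same pattern as in the question: run 10 times, then report.
      do i = 1, 10
        c = timed_matmul(a, b)
      end do

      print '(a, i0, a, f10.5, a)', 'matmul: ', matmul_calls, ' calls, ', &
            matmul_seconds, ' seconds total'
    end program demo
    ```

    Replacing matmul with such a wrapper inside subroutine A and subroutine B would give the total matmul time directly, with or without -fexternal-blas. To see which routine's matmuls are the heavy ones, you could keep one accumulator per calling routine instead of the single global pair used here.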