
gprof declares too much as “spontaneous” with Eigen


I have a C++ project which uses the Eigen matrix library. In order to improve performance I need to get a profile for it. I have tried using gprof, but the profile contains a bunch of results like this, where it is marked as “spontaneous”:

                                                 <spontaneous>
[1]     48.8 2535.09 38010.25                 GaugeField::read_lime_gauge_field_doubleprec_timeslices(double*, char const*, long, long) [1]
             20857.12    0.00 3419496363/5297636514     Eigen::internal::gebp_kernel<std::complex<double>, std::complex<double>, long, Eigen::internal::blas_data_mapper<std::complex<double>, long, 0, 0>, 1, 4, false, false>::operator()(Eigen::internal::blas_data_mapper<std::complex<double>, long, 0, 0> const&, std::complex<double> const*, std::complex<double> const*, long, long, long, std::complex<double>, long, long, long, long) [2]
             5844.01 11309.11 3350517373/3366570904     Eigen::internal::gebp_kernel<std::complex<double>, std::complex<double>, long, Eigen::internal::blas_data_mapper<std::complex<double>, long, 0, 0>, 1, 4, true, false>::operator()(Eigen::internal::blas_data_mapper<std::complex<double>, long, 0, 0> const&, std::complex<double> const*, std::complex<double> const*, long, long, long, std::complex<double>, long, long, long, long) [4]

Sometimes even the calls into Eigen themselves are marked as spontaneous.

I spend 85 % of the run time in parts that are marked as spontaneous. This is not much use, because I already know that the calls into Eigen in my tensor contraction code will be the most expensive part. What I need to know is which parts of my code these calls come from.

Is there some way to make gprof extract more useful information from my program?


Solution

  • It seems that there is no real way around this with plain gprof short of removing all optimizations, which would of course skew the performance measurements and make them useless.

    I ended up writing a little bit of code that tracks the calls in a hierarchical manner. The program now has objects like TimingScope<3>("Diagram::request"); defined at the top of functions. Each of these registers itself as an ongoing function call with a singleton; its destructor then stops the timer, adds the elapsed time to the corresponding node and edge, and subtracts it from the self time of the parent. A Python script generates output just like gprof2dot gives.

    This allowed us to understand our program with even more context than gprof would have given, even if it had worked for us.
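A minimal sketch of such a hierarchical RAII timer might look like the following. The names here (`CallTree`, `enter`, `leave`) and the exact bookkeeping are assumptions, since the original code is not shown; the idea is only that the constructor pushes the scope onto a call stack in a singleton and the destructor books the elapsed time to the current path while deducting it from the parent's self time:

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Singleton holding the current call stack and the accumulated
// self time per call path (hypothetical name and interface).
class CallTree {
public:
    static CallTree &instance() {
        static CallTree tree;
        return tree;
    }
    void enter(const std::string &name) { stack_.push_back(name); }
    void leave(double seconds) {
        self_[path()] += seconds;   // book elapsed time to this node
        stack_.pop_back();
        if (!stack_.empty())
            self_[path()] -= seconds;  // deduct it from the parent's self time
    }
    const std::map<std::string, double> &self_times() const { return self_; }

private:
    std::string path() const {
        std::string p;
        for (const auto &n : stack_) p += "/" + n;
        return p;
    }
    std::vector<std::string> stack_;
    std::map<std::string, double> self_;
};

// RAII scope timer: registers the call on construction, stops the
// clock and reports to the singleton on destruction. The Level
// parameter could be used to compile out fine-grained scopes.
template <int Level>
class TimingScope {
public:
    explicit TimingScope(const std::string &name)
        : start_(std::chrono::steady_clock::now()) {
        CallTree::instance().enter(name);
    }
    ~TimingScope() {
        double s = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - start_).count();
        CallTree::instance().leave(s);
    }

private:
    std::chrono::steady_clock::time_point start_;
};
```

A function is then instrumented simply by placing `TimingScope<3> t("Diagram::request");` as its first statement; nested scopes produce nested paths such as `/Diagram::request/Eigen-contraction`, which a script can render like gprof2dot output.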