c++, gcc, performance-testing, trigonometry, callgrind

Should I trust profiling inside or outside of callgrind for a function that calls glibc's sin()?


I'm working on an audio library in which the sine of a number needs to be calculated within a very tight loop. Various levels of inaccuracy in the results might be tolerable for the user depending on their goals and environment, so I'm providing the ability to pick between a few sine approximations with differing accuracy and speed characteristics. One of these shows as ~31% faster than glibc's sin() when running under callgrind, but ~2% slower when running outside of it if the library is compiled with -O3 and ~25% slower if compiled with -Ofast. Should I trust callgrind or the "native" results, in terms of designing the library's interface?

My gut instinct is to distrust callgrind and go with the wall-clock results, because that's what really matters in the end anyway. However, I'm worried that what I'm seeing is caused by something particular about my processor (i7-7700k), compiler (gcc 10.2.0) or other aspects of my environment (Arch Linux, kernel v5.9.13) that might not carry over for other users. Is there any chance that callgrind is showing me something "generally true", even if it's not quite true for me specifically?

The relative performance differences of the in-library sine implementations stay the same in and outside of callgrind; only the apparent performance of glibc's sin() differs. These patterns hold with variable amounts of work and across repeated runs. Interestingly, with -O1 the relative performance differences are comparable inside and outside of callgrind, but not with -O0, -O2, -O3, or -Ofast.

The input to glibc's sin() is in many ways a good case for it: it's a double that is always <= 2π, and is never subnormal, NaN, or infinite. This makes me wonder if glibc's sin() might be calling my CPU's fsin instruction some of the time, as Intel's documentation says it's reasonably accurate for arguments < ~3π/4 (see Intel 64 and IA-32 Architectures Software Developer's Manual: Vol. 1, pg. 8-22). If that is the case, it seems possible that the Valgrind VM would have significantly different performance characteristics for that instruction, since in theory less attention might be paid to it during development than to more frequently-used instructions. However, I've read the C source for the current Linux x86-64 implementation of sin() in glibc and I don't recall seeing anything like that, nor do I see it in the callgrind disassembly (it seems to be doing its work "manually" using general-purpose AVX instructions). I've heard that glibc used to use fsin years ago, but my understanding is that they stopped because of its accuracy issues.

The only place I've found discussion of anything along the lines of what I'm seeing is an old thread on the GCC mailing list, but although it was interesting to look over I didn't notice anything there that clarified this (and I'd be wary about taking information from 2012 at face value anyway).


Solution

  • When you run a program under Callgrind or any other tool in the Valgrind family, it is disassembled on the fly into an intermediate representation. That intermediate representation is then instrumented and translated back to the native instruction set.

    The profiling figures that Callgrind and Cachegrind give you are figures for the simplified processors they model. Because they have no detailed model of a modern CPU's pipeline, their results will not accurately reflect differences in actual performance: they can capture effects on the order of "this function executes 3x more instructions than that function", but not "this instruction sequence can be executed with higher instruction-level parallelism".

    One of the most important factors when computing sin-like functions in a loop is allowing the computation to be vectorized: on x86, SSE2 offers a 2x vectorization factor for double and 4x for float. The compiler can achieve that most easily if you supply inlinable, branchless approximate functions, although it is also possible with new enough glibc and GCC (via the libmvec vectorized math library), provided you pass a large subset of the -ffast-math flags to GCC.

    If you haven't seen it already: Arm's optimized-routines repository has a number of modern vectorizable implementations of several functions, including sin/cos in both single and double precision.

    P.S. sin should never return a zero result for a tiny but non-zero argument. When x is close to zero, sin(x) and x differ by less than x*x*x, so as x approaches zero, x itself becomes the closest representable number to sin(x).