Search code examples
gccbuilt-in

What's the sense of gcc's `__builtin_popcountll`?


I implemented bitsets (bitmaps) in C. Assuming that __builtin_popcountll is a highly efficient implementation I used that to count bits instead of doing my own implementation.

However when debugging the program it looks as if __builtin_popcountll is using some loop (and not what I had expected: Assembly instructions). Actually when profiling my test program with gprof, __popcountdi2 consumed 12% of the total CPU, while the "code around" using it took "0%".

So I wonder: What is the use of such builtin when it's seemingly so inefficient?

Platform is x86_64 (AMD EPYC 7401P) using gcc 4.8.5.


Solution

  • Per https://stackoverflow.com/a/52161813/6607497 the code generated by __builtin_popcountll depends on the setting of gcc's option -march=CPU-TYPE:

    • For Intel CPUs the first to support the SSE4a popcnt instruction is CPU-TYPE corei7
    • For AMD CPUs it's CPU-TYPE amdfam10 or barcelona.
    • Using CPU-TYPE native will also enable it if the current CPU knows the instruction.

    When set properly (i.e.: the popcnt instruction is known by the CPU), gcc creates corresponding inline assembly instructions; otherwise it will call __popcountdi2 (from libgcc2) that counts the bits performing a loop (one iteration per byte).