
What type of analysis do compilers perform to spot opportunities for replacing entire chunks of code with calls to builtins?


I'm not even sure I've phrased the question using the appropriate terminology, but below is what I mean (and here is the complete example on Compiler Explorer).

Take this code, which computes the number of set bits in the input int:

int foo(int a) {
    int count = 0;
    while (a) {
        ++count;
        a &= a - 1;
    }
    return count;
}

Clang, when targeting the Haswell architecture, compiles it down to just one popcntl instruction (even though some versions add a redundant instruction):

foo(int):                                # @foo(int)
        popcntl %edi, %eax
        retq

But why can't it do the same with the "dumber" version of it below?¹

int foo(int a) {
    int count = 0;
    while (a) {
        count += 1 & a;
        a >>= 1;
    }
    return count;
}

This compiles to

foo(int):                                # @foo(int)
        xor     eax, eax
        test    edi, edi
        je      .LBB0_2
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     ecx, edi
        and     ecx, 1
        add     eax, ecx
        sar     edi
        jne     .LBB0_1
.LBB0_2:
        ret

which maps fairly well to the C++ snippet.

I would claim that the latter code (both C++ and its corresponding assembly) is closer to the human approach.

Still, the same compiler, with the same options, doesn't see that it's doing the same thing.

Why is that?

  • Is the compiler just applying a sort of pattern matching on the source code against a map of known idioms?

    This approach would mean that the compiler is essentially re-applying the human approach of reading the code and concluding what it does: it goes to its list of builtins and picks the one that does the same thing. If it doesn't succeed on the second snippet, that just means whoever wrote the compiler took the first snippet into account and not the second one (rightfully so!).

  • Or is the compiler just applying successive simplifications/reworkings to the snippet of code, which happen to end up in the best-case scenario for the former snippet but not for the latter?

    If that's the case, then what is the crux of the difference between the two snippets that allows a happy ending or not?


¹ The only way I've verified that the two versions compute the same thing is by checking that the following main returns 0 (with one of the two snippets renamed to bar):

int main() {
    for (auto i = 0; i != 100000; ++i) {
        if (foo(i) != bar(i)) {
            return 1;
        }
    }
    return 0;
}

Solution

  • The point at which LLVM simplifies your first function is the LoopDeletionPass (see https://godbolt.org/z/z1GT7Kx9q), which turns the entire loop into:

    %0 = call i32 @llvm.ctpop.i32(i32 %a)
    

    You can see that this optimization pass only runs after many others have been applied. Usually, one optimization enables further optimizations, which in turn enable yet more. This is repeated until a fixed limit of passes is reached.

    After further optimization passes, you are left with the IR:

    define dso_local noundef i32 @foo(int)(i32 noundef %a) local_unnamed_addr {
    entry:
      %0 = icmp eq i32 %a, 0
      %1 = tail call i32 @llvm.ctpop.i32(i32 %a)
      %spec.select = select i1 %0, i32 0, i32 %1
      ret i32 %spec.select
    }
    

    Essentially, the compiler recognizes certain patterns (such as a loop that does a particular thing) and turns them into a set of instructions that does the same, but better (such as a call to an LLVM intrinsic function). When applied to whole loops, this is called loop idiom recognition; the analogous technique on short straight-line instruction sequences is known as peephole optimization.
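    Popcount is not the only such pattern. Another idiom that mainstream compilers recognize is the bitwise rotate; a minimal sketch (the function name rotl is mine, not from the original question), which Clang and GCC typically compile to a single rol instruction on x86:

    ```cpp
    #include <cstdint>

    // Rotate-left idiom. Masking the shift amounts with 31 keeps both
    // shifts in range, so the expression is well-defined for every r.
    std::uint32_t rotl(std::uint32_t x, unsigned r) {
        r &= 31;
        return (x << r) | (x >> ((32 - r) & 31));
    }
    ```

    As with popcount, nothing in the source mentions rotation: the compiler simply matches the shift-and-or pattern against an idiom it was taught.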

    How to ensure pattern recognition

    Whether an optimization can be performed often boils down to whether the compiler developers have bothered to specify a pattern for it. Which patterns are recognized might therefore seem arbitrary.

    If you don't want to put blind faith in the compiler, use intrinsics such as __builtin_popcount, or the C++20 function std::popcount.
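    A minimal sketch of both options (std::popcount requires C++20 and an unsigned argument; __builtin_popcount is a GCC/Clang extension taking an unsigned int):

    ```cpp
    #include <bit>      // std::popcount (C++20)

    // Portable: guaranteed by the standard to count set bits,
    // and typically compiled to a single popcnt instruction.
    int count_bits(unsigned a) {
        return std::popcount(a);
    }

    // GCC/Clang-specific builtin; the int argument converts to unsigned.
    int count_bits_builtin(int a) {
        return __builtin_popcount(a);
    }
    ```

    Either way, you state the intent directly instead of hoping a loop idiom gets recognized.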