Tags: c++, gcc, optimization, compiler-optimization, auto-vectorization

std::min vs ternary gcc auto vectorization with #pragma GCC optimize ("O3")


I know that "why is my compiler doing this" questions aren't the best kind, but this one is really bizarre to me and I'm thoroughly confused.

I had thought that std::min() was the same as the handwritten ternary (with maybe some compile-time template stuff), and it seems to compile down to the same operation when used normally. However, when I try to get a "min and sum" loop to autovectorize, the two don't behave the same, and I would love it if someone could help me figure out why. Here is a small example that reproduces the issue:

#pragma GCC target ("avx2")
#pragma GCC optimize ("O3")

#include <cstdio>
#include <cstdlib>
#include <algorithm>

#define N (1<<20)
char a[N], b[N];

int main() {
    for (int i=0; i<N; ++i) {
        a[i] = rand()%100;
        b[i] = rand()%100;
    }

    int ans = 0;
    #pragma GCC ivdep
    for (int i=0; i<N; ++i) {
        //ans += std::min(a[i], b[i]);
        ans += a[i]>b[i] ? a[i] : b[i];
    }
    printf("%d\n", ans);
}

I compile this on gcc 9.3.0, with the compilation command g++ -o test test.cpp -ftree-vectorize -fopt-info-vec-missed -fopt-info-vec-optimized -funsafe-math-optimizations.

With the code above as is, the compiler reports this during compilation:

test.cpp:19:17: optimized: loop vectorized using 32 byte vectors

In contrast, if I comment the ternary and uncomment the std::min, I get this:

test.cpp:19:17: missed: couldn't vectorize loop
test.cpp:20:35: missed: statement clobbers memory: _9 = std::min<char> (_8, _7);

So std::min() seems to be doing something unusual that prevents gcc from understanding that it is just a min operation. Is this something that is caused by the standard? Or is it an implementation failure? Or is there some compile flag that would make this work?


Solution

  • Summary: don't use #pragma GCC optimize. Use -O3 on the command line instead, and you'll get the behavior you expect.

    GCC's documentation on #pragma GCC optimize says:

    Each function that is defined after this point is treated as if it had been declared with one optimize(string) attribute for each string argument.

    And the optimize attribute is documented as:

    The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. [...] The optimize attribute should be used for debugging purposes only. It is not suitable in production code. [Emphasis added, thanks Peter Cordes for spotting the last part.]

    So, don't use it.
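
For readers who have not seen the two spellings side by side, the equivalence the manual describes looks roughly like the sketch below. The function names sum_with_pragma and sum_with_attribute are made up for illustration, and the push_options/pop_options pair is only there to limit how far the pragma reaches.

```cpp
// Spelling 1: the pragma applies to every function defined after it,
// here limited to one function by push_options / pop_options.
#pragma GCC push_options
#pragma GCC optimize ("O3")
int sum_with_pragma(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}
#pragma GCC pop_options

// Spelling 2: the attribute applies to this one function only.
// Per the manual, this is how the function above is treated.
__attribute__((optimize("O3")))
int sum_with_attribute(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}
```

Both functions compute the same result; the only point is that the pragma is shorthand for putting the attribute on each function individually.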

    In particular, it looks like specifying #pragma GCC optimize ("O3") at the top of your file is not actually equivalent to using -O3 on the command line. It turns out that the former doesn't result in std::min being inlined, and so the compiler actually does assume that it might modify global memory, such as your a,b arrays. This naturally inhibits vectorization.

    A careful reading of the documentation for __attribute__((optimize)) makes it look like each of the functions main() and std::min() will be compiled as if with -O3. But that's not the same as compiling the two of them together with -O3, as only in the latter case would interprocedural optimizations like inlining be available.

    Here is a very simple example on godbolt. With #pragma GCC optimize ("O3") the functions foo() and please_inline_me() are each optimized, but please_inline_me() does not get inlined. But with -O3 on the command line, it does.
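
The linked snippet is not reproduced on this page, so the following is only a guess at what such a test case might look like. The names foo() and please_inline_me() come from the description above; the function bodies are assumptions.

```cpp
// Hypothetical reconstruction, not the original godbolt code.
#pragma GCC optimize ("O3")

// Small helper that an optimizer would normally inline at -O3.
int please_inline_me(int x) {
    return x * 2 + 1;
}

// Caller: look at the generated assembly to see whether the two
// calls below survive or get folded into foo().
int foo(int x) {
    return please_inline_me(x) + please_inline_me(x + 1);
}
```

To compare, one could compile this as is with g++ -S, then remove the pragma and compile with g++ -O3 -S, and check whether foo() still contains a call to please_inline_me(). The result may differ between GCC versions.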

    A guess would be that the optimize attribute, and by extension #pragma GCC optimize, causes the compiler to treat the function as if its definition were in a separate source file which was being compiled with the specified option. And indeed, if std::min() and main() were defined in separate source files, you could compile each one with -O3 but you wouldn't get inlining.

    Arguably the GCC manual should document this more explicitly, though I guess if it's only meant for debugging, it might be fair to assume it's intended for experts who would be familiar with the distinction.

    If you really do compile your example with -O3 on the command line, you get identical (vectorized) assembly for both versions, or at least I did. (After fixing the backwards comparison: your ternary code is computing max instead of min.)
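
As a sketch of that comparison, here are the two loop bodies with the comparison fixed so that both compute a minimum. The function names sum_min_ternary and sum_min_std are invented for this example, and the arrays are passed as parameters rather than kept as globals.

```cpp
#include <algorithm>
#include <cstddef>

// Sum of element-wise minima, written with the ternary.
// Note the '<': the version in the question used '>' and computed max.
int sum_min_ternary(const char* a, const char* b, std::size_t n) {
    int ans = 0;
    for (std::size_t i = 0; i < n; ++i) {
        ans += a[i] < b[i] ? a[i] : b[i];
    }
    return ans;
}

// The same sum written with std::min.
int sum_min_std(const char* a, const char* b, std::size_t n) {
    int ans = 0;
    for (std::size_t i = 0; i < n; ++i) {
        ans += std::min(a[i], b[i]);
    }
    return ans;
}
```

Compiling with something like g++ -O3 -mavx2 -fopt-info-vec-optimized (no optimize pragma in the file) should let you check whether both loops are reported as vectorized on your GCC version.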