Are C++20 new atomic_flag features supported in g++ / gcc?

According to cppreference, C++20 has rich (and, to me, useful) support for atomic_flag operations.

However, it's not clear whether GCC supports these features yet; they're not listed anywhere in GNU's feature summary. I'm currently using version 8, with -std=c++2a set.

This code doesn't compile with GCC8:

#include <atomic>

int main() {
  std::atomic_flag myFlag = ATOMIC_FLAG_INIT;
  myFlag.test();
}

error: ‘struct std::atomic_flag’ has no member named ‘test’

I don't want to destabilize my build environment by installing a newer version of g++, and would be grateful to anyone who can report on the support for atomic_flag in version 10 or higher.

Solution

atomic<bool> does everything atomic_flag does, just as efficiently on all normal C++ implementations. C++20 just added new member functions to atomic_flag (such as test()) to bring it up to the level of atomic<bool>. atomic_flag is guaranteed to be lock-free, but in practice, on all platforms anyone cares about, so is atomic<bool>.
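To make that equivalence concrete, here is a minimal sketch of the atomic_flag operations written in terms of atomic<bool>. BoolFlag is a hypothetical name made up for illustration, not a standard type, and the mapping shown is just the obvious one (exchange / store / load). It compiles without any C++20 library support.

```cpp
#include <atomic>

// Hypothetical wrapper (not part of the standard library): the atomic_flag
// interface, including the C++20 test(), expressed with atomic<bool>.
class BoolFlag {
    std::atomic<bool> value_{false};  // starts out clear

public:
    // Like atomic_flag::test_and_set(): set the flag, return the old value.
    bool test_and_set(std::memory_order mo = std::memory_order_seq_cst) noexcept {
        return value_.exchange(true, mo);
    }

    // Like atomic_flag::clear().
    void clear(std::memory_order mo = std::memory_order_seq_cst) noexcept {
        value_.store(false, mo);
    }

    // Like the C++20 atomic_flag::test(): read without modifying.
    bool test(std::memory_order mo = std::memory_order_seq_cst) const noexcept {
        return value_.load(mo);
    }

    // The standard does not guarantee this for atomic<bool>; it is an
    // assumption that holds on mainstream platforms, so check it if it matters.
    bool is_lock_free() const noexcept { return value_.is_lock_free(); }
};
```

In real code you would normally just use std::atomic<bool> directly; the wrapper only exists to show which member corresponds to which.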

Don't expect GCC8 to have all the C++2a features; at least try it on https://godbolt.org/ with the latest release or a nightly build of GCC. (Also note that it's not the compiler proper that needs to support this, just the standard library headers. But libstdc++ is normally distributed with g++.)
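Since this is a library feature, code can also ask the headers directly instead of guessing from the compiler version. A minimal sketch, assuming the standard feature-test macro __cpp_lib_atomic_flag_test, which <atomic> defines when atomic_flag::test() is available. The Flag alias and the is_set / set_flag helpers are hypothetical names for illustration, and falling back to atomic<bool> is just one possible choice.

```cpp
#include <atomic>

#if defined(__cpp_lib_atomic_flag_test)
// The library provides the C++20 atomic_flag::test() member.
using Flag = std::atomic_flag;
inline bool is_set(const Flag &f) noexcept {
    return f.test(std::memory_order_acquire);
}
inline void set_flag(Flag &f) noexcept {
    f.test_and_set(std::memory_order_release);
}
#else
// Older library (e.g. the one shipped with GCC 8): use atomic<bool> instead.
using Flag = std::atomic<bool>;
inline bool is_set(const Flag &f) noexcept {
    return f.load(std::memory_order_acquire);
}
inline void set_flag(Flag &f) noexcept {
    f.store(true, std::memory_order_release);
}
#endif
```

Callers only use Flag, is_set() and set_flag(), so the same source builds against both old and new standard libraries.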

I tweaked your example so it could be compiled with optimization enabled without optimizing away the actual work.

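#include <atomic>

int flagtest(std::atomic_flag &myFlag) {
  //std::atomic_flag myFlag = ATOMIC_FLAG_INIT;
  return myFlag.test();
}

// atomic<bool> counterpart, for comparison (booltest in the asm output)
int booltest(std::atomic<bool> &myFlag) {
  return myFlag.load();
}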

On the Godbolt compiler explorer with GCC and clang: GCC10.2 doesn't support the new C++20 atomic_flag::test() member function, but the GCC nightly trunk build does. Clang 11.0 and trunk do; clang 10.0.1 doesn't.

# GCC trunk for x86-64 -O3 -std=gnu++2a
flagtest(std::atomic_flag&):
        movzx   eax, BYTE PTR [rdi]
        ret
booltest(std::atomic<bool>&):
        movzx   eax, BYTE PTR [rdi]
        test    al, al
        setne   al
        movzx   eax, al                # this is weird, GCC has gone insane.
        ret

With clang, we can also try libc++ (the LLVM project's implementation of the C++ standard library). By default, clang on Linux (including Godbolt) uses libstdc++, like GCC does.

# clang 11.0 -O3 -std=gnu++2a -stdlib=libc++
flagtest(std::__1::atomic_flag&):
        mov     al, byte ptr [rdi]
        movzx   eax, al
        and     eax, 1
        ret
booltest(std::__1::atomic<bool>&):
        mov     al, byte ptr [rdi]
        movzx   eax, al
        and     eax, 1
        ret

So that's weird and horrible; even if the value in memory might not be booleanized, there's no reason to merge into the low byte of RAX with a byte mov and then movzx eax,al. Just do a movzx load in the first place! (Clang does have a tendency to be reckless with x86 false dependencies in general, but usually it at least saves a byte by using mov instead of movzx, if not a whole xor-zeroing instruction. But here it's costing an extra instruction.)

But and eax,1 is much less bad than GCC's insane test/setne/movzx, if it thinks it needs to re-booleanize. (It doesn't actually need to do that; the ABI guarantees that a bool in memory is an actual 0 or 1 byte, and atomic<bool> uses the same object-representation as bool.)

So with clang, both ways have stupid missed-optimizations converting to int. With GCC for some reason atomic_flag doesn't suffer that problem, but I wouldn't recommend using it just for that reason. Hopefully atomic<bool> will get fixed, and normally you don't convert bool to int.

Normal uses of atomic<bool> or atomic_flag, like branching on it, should not have any of these missed optimizations. e.g.

int g0, g1;
int conditional_load(std::atomic<bool> &myFlag) {
    return myFlag ? g0 : g1;
}

# gcc 11 nightly build -O3
conditional_load(std::atomic<bool>&):
        movzx   eax, BYTE PTR [rdi]
        test    al, al
        mov     eax, DWORD PTR g0[rip]
        cmove   eax, DWORD PTR g1[rip]
        ret

So that's pretty normal. Clang chooses to select between the two addresses, then load once. That puts the load-use latency on the critical path and takes more instructions; it's the worse choice when both vars are adjacent and so probably come from the same cache line. (GCC's choice always touches both vars, which could be worse if one of them could otherwise stay "cold" in cache.)