Tags: c++, performance, initialization, x86-64

Why are x86-64 C/C++ compilers not generating more efficient assembly for this code?


Consider the following declaration of local variables:

bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};

On x86-64 architectures, I'd expect the optimizer to reduce the initialization of those variables to something like mov qword ptr [rsp], 0. But instead, every compiler I've been able to try (regardless of optimization level) generates some form of:

mov     byte ptr [rsp + 7], 0
mov     byte ptr [rsp + 6], 0
mov     byte ptr [rsp + 5], 0
mov     byte ptr [rsp + 4], 0
mov     byte ptr [rsp + 3], 0
mov     byte ptr [rsp + 2], 0
mov     byte ptr [rsp + 1], 0
mov     byte ptr [rsp], 0

That seems like a waste of CPU cycles. Using copy-initialization, value-initialization, or replacing the braces with parentheses made no difference (see the variants below).
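For the record, these are the spellings I tried; each still produces the eight separate byte stores:

bool a = false;   // copy-initialization
bool b{};         // value-initialization
bool c(false);    // direct-initialization with parentheses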

But wait, that's not all. Suppose that I have this instead:

struct
{
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
} bools;

Then the initialization of bools generates exactly what I'd expect: mov qword ptr [rsp], 0. What gives?

You can try the code above yourself in this Compiler Explorer link.
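For completeness, here is a self-contained version of both cases. The external use() function is my own addition, there only to force the locals into memory so their stores aren't optimized away:

void use(void*);  // opaque to the optimizer

void separate_locals()
{
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
    use(&a); use(&b); use(&c); use(&d);
    use(&e); use(&f); use(&g); use(&h);
}

void struct_members()
{
    struct
    {
        bool a{false};
        bool b{false};
        bool c{false};
        bool d{false};
        bool e{false};
        bool f{false};
        bool g{false};
        bool h{false};
    } bools;
    use(&bools);  // the whole struct escapes, so all members must be initialized
}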

The behavior of the different compilers is so consistent that I am forced to think there is some reason for the above inefficiency, but I have not been able to find it. Do you know why?


Solution

  • Compilers are dumb; this is a missed optimization. mov qword ptr [rsp], 0 would be optimal. Store forwarding from a qword store to a byte reload of any individual byte is efficient on modern CPUs (https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/). A C++ sketch of that store/reload pattern is at the end of this answer.

    (Or even better, push 0 instead of sub rsp, 8 + mov, also a missed optimization because compilers don't bother looking for cases where that's possible.)


    Presumably the optimization pass that looks for store merging runs before nailing down the locations of locals in the stack frame relative to each other. (Or before even deciding which locals can be kept in registers and which need memory addresses at all.)

    Store merging (aka coalescing) was only reintroduced relatively recently, in GCC 8 IIRC, after being dropped (a regression) in the GCC 2.95 to GCC 3 transition, again IIRC. (I think other optimizations, like assuming no strict-aliasing violations to keep more vars in registers more of the time, were considered more useful.) So it had been missing for decades.

    From one POV, consider yourself lucky you're getting any store merging at all (with struct members and array elements, which are known early to be adjacent; see the snippet at the end of this answer). Of course, from another POV, compilers should ideally make good asm. But in practice missed optimizations are common. Fortunately we have beefy CPUs with wide superscalar out-of-order execution that usually chew through this crap quickly enough to still spot upcoming cache-miss loads and stores early, so wasted instructions often get to execute in the shadow of other bottlenecks. That's not always true, though, and clogging up space in the out-of-order execution window is never a good thing.

    Related: "In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?" covers the general case for constants other than 0, re: what the optimal asm would be. (The difference between array vs. separate locals was only discussed in comments there.)
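    As a rough C++ sketch of the store-forwarding point above (my own illustration; it assumes sizeof(bool) == 1, as on x86-64): one 8-byte store followed by narrow byte reloads is exactly the pattern modern CPUs forward efficiently:

    #include <cstdint>
    #include <cstring>

    bool flags[8];

    void init_all()
    {
        std::uint64_t zero = 0;
        std::memcpy(flags, &zero, sizeof zero);  // one qword store: mov qword ptr [flags], 0
    }

    bool read_one()
    {
        return flags[3];  // byte reload of one element: store-forwards cheaply from the qword store
    }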
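    And to illustrate when store merging does fire (adjacency known early), a minimal example; GCC 8+ and clang typically coalesce these eight byte stores into a single mov qword ptr [rdi], 0:

    void zero8(bool* flags)
    {
        // Adjacent array elements: the compiler can prove these eight
        // byte stores are contiguous, so store merging coalesces them.
        flags[0] = false;  flags[1] = false;
        flags[2] = false;  flags[3] = false;
        flags[4] = false;  flags[5] = false;
        flags[6] = false;  flags[7] = false;
    }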