The following is a minimal reproducible example of code that I use to generate an 'array' (whose 1-byte elements are packed into the resulting uint_fast64_t) of 3D coordinates within octree branches, given the x, z and y positions:
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
*coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
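For reference, here is a quick self-contained check of what the routine produces; the function is repeated so the snippet stands alone, and the main() driver and sample values are mine, added purely for illustration:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    *coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
int main(void) {
    uint_fast64_t coord;
    /* bit i of x ends up in bit 0 of byte i, z in bit 1, y in bit 2,
       so every output byte holds one 3-bit octree child index */
    test(&coord, 0xFF, 0x00, 0x00);
    printf("%" PRIxFAST64 "\n", coord); /* prints 101010101010101 */
    test(&coord, 0x01, 0x01, 0x01);
    printf("%" PRIxFAST64 "\n", coord); /* prints 7 */
    return 0;
}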
Looking at the assembly, GCC appears to generate only one 'variant' of the m constant, but three variants of the a constant, including 0x404040404040404 and 0x202020202020202.
test:
movabs rax, 567382630219905 ; 0x2040810204081
movzx edx, dl
movzx esi, sil
movzx ecx, cl
movabs r8, 144680345676153346 ; 0x202020202020202
imul rdx, rax
imul rsi, rax
imul rcx, rax
movabs rax, 289360691352306692 ; 0x404040404040404
add rdx, rdx
and rdx, r8
movabs r8, 72340172838076673 ; 0x101010101010101
and rsi, r8
sal rcx, 2
or rdx, rsi
and rcx, rax
or rdx, rcx
mov QWORD PTR [rdi], rdx
ret
For whatever reason, GCC appears to be 'constant propagating' the << 1 and << 2 into these masks and storing them separately, when the same mask could just be used by doing the and first and then the bit shift. This is what is confusing.
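In source terms, the assembly above corresponds roughly to the following hand-written variant (my own back-translation of GCC's output; the function name is just a label for this sketch):
#include <stdint.h>
/* the products are shifted first, so the mask has to be pre-shifted as well,
   which is where the extra 0x202... and 0x404... constants come from */
void test_gcc_equivalent(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL;
    *coord = (x * m      & 0x101010101010101ULL)
           | (z * m << 1 & 0x202020202020202ULL)
           | (y * m << 2 & 0x404040404040404ULL);
}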
Clang, on the other hand, fully propagates the bit shifts into the constants, so the assembly contains six 64-bit constants but no shift operations corresponding to the << 1 and << 2. This seems to be a speed optimization at the cost of size.
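In source terms, Clang's output corresponds roughly to this sketch (again my own back-translation; the function name is just a label, but the six constants are the ones implied by folding the shifts into both the multiplier and the mask):
#include <stdint.h>
/* shifts folded into both constants: six distinct 64-bit immediates, no shift instructions */
void test_clang_equivalent(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    *coord = (x * 0x2040810204081ULL & 0x101010101010101ULL)
           | (z * 0x4081020408102ULL & 0x202020202020202ULL)
           | (y * 0x8102040810204ULL & 0x404040404040404ULL);
}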
But I am perplexed by GCC's handling. Some of the constants are 'folded' and some are not, and the way they are folded does not seem to provide any perceptible benefit.
My question is:
Why does GCC fold the shifts into the mask constants and apply the and mask later, even at the expense of storing extra constants in the code?
Why does it not do the and first and shift afterwards, to avoid storing those constants?
This is one of those cases when 'The compiler will optimize the code, just forget about it.' does not really work, since this 'optimization' itself is what I find questionable.
I know that 16 bytes of opcode is 'not much', but I am still curious as to why GCC performs this 'optimization' despite it seeming like a loss to an untrained eye. And this even happens with aggressive size optimizations.
I can only speculate that the GCC code generator is simply programmed to always compute bit masks relative to the final positions, even when that means the number of bit masks grows.
GCC has other heuristics as well, like reduction of immediates by 1 when comparing with inequality: if (a < 2) is transformed into if (a <= 1), which does not make sense if one also needs to compute if (a == 2) for some other use.
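To illustrate the kind of situation meant here (a made-up sketch, not taken from any real codebase):
#include <stdint.h>
/* if the first comparison is canonicalised to a <= 1, the two branches compare
   against different immediates (1 and 2), whereas keeping a < 2 would let both
   comparisons reuse the same constant 2 */
int classify(uint64_t a) {
    if (a < 2)      /* may become a <= 1 */
        return 0;
    if (a == 2)
        return 1;
    return 2;
}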
One can anyway prevent GCC and clang from taking some optimisations by an optimisation barrier asm("" : "+r"(a)) -- combined with having the constants as non-const variables. The barrier informs the compiler that the register containing a is somehow modified by the asm statement, meaning that a no longer contains the constant. Subsequently a << 1 and a << 2 are also no longer immediates derivable from a.
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    asm("" : "+r"(a)); /* barrier: a is no longer a known constant */
    uint_fast64_t xm = x * m & a;
    uint_fast64_t ym = y * m & a;
    uint_fast64_t zm = z * m & a;
    *coord = xm | (zm << 1) | (ym << 2);
}
In this particular case one can apparently also use
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    *coord = (x * m & a) + (z * m & a) * 2 + (y * m & a) * 4;
}
which compiles to
test:
movabs r8, 567382630219905
movzx ecx, cl
movzx edx, dl
movabs rax, 72340172838076673
imul rcx, r8
movzx esi, sil
imul rdx, r8
imul rsi, r8
and rcx, rax
add rcx, rcx
and rdx, rax
add rcx, rdx
and rsi, rax
add rcx, rcx
add rcx, rsi
mov QWORD PTR [rdi], rcx
ret
In this case I would have actually expected the lea rax, [rax + 4*rbx] format to be used instead of two separate add rcx, rcx instructions to left-shift by 1, since that accumulates into a longer dependency chain than necessary.