The following is a minimal reproducible example of code that I use to generate an 'array' (whose 1-byte elements are packed into the resulting uint_fast64_t) of 3D coordinates within octree branches, given the x, z and y positions:
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
*coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
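For reference, here is a quick self-contained check of what the routine produces; the function is repeated so the snippet stands alone, and the main() driver and sample values are mine, added purely for illustration:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    *coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
int main(void) {
    uint_fast64_t coord;
    /* bit i of x ends up in bit 0 of byte i, z in bit 1, y in bit 2,
       so every output byte holds one 3-bit octree child index */
    test(&coord, 0xFF, 0x00, 0x00);
    printf("%" PRIxFAST64 "\n", coord); /* prints 101010101010101 */
    test(&coord, 0x01, 0x01, 0x01);
    printf("%" PRIxFAST64 "\n", coord); /* prints 7 */
    return 0;
}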
Looking at the assembly, GCC appears to generate only one 'variant' of the m constant, but three variants of the a constant, including 0x404040404040404 and 0x202020202020202.
test:
movabs rax, 567382630219905 ; 0x2040810204081
movzx edx, dl
movzx esi, sil
movzx ecx, cl
movabs r8, 144680345676153346 ; 0x202020202020202
imul rdx, rax
imul rsi, rax
imul rcx, rax
movabs rax, 289360691352306692 ; 0x404040404040404
add rdx, rdx
and rdx, r8
movabs r8, 72340172838076673 ; 0x101010101010101
and rsi, r8
sal rcx, 2
or rdx, rsi
and rcx, rax
or rdx, rcx
mov QWORD PTR [rdi], rdx
ret
For whatever reason, GCC appears to be 'constant propagating' the << 1 and << 2 into these masks and storing them separately, when the same mask could just be used by doing the and first and then the bit shift. This is what is confusing.
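In source terms, the assembly above corresponds roughly to the following hand-written variant (my own back-translation of GCC's output; the function name is just a label for this sketch):
#include <stdint.h>
/* the products are shifted first, so the mask has to be pre-shifted as well,
   which is where the extra 0x202... and 0x404... constants come from */
void test_gcc_equivalent(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL;
    *coord = (x * m      & 0x101010101010101ULL)
           | (z * m << 1 & 0x202020202020202ULL)
           | (y * m << 2 & 0x404040404040404ULL);
}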
Clang, on the other hand, fully propagates the bit shifts into the constants, so the assembly contains six 64-bit constants but no shift operations corresponding to the << 1 and << 2. This seems to be a speed optimization at the cost of size.
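In source terms, Clang's output corresponds roughly to this sketch (again my own back-translation; the function name is just a label, but the six constants are the ones implied by folding the shifts into both the multiplier and the mask):
#include <stdint.h>
/* shifts folded into both constants: six distinct 64-bit immediates, no shift instructions */
void test_clang_equivalent(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    *coord = (x * 0x2040810204081ULL & 0x101010101010101ULL)
           | (z * 0x4081020408102ULL & 0x202020202020202ULL)
           | (y * 0x8102040810204ULL & 0x404040404040404ULL);
}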
But I am perplexed by GCC's handling. Some of the constants are 'folded' and some are not, and the way they are folded does not seem to provide any perceptible benefit.
My question is:
Why does GCC fold the shifts into the mask constants and apply the and mask later, even at the expense of storing extra constants in the code?
Why does it not do the and first and shift afterwards, to avoid storing those constants?
This is one of those cases when 'The compiler will optimize the code, just forget about it.' does not really work, since this 'optimization' itself is what I find questionable.
I know that 16 bytes of opcode is 'not much', but I am still curious as to why GCC performs this 'optimization' despite it seeming like a loss to an untrained eye. And this even happens with aggressive size optimizations.
I can only speculate that the GCC code generator is simply programmed to always compute bit masks relative to the final positions, even when that means the number of bit masks grows.
GCC has other heuristics as well, like reduction of immediates by 1 when comparing with inequality: if (a < 2) is transformed into if (a <= 1), which does not make sense if one also needs to compute if (a == 2) for some other use.
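To illustrate the kind of situation meant here (a made-up sketch, not taken from any real codebase):
#include <stdint.h>
/* if the first comparison is canonicalised to a <= 1, the two branches compare
   against different immediates (1 and 2), whereas keeping a < 2 would let both
   comparisons reuse the same constant 2 */
int classify(uint64_t a) {
    if (a < 2)      /* may become a <= 1 */
        return 0;
    if (a == 2)
        return 1;
    return 2;
}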
One can anyway prevent GCC and clang from taking some optimisations by an optimisation barrier asm("" : "+r"(a)) -- combined with having the constants as non-const variables. The barrier informs the compiler that the register containing a is somehow modified by the asm statement, meaning that a no longer contains the constant. Subsequently a << 1 and a << 2 are also no longer immediates derivable from a.
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    asm("" : "+r"(a)); /* barrier: a is no longer a known constant */
    uint_fast64_t xm = x * m & a;
    uint_fast64_t ym = y * m & a;
    uint_fast64_t zm = z * m & a;
    *coord = xm | (zm << 1) | (ym << 2);
}
In this particular case one can apparently also use
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    *coord = (x * m & a) + (z * m & a) * 2 + (y * m & a) * 4;
}
which compiles to
test:
movabs r8, 567382630219905
movzx ecx, cl
movzx edx, dl
movabs rax, 72340172838076673
imul rcx, r8
movzx esi, sil
imul rdx, r8
imul rsi, r8
and rcx, rax
add rcx, rcx
and rdx, rax
add rcx, rdx
and rsi, rax
add rcx, rcx
add rcx, rsi
mov QWORD PTR [rdi], rcx
ret
In this case I would have actually expected the lea rax, [rax + 4*rbx] format to be used instead of two separate add rcx, rcx instructions to left-shift by 1, since that accumulates into a longer dependency chain than necessary.