Tags: arm, clang, bit-manipulation, micro-optimization, neon

Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?


The simple test,

unsigned f(unsigned long long x) {
    return __builtin_popcountll(x);
}

when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os, results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words in x in parallel, then add the results.
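For reference, the "classic" popcount the compiler falls back to is roughly the usual bit-twiddling (SWAR) sequence below, applied to each 32-bit half. This is a sketch of the idea, not the exact instruction-for-instruction output:

unsigned popcount32_classic(unsigned v) {
    v = v - ((v >> 1) & 0x55555555u);                  /* sums of adjacent bit pairs */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* sums within each nibble    */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* sums within each byte      */
    return (v * 0x01010101u) >> 24;                    /* add the four byte sums     */
}

unsigned popcount64_classic(unsigned long long x) {
    return popcount32_classic((unsigned)x) + popcount32_classic((unsigned)(x >> 32));
}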

It seems to me from skimming the architecture manuals that NEON code similar to that generated for

#include <arm_neon.h>
unsigned f(unsigned long long x) {
    uint8x8_t v = vcnt_u8(vcreate_u8(x));
    return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}

should have been beneficial in terms of size at least, even if not necessarily a performance improvement.

Why doesn’t Clang do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but very briefly.) Is Clang codegen for AArch32 lacking in care and attention, seeing as almost every modern mobile device uses AArch64? (That seems far-fetched, but GCC, for example, is known to occasionally have bad codegen on non-prominent architectures such as PowerPC or MIPS.)

The Clang options could be wrong or redundant; adjust as necessary.
GCC doesn’t seem to do this in my experiments either, just emitting a call to __popcountdi2, but that might simply mean I’m invoking it wrong.

Solution

  • Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it?

    Well, you asked exactly the right question.
    In short, yes, they are. Moving data between the NEON unit and the ARM core (and vice versa) is slow, and in most cases the penalty outweighs any performance gain from using the 'fast' NEON instructions.

    In detail: NEON is an optional co-processor in ARMv7-based chips. The ARM CPU and the NEON unit work in parallel and, one might say, 'independently' of each other.
    The interaction between the CPU and the NEON co-processor is organised via a FIFO: the CPU places a NEON instruction into the FIFO, and the NEON co-processor fetches and executes it. The delay comes at the point where the CPU and NEON need to synchronise with each other, i.e. when they access the same memory region or transfer data between their register files.

    So the whole process of using vcnt would be something like the following (see the annotated intrinsic version after the list):

    • the ARM CPU places vcnt into the NEON FIFO
    • the data is moved from a CPU register into a NEON register
    • the NEON unit fetches vcnt from the FIFO
    • the NEON unit executes vcnt
    • the result is moved from a NEON register back to a CPU register
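
    To make the mapping concrete, here is the intrinsic version from the question again, annotated with which steps stay inside the NEON unit and which ones cross between the register files (a sketch; the exact instruction scheduling is up to the compiler):

    #include <arm_neon.h>

    unsigned popcount64_neon(unsigned long long x) {
        uint8x8_t v = vcnt_u8(vcreate_u8(x));   /* ARM -> NEON transfer, then per-byte popcount */
        uint16x4_t s16 = vpaddl_u8(v);          /* pairwise widening adds, all inside NEON      */
        uint32x2_t s32 = vpaddl_u16(s16);
        uint64x1_t s64 = vpaddl_u32(s32);
        return (unsigned)vget_lane_u64(s64, 0); /* NEON -> ARM transfer of the final sum        */
    }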

    And all that time, the CPU is simply waiting while NEON does its work.
    Due to NEON pipelining, the delay might be up to 20 cycles (if I remember that number correctly).

    Note: "up to 20 cycles" is only a rough upper bound, since if the ARM CPU has other instructions that do not depend on the result of the NEON computation, it can keep executing them in the meantime.

    Conclusion: as a rule of thumb, it is not worth it unless you manually optimise the code to reduce or eliminate those sync delays (one way to amortise the transfer cost is sketched at the end of this answer).

    PS: This is true for ARMv7. In ARMv8 the NEON extension is part of the core, so the penalty described above is not relevant there.
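
    As an illustration of that rule of thumb, the transfer cost can be amortised when there is more than one value to count. The helper below is a hypothetical sketch (the name and the assumption that the length is a multiple of 16 are mine): it keeps the data and the partial sums in NEON registers for a whole buffer and crosses back to the ARM core only once at the end.

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: len is assumed to be a multiple of 16. */
    uint64_t popcount_buffer(const uint8_t *p, size_t len)
    {
        uint64x2_t total = vdupq_n_u64(0);
        for (size_t i = 0; i < len; i += 16) {
            uint8x16_t bytes = vld1q_u8(p + i);   /* load 16 bytes                    */
            uint8x16_t cnt   = vcntq_u8(bytes);   /* per-byte popcount, stays in NEON */
            /* widen 8 -> 16 -> 32 -> 64 bits and accumulate, still in NEON registers */
            total = vaddq_u64(total, vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(cnt))));
        }
        /* single NEON -> ARM transfer of the result */
        return vgetq_lane_u64(total, 0) + vgetq_lane_u64(total, 1);
    }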