I have a NEON function doing some comparisons:
inline bool all_ones(int32x4_t v) noexcept
{
    v = ~v;
    ::std::uint32_t r;
    auto high(vget_high_s32(int32x4_t(v)));
    auto low(vget_low_s32(int32x4_t(v)));
    asm volatile ("VSLI.I32 %0, %1, #16" : "+w"(high), "+w"(low));
    asm volatile ("VCMP.F64 %0, #0" : "=w"(high));
    asm volatile ("VMRS %0, FPSCR" : "=r"(r) : "w"(high));
    return r & (1 << 30);
}
Each of the 4 int components of v can only be all ones or all zeros. If all 4 components are all ones, the function returns true, and false otherwise. The return part expands into 3 instructions, which is a lot for me. Is there a better way to return the Z flag?
EDIT: After long, hard pondering, the above could be replaced by:
inline bool all_ones(int32x4_t const v) noexcept
{
    return int32_t(-1) == int32x2_t(
        vtbl2_s8(
            int8x8x2_t{
                int8x8_t(vget_low_s32(int32x4_t(v))),
                int8x8_t(vget_high_s32(int32x4_t(v)))
            },
            int8x8_t{0, 4, 8, 12}
        )
    )[0];
}
Unfortunately, NEON has no mask extraction instruction (nothing equivalent to SSE's MOVMSKPS).
You really don't want to mix NEON with VFP if you can avoid it.
I suggest:
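#include <arm_neon.h>

bool all_ones(int32x4_t v) {
    int32x2_t l = vget_low_s32(v), h = vget_high_s32(v);
    // Pairwise unsigned minimum: {min(v0, v1), min(v2, v3)}
    uint32x2_t m = vpmin_u32(vreinterpret_u32_s32(l),
                             vreinterpret_u32_s32(h));
    // Second pass leaves the minimum of all four lanes in lane 0
    m = vpmin_u32(m, m);
    return vget_lane_u32(m, 0) == 0xffffffff;
}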
If you're really sure the only non-zero value will be 0xffffffff, then you can drop the comparison and return the lane value directly, since any non-zero value converts to true. Compiled standalone it might have a couple of unnecessary operations, but when it's inlined the compiler should fix that.
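To see why two pairwise minima are enough, here is a minimal portable sketch that models the reduction in scalar C++, so it can be checked on any machine. The names `Vec2`, `pmin_u32_model` and `all_ones_model` are hypothetical stand-ins for illustration, not NEON intrinsics; the model assumes `vpmin_u32(a, b)` yields `{min(a0, a1), min(b0, b1)}`.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical scalar stand-in for a 2-lane uint32x2_t vector.
using Vec2 = std::array<std::uint32_t, 2>;

// Models vpmin_u32: lane 0 is the minimum of a's pair,
// lane 1 is the minimum of b's pair.
inline Vec2 pmin_u32_model(Vec2 a, Vec2 b)
{
    return {std::min(a[0], a[1]), std::min(b[0], b[1])};
}

// Mirrors the NEON version: two pairwise minima reduce four lanes to one.
inline bool all_ones_model(std::array<std::uint32_t, 4> v)
{
    Vec2 low{v[0], v[1]};
    Vec2 high{v[2], v[3]};
    Vec2 m = pmin_u32_model(low, high);  // {min(v0, v1), min(v2, v3)}
    m = pmin_u32_model(m, m);            // lane 0 = min of all four lanes
    // The unsigned minimum is 0xffffffff only if every lane is 0xffffffff.
    return m[0] == 0xffffffffu;
}
```

Because 0xffffffff is the largest unsigned 32-bit value, a single lane holding anything else pulls the minimum below it, which is what makes the final comparison sufficient.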