c++ gcc arm clang neon

Returning the Z flag under ARM NEON


I have a NEON function doing some comparisons:

inline bool all_ones(int32x4_t v) noexcept
{
  v = ~v;

  ::std::uint32_t r;

  auto high(vget_high_s32(int32x4_t(v)));
  auto low(vget_low_s32(int32x4_t(v)));

  asm volatile ("VSLI.I32 %0, %1, #16" : "+w"(high), "+w"(low));
  asm volatile ("VCMP.F64 %0, #0" : "=w"(high));
  asm volatile ("VMRS %0, FPSCR" : "=r"(r) : "w"(high));

  return r & (1 << 30);
}

Each of the 4 int components of v can only be all ones or all zeros. The function returns true if all 4 components are all ones, and false otherwise. The return part expands into 3 instructions, which seems like a lot to me. Is there a better way to return the Z flag?

EDIT: After long, hard pondering, the above could be replaced by:

inline bool all_ones(int32x4_t const v) noexcept
{
  return int32_t(-1) == int32x2_t(
    vtbl2_s8(
      int8x8x2_t{
        int8x8_t(vget_low_s32(int32x4_t(v))),
        int8x8_t(vget_high_s32(int32x4_t(v)))
      },
      int8x8_t{0, 4, 8, 12}
    )
  )[0];
}

There exists a mask extraction instruction in NEON.
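
To make the table-lookup version easier to follow, here is a portable scalar sketch of what it appears to compute. It assumes VTBL semantics (each output byte is the table byte at the given index, or 0 if the index is out of range) and that the omitted initializers in the index vector are zero. The names tbl2_model and all_ones_tbl_model are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of a VTBL lookup into a 16-byte table:
// out[i] = table[idx[i]] if idx[i] < 16, otherwise 0.
inline std::array<std::uint8_t, 8> tbl2_model(
    const std::array<std::uint8_t, 16>& table,
    const std::array<std::uint8_t, 8>& idx) noexcept
{
  std::array<std::uint8_t, 8> out{};
  for (std::size_t i = 0; i != out.size(); ++i)
  {
    out[i] = idx[i] < table.size() ? table[idx[i]] : std::uint8_t(0);
  }
  return out;
}

// Hypothetical model of the lookup-based all_ones: gather one byte from each
// 32-bit lane (byte offsets 0, 4, 8, 12) into a single 32-bit word and
// compare that word against all ones.
// Assumption: every lane is either all ones or all zeros, so it does not
// matter which byte of a lane is sampled (and byte order is irrelevant).
inline bool all_ones_tbl_model(const std::array<std::int32_t, 4>& v) noexcept
{
  std::array<std::uint8_t, 16> bytes{};
  std::memcpy(bytes.data(), v.data(), bytes.size());

  const std::array<std::uint8_t, 8> idx{0, 4, 8, 12, 0, 0, 0, 0};
  const auto gathered = tbl2_model(bytes, idx);

  std::uint32_t word;
  std::memcpy(&word, gathered.data(), sizeof(word));
  return word == 0xffffffffu;
}
```

Only the first four gathered bytes take part in the comparison; the remaining four all sample byte 0 and are ignored.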


Solution

  • You really don't want to mix NEON with VFP if you can avoid it.

    I suggest:

    bool all_ones(int32x4_t v) {
        int32x2_t l = vget_low_s32(v), h = vget_high_s32(v);
        uint32x2_t m = vpmin_u32(vreinterpret_u32_s32(l),
                                 vreinterpret_u32_s32(h));
        m = vpmin_u32(m, m);
        return vget_lane_u32(m, 0) == 0xffffffff;
    }
    

    If you're really sure the only non-zero value will be 0xffffffff then you can drop the comparison. Compiled standalone it might have a couple of unnecessary operations, but when it's inlined the compiler should fix that.
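
A minimal sketch of the variant without the comparison, assuming every lane really is either 0 or 0xffffffff. The names all_ones_nocmp and all_ones_lanes are hypothetical; the scalar branch is there only so the sketch also builds on targets without NEON.

```cpp
#include <algorithm>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Assumption: each lane of v is either 0 or 0xffffffff.
// Two pairwise unsigned minimums reduce the four lanes to their minimum;
// that minimum is non-zero only if no lane was zero.
inline bool all_ones_nocmp(int32x4_t v) noexcept
{
    uint32x4_t u = vreinterpretq_u32_s32(v);
    uint32x2_t m = vpmin_u32(vget_low_u32(u), vget_high_u32(u));
    m = vpmin_u32(m, m);
    return vget_lane_u32(m, 0) != 0;
}

// Hypothetical convenience wrapper: load four lanes from memory and test them.
inline bool all_ones_lanes(const std::int32_t (&lanes)[4]) noexcept
{
    return all_ones_nocmp(vld1q_s32(lanes));
}
#else
// Scalar fallback computing the same unsigned minimum over the four lanes.
inline bool all_ones_lanes(const std::int32_t (&lanes)[4]) noexcept
{
    std::uint32_t m = 0xffffffffu;
    for (std::int32_t lane : lanes)
    {
        m = std::min(m, static_cast<std::uint32_t>(lane));
    }
    return m != 0;
}
#endif
```

If a lane could hold any other non-zero value, this shortcut would report true incorrectly, so the explicit comparison against 0xffffffff is the safer default.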