Tags: clang, simd, arm64, micro-optimization, sve

Generate FMOV without inline assembly


I want to:

  • Move a 64-bit value from a GPR to the lower 64 bits of a vector register
  • Do an operation (specifically bdep or bext)
  • Move the lower 64 bits of the vector register back to a GPR
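
For readers unfamiliar with the operation in the second step, here is a minimal scalar sketch of what bit deposit and bit extract compute on a single 64-bit element. The function names bit_deposit and bit_extract are made up for this illustration and are not part of ACLE; the operand order (data first, mask second) is chosen to match svbdep_u64 and svbext_u64.

```cpp
#include <cstdint>

// Illustrative helpers only (hypothetical names, not ACLE intrinsics).

// Scatter the low bits of src to the positions of the set bits in mask.
constexpr std::uint64_t bit_deposit(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set mask bit
        if (src & bit) result |= lowest;
        mask &= mask - 1;                                  // clear that mask bit
    }
    return result;
}

// Gather the bits of src selected by mask into the low bits of the result.
constexpr std::uint64_t bit_extract(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set mask bit
        if (src & lowest) result |= bit;
        mask &= mask - 1;                                  // clear that mask bit
    }
    return result;
}

// The mask selects bits 1, 3 and 4.
static_assert(bit_deposit(0b101, 0b11010) == 0b10010, "deposit example");
static_assert(bit_extract(0b10110, 0b11010) == 0b101, "extract example");
```

The two functions are inverses on the masked bits: extracting with a mask and depositing the result with the same mask gives back src & mask.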

This doesn't seem to be possible using ACLE intrinsics.

This is the closest I can get using intrinsics: https://godbolt.org/z/brjG6fe38

    const auto vec = svbdep_u64(svdup_n_u64(a), svdup_n_u64(b));
    return svlastb_u64(svptrue_b64(), vec);

which Clang compiles to

    foo(unsigned long, unsigned long):
            mov     z0.d, x0
            ptrue   p0.d
            mov     z1.d, x1
            bdep    z0.d, z0.d, z1.d
            lastb   x0, p0, z0.d
            ret

The compiler is able to replace dup with mov, which is great. However, it still generates ptrue and lastb, which is wasteful since I only need the lower 64 bits of the result. An fmov would do just fine.

Am I missing something, or is this basic operation not supported by ACLE intrinsics?


Solution

  • It turns out there is a portable solution, so the non-portable workaround from Peter Cordes is not necessary:

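    // Needs SVE2 with the bit-permute extension (e.g. -march=armv8-a+sve2-bitperm).
    // This header also makes the arm_neon.h and arm_sve.h intrinsics available.
    #include <arm_neon_sve_bridge.h>

    uint64_t foo(uint64_t a, uint64_t b) {
        // Broadcast both scalars, then deposit the low bits of a at the set bits of b.
        const auto vec = svbdep_u64(svdup_n_u64(a), svdup_n_u64(b));
        // View the low 128 bits as a NEON vector and read lane 0.
        return vgetq_lane_u64(svget_neonq_u64(vec), 0);
    }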
    

    See https://github.com/ARM-software/acle/issues/374#issuecomment-2568181600 for more context.

    Godbolt: https://godbolt.org/z/d69zjGMEE
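
Since the question also mentions bext, here is a sketch of the same lane-0 extraction applied to svbext_u64, with a scalar fallback so the file still builds on targets without the bit-permute extension. The names bext_low64, bext_scalar, bext_low64_smoke_test and the macro HAVE_SVE2_BITPERM_BRIDGE are hypothetical, introduced only for this example. The generated code for the bext variant was not checked here, so treat the assumption that it lowers like the bdep version as unverified.

```cpp
#include <cassert>
#include <cstdint>

// HAVE_SVE2_BITPERM_BRIDGE is a hypothetical macro name for this sketch.
#if defined(__ARM_FEATURE_SVE2_BITPERM) && defined(__has_include)
#  if __has_include(<arm_neon_sve_bridge.h>)
#    include <arm_neon_sve_bridge.h>
#    define HAVE_SVE2_BITPERM_BRIDGE 1
#  endif
#endif
#ifndef HAVE_SVE2_BITPERM_BRIDGE
#  define HAVE_SVE2_BITPERM_BRIDGE 0
#endif

// Scalar fallback: gather the bits of src selected by mask into the low bits.
constexpr std::uint64_t bext_scalar(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set mask bit
        if (src & lowest) result |= bit;
        mask &= mask - 1;                                  // clear that mask bit
    }
    return result;
}

// Bit-extract a by mask b, using SVE2 when the build target supports it.
inline std::uint64_t bext_low64(std::uint64_t a, std::uint64_t b) {
#if HAVE_SVE2_BITPERM_BRIDGE
    const svuint64_t vec = svbext_u64(svdup_n_u64(a), svdup_n_u64(b));
    // Read lane 0 through the NEON view instead of using svlastb.
    return vgetq_lane_u64(svget_neonq_u64(vec), 0);
#else
    return bext_scalar(a, b);
#endif
}

// Small self-check; the mask 0b11010 selects bits 1, 3 and 4.
inline void bext_low64_smoke_test() {
    assert(bext_low64(0b10110, 0b11010) == 0b101);
    assert(bext_low64(0xABCD, 0xFF00) == 0xAB);
}
```

The fallback is selected at compile time, so a build for an SVE2 bit-permute target contains only the intrinsic path.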