Search code examples
armsimdintrinsicsarm64neon

How to extend a int32x2_t to a int32x4_t with NEON intrinsics on clang/AArch64 when you don't care about the new lanes?


Fellow ARMists,
I'd like to narrow and saturate 2 s32 to 2 s16 with NEON code, and pack them in a GPR. I need to conform to a certain API, so please don't discuss efficiency or design here :)
Here's the snippet:

int32x2_t stuff32 = ...;
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, stuff32));
return vget_lane_u32(stuff16, 0)

Which generates

mov    v0.d[1], v0.d[0] 
sqxtn  v0.4h, v0.4s 
fmov   w0, s0 
ret           

Does somebody know a way to keep the type system happy, and have the second half of the d register uninitialized ? I'd like to avoid inline assembly.
Thank you !


Solution

  • I'm not aware of any good solution using general arm_neon.h intrinsics, but at least with Clang, it's possible, using Clang specific builtins, to produe a vector where some elements are set to be undefined, so the codegen doesn't need to fill them with any value in particular.

    A setup that uses that would look like this:

    $ cat test.c
    #include <arm_neon.h>
    
    int32_t narrow_saturate(int32x2_t stuff32) {
      int16x4_t stuff16 = vqmovn_s32(__builtin_shufflevector(stuff32, stuff32, 0, 1, -1, -1));
      return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);     
    }
    
    $ clang -target aarch64-linux-gnu test.c -S -o - -O2
    [...]
    narrow_saturate:
            sqxtn   v0.4h, v0.4s
            fmov    w0, s0
            ret
    

    https://godbolt.org/z/N_NsSE

    See https://clang.llvm.org/docs/LanguageExtensions.html#builtin-shufflevector for documentation on __builtin_shufflevector.

    EDIT: It also seems to be possible to achieve the same with Clang by using an uninitialized variable (although that can generate warnings with `-Wuninitialized):

    $ cat test.c
    #include <arm_neon.h>
    
    int32_t narrow_saturate(int32x2_t stuff32) {
      int32x2_t uninitialized;
      int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, uninitialized));
      return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
    }
    

    Clang produces the same as above for this (https://godbolt.org/z/TzHuon), while GCC still includes a mov v0.8b, v0.8b (https://godbolt.org/z/wZTAU9).