Fellow ARMists,
I'd like to narrow and saturate 2 s32 to 2 s16 with NEON code, and pack them in a GPR.
I need to conform to a certain API, so please don't discuss efficiency or design here :)
Here's the snippet:
int32x2_t stuff32 = ...;
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, stuff32));
return vget_lane_u32(vreinterpret_u32_s16(stuff16), 0);
Which generates
mov v0.d[1], v0.d[0]
sqxtn v0.4h, v0.4s
fmov w0, s0
ret
Does anybody know a way to keep the type system happy while leaving the upper 64 bits of the vector register (v0.d[1]) uninitialized? I'd like to avoid inline assembly.
Thank you!
I'm not aware of any good solution using the general arm_neon.h intrinsics, but at least with Clang it's possible, using Clang-specific builtins, to produce a vector in which some elements are marked as undefined, so the codegen doesn't need to fill them with any particular value.
A setup that uses that would look like this:
$ cat test.c
#include <arm_neon.h>
int32_t narrow_saturate(int32x2_t stuff32) {
int16x4_t stuff16 = vqmovn_s32(__builtin_shufflevector(stuff32, stuff32, 0, 1, -1, -1));
return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
}
$ clang -target aarch64-linux-gnu test.c -S -o - -O2
[...]
narrow_saturate:
sqxtn v0.4h, v0.4s
fmov w0, s0
ret
See https://clang.llvm.org/docs/LanguageExtensions.html#builtin-shufflevector for documentation on __builtin_shufflevector.
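To check that either variant still returns the right bits, a scalar reference can help. The following is only a sketch: the names saturate_s32_to_s16 and narrow_saturate_ref are made up for illustration, and it assumes the usual little-endian lane layout, where lane 0 ends up in bits 0..15 of the GPR and lane 1 in bits 16..31.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value to the signed 16-bit range,
 * which is what SQXTN does for each lane. */
static int16_t saturate_s32_to_s16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical scalar model of narrow_saturate.
 * Assumption: lane 0 lands in the low 16 bits, lane 1 in the high 16 bits.
 * Returns the packed bit pattern as an unsigned value. */
static uint32_t narrow_saturate_ref(int32_t lane0, int32_t lane1) {
    uint32_t lo = (uint16_t)saturate_s32_to_s16(lane0);
    uint32_t hi = (uint16_t)saturate_s32_to_s16(lane1);
    return (hi << 16) | lo;
}
```

Comparing this against the NEON function for a few in-range and out-of-range inputs should show quickly whether the undefined upper lanes leak into the result (they should not, since only lanes 0 and 1 are read back).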
EDIT: It also seems to be possible to achieve the same with Clang by using an uninitialized variable (although that can generate warnings with `-Wuninitialized`):
$ cat test.c
#include <arm_neon.h>
int32_t narrow_saturate(int32x2_t stuff32) {
int32x2_t uninitialized; /* deliberately left unset: the upper lanes are never read back */
int16x4_t stuff16 = vqmovn_s32(vcombine_s32(stuff32, uninitialized));
return vget_lane_s32(vreinterpret_s32_s16(stuff16), 0);
}
Clang produces the same as above for this (https://godbolt.org/z/TzHuon), while GCC still includes a mov v0.8b, v0.8b (https://godbolt.org/z/wZTAU9).
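If the same source has to build with both compilers, one option is to hide the widening step behind a small helper. This is a sketch under assumptions rather than a recommendation: widen_upper_dont_care and narrow_saturate_pair are hypothetical names, the Clang branch uses the __builtin_shufflevector form shown above, and the fallback is simply the vcombine_s32(v, v) from the question, which is fully defined but may cost the extra mov. The whole block is guarded so the file still compiles on non-AArch64 targets.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: widen 2 x s32 to 4 x s32 where only lanes 0 and 1 matter. */
static inline int32x4_t widen_upper_dont_care(int32x2_t v) {
#if defined(__clang__)
    /* Index -1 marks a result lane as "don't care", so no fill is needed. */
    return __builtin_shufflevector(v, v, 0, 1, -1, -1);
#else
    /* Fallback from the question: duplicate the low half into the upper half. */
    return vcombine_s32(v, v);
#endif
}

/* Hypothetical test-friendly wrapper: saturate two s32 values to s16 and
 * return them packed in one 32-bit value (lane 0 in the low half,
 * assuming little-endian lane layout). */
static inline int32_t narrow_saturate_pair(int32_t a, int32_t b) {
    const int32_t in[2] = { a, b };
    int16x4_t narrowed = vqmovn_s32(widen_upper_dont_care(vld1_s32(in)));
    return vget_lane_s32(vreinterpret_s32_s16(narrowed), 0);
}
#endif /* __aarch64__ */
```

The vcombine fallback avoids reading an uninitialized variable, at the price of the redundant move GCC emits.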