Here is my code for adding all int16x4 element in a lane:
#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4
And I tried to add all int32x4 element in a lane.
but my code looks inefficient:
#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1=vpaddlq_s32(accl);
temp = (int)vgetq_lane_s64(accl_1,0);
temp2 = (int)vgetq_lane_s64(accl_1,1);
sum1=temp+temp2;
printf("%d\n", sum);// 4
Is there simply and clearly way to do this? I hope the LLVM assembly code is simply and clearly after compile it. and I also hope the final type of sum
is 32 bits.
I used ellcc cross-compiler base on LLVM compiler infrastructure to compile it.
I saw the similar question(Add all elements in a lane) on stackoverflow, but the intrinsic addv
doesn't work on my host.
If you only want a 32-bit result, presumably either intermediate overflow is unlikely or you simply don't care about it, in which case you could just stay 32-bit all the way:
int32x2_t temp = vadd_s32(vget_high_s32(accl), vget_low_s32(accl));
int32x2_t temp2 = vpadd_s32(temp, temp);
int32_t sum1 = vget_lane_s32(temp2, 0);
However, using 64-bit accumulation isn't actually any more hassle, and can also be done without dropping out of NEON - it's just a different order of operations:
int64x2_t temp = vpaddlq_s32(accl);
int64x1_t temp2 = vadd_s64(vget_high_s64(temp), vget_low_s64(temp));
int32_t sum1 = vget_lane_s32(temp2, 0);
Either of those boils down to just 3 NEON instructions and no scalar arithmetic. The crucial trick on 32-bit ARM is that a pairwise add of two halves of a Q register is simply a normal add of two D registers - that doesn't apply to AArch64 where the SIMD register layout is different, but then AArch64 has the aforementioned horizontal addv
anyway.
Now, how horrible any of that looks in LLVM IR I don't know - I suppose it depends on how it treats vector types and operations internally - but in terms of the final ARM machine code both could be considered optimal.