I'm implementing a NEON version for arm64 of an algorithm I made.
The problem I'm facing is:
- How to unpack an int8x16_t into two int16x8_t, meaning that the bytes are kind of "cast" to shorts?
- How to pack these two int16x8_t back into an int8x16_t?
The reason I am trying to do this is to apply operations on a pair of vectors of shorts without overflowing, and finally pack the result back into an int8x16_t.
Here is my SSE2 implementation for this problem:
SSE2 unpacking:
__m128i a1 = _mm_srai_epi16(_mm_unpacklo_epi8(input, input), 8);
__m128i a2 = _mm_srai_epi16(_mm_unpackhi_epi8(input, input), 8);
SSE2 packing:
__m128i output = _mm_packs_epi16(a1, a2);
You can do it e.g. like this with intrinsics:
#include <stdint.h>
#include <arm_neon.h>
void func(int8_t *buf) {
    int8x16_t vec = vld1q_s8(buf); // load 16x int8_t
    int16x8_t short1 = vmovl_s8(vget_low_s8(vec)); // sign-extend the low 8x int8_t to int16_t
    int16x8_t short2 = vmovl_s8(vget_high_s8(vec)); // sign-extend the high 8x int8_t to int16_t
    short1 = vaddq_s16(short1, short1); // Do operation on int16
    short2 = vaddq_s16(short2, short2);
    vec = vcombine_s8(vmovn_s16(short1), vmovn_s16(short2)); // Narrow back to int8_t (vmovn_s16 truncates, it does not saturate) and combine the two vectors
    vst1q_s8(buf, vec); // Store
}
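One difference from the SSE2 version is worth pointing out: _mm_packs_epi16 saturates to the int8_t range, while vmovn_s16 keeps only the low byte of each lane. If saturation is what you want, vqmovn_s16 is the saturating narrow. Below is a minimal sketch of that variant. The function name double_saturating is just an illustrative choice, and the scalar branch is an assumed fallback so the same file still builds on targets without NEON; it is not part of the original code.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: doubles 16 int8_t values in place,
 * clamping each result to [-128, 127] instead of wrapping. */
void double_saturating(int8_t *buf) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int8x16_t vec = vld1q_s8(buf);                  /* load 16x int8_t */
    int16x8_t lo = vmovl_s8(vget_low_s8(vec));      /* sign-extend low half */
    int16x8_t hi = vmovl_s8(vget_high_s8(vec));     /* sign-extend high half */
    lo = vaddq_s16(lo, lo);                         /* operate in 16 bits */
    hi = vaddq_s16(hi, hi);
    /* vqmovn_s16 narrows with signed saturation, like _mm_packs_epi16 */
    vec = vcombine_s8(vqmovn_s16(lo), vqmovn_s16(hi));
    vst1q_s8(buf, vec);                             /* store 16x int8_t */
#else
    /* Scalar fallback (assumption, for non-NEON builds): same per-lane result. */
    for (int i = 0; i < 16; ++i) {
        int16_t wide = (int16_t)buf[i];             /* widen with sign extension */
        wide = (int16_t)(wide + wide);              /* cannot overflow int16_t */
        if (wide > INT8_MAX) wide = INT8_MAX;       /* clamp high */
        if (wide < INT8_MIN) wide = INT8_MIN;       /* clamp low */
        buf[i] = (int8_t)wide;
    }
#endif
}
```

With this variant an input of 100 becomes 127 rather than wrapping around to -56, which matches what the SSE2 packing step would give.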