ARM NEON Intrinsics: Limit values of a vector to 0-255

Say I have an int16x8_t vector. I want to limit the range of its values to 0-255 and convert it to a uint8x8_t vector. Reading the vector into an array and clamping it the traditional non-intrinsic way is way too slow. Is there a faster way?

Solution

All you need is a single instruction: vqmovun.s16, or vqmovun_s16 as an intrinsic.

Vector Saturating(q) Move Unsigned Narrow

int16x8_t input;
uint8x8_t result;
/* ... load or compute input here ... */

result = vqmovun_s16(input);

Any negative element is replaced with 0 and any element greater than 255 is set to 255; the results are then narrowed to unsigned 8-bit elements. All of this happens in a single instruction (the actual cycle count depends on the core), which is EXACTLY what you need.
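To make the per-lane behaviour concrete, here is a minimal sketch, not a definitive implementation. The helper names saturate_s16_to_u8 and narrow8_s16_to_u8 are made up for illustration. The scalar fallback is only a model of what vqmovun_s16 does to each lane, included so the snippet also builds on non-ARM machines; the NEON path assumes the compiler defines __ARM_NEON (or the older __ARM_NEON__).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical scalar model of one lane of vqmovun_s16: clamp to [0, 255]. */
static inline uint8_t saturate_s16_to_u8(int16_t v)
{
    if (v < 0)   return 0;    /* negative values saturate to 0 */
    if (v > 255) return 255;  /* values above 255 saturate to 255 */
    return (uint8_t)v;
}

/* Hypothetical helper: narrow 8 signed 16-bit values to 8 unsigned bytes. */
static inline void narrow8_s16_to_u8(const int16_t src[8], uint8_t dst[8])
{
    assert(src && dst);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t input  = vld1q_s16(src);       /* load 8 x int16 */
    uint8x8_t result = vqmovun_s16(input);   /* saturate and narrow */
    vst1_u8(dst, result);                    /* store 8 x uint8 */
#else
    /* Portable fallback that mirrors the instruction lane by lane. */
    for (int i = 0; i < 8; ++i)
        dst[i] = saturate_s16_to_u8(src[i]);
#endif
}
```

Calling narrow8_s16_to_u8 on the values -300, -1, 0, 1, 128, 255, 256, 32767 should leave in-range values untouched and pin the rest to 0 or 255.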

There is also vqmovn_s16, which saturates to the signed 8-bit range (-128 to 127) and returns an int8x8_t.
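For comparison, a similar sketch of the signed variant. The helper name narrow8_s16_to_s8 is hypothetical, and the scalar branch is only an assumed model of the per-lane saturation so the code can be tried without NEON hardware.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: narrow 8 signed 16-bit values to 8 signed bytes,
   saturating each lane to [-128, 127]. */
static inline void narrow8_s16_to_s8(const int16_t src[8], int8_t dst[8])
{
    assert(src && dst);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t input  = vld1q_s16(src);      /* load 8 x int16 */
    int8x8_t  result = vqmovn_s16(input);   /* signed saturating narrow */
    vst1_s8(dst, result);                   /* store 8 x int8 */
#else
    /* Portable fallback that mirrors the instruction lane by lane. */
    for (int i = 0; i < 8; ++i) {
        int16_t v = src[i];
        if (v < -128) v = -128;
        if (v > 127)  v = 127;
        dst[i] = (int8_t)v;
    }
#endif
}
```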

PS: Are you working on YUV to RGB conversion? That's the one time I needed this instruction.