I am trying to find a good way to compare and update NEON register values loaded from an "unsigned short" array. Because this is part of a large project, I can't share the actual code, so here is a similar example that shows the problem.
C++ Implementation:
unsigned short *values = new unsigned short[8];
for (int i = 0; i < 8; i++) {
    if (values[i] > 255) {
        values[i] = 255;
    }
}
Assembly Implementation:
MOV W3, #255
UMOV W2, V4.H[0]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[0], W2
UMOV W2, V4.H[1]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[1], W2
UMOV W2, V4.H[2]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[2], W2
UMOV W2, V4.H[3]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[3], W2
UMOV W2, V4.H[4]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[4], W2
UMOV W2, V4.H[5]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[5], W2
UMOV W2, V4.H[6]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[6], W2
UMOV W2, V4.H[7]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[7], W2
I know this is a poor assembly implementation for this scenario. Is it possible to perform this task with fewer instructions? I couldn't find much documentation on compare-and-update instructions for NEON.
Any good ideas would be highly appreciated. Thank you.
As others pointed out, you can use UMIN (or VMIN in 32-bit NEON). Here is a sample implementation using NEON intrinsics, which works for both 32-bit and 64-bit NEON:
#include <stdint.h>
#include <arm_neon.h>
void clamp8(uint16_t values[8])
{
    uint16x8_t v = vld1q_u16(values);
    uint16x8_t x255 = vdupq_n_u16(255);
    uint16x8_t clamped = vminq_u16(v, x255);
    vst1q_u16(values, clamped);
}
This compiles to the following arm64 NEON code:
ldr q0, [x0]
movi v1.2d, #0xff00ff00ff00ff
umin v0.8h, v0.8h, v1.8h
str q0, [x0]
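If the real array is longer than 8 elements, the same idea can be applied in a loop. The following is only a sketch under some assumptions: the function name clamp_u16_to_255 and the length parameter n are hypothetical (they are not part of the code above), and the scalar loop handles both the leftover elements when n is not a multiple of 8 and builds where NEON is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (name and signature are assumptions for illustration):
// clamps every element of values[0..n) to at most 255, in place.
void clamp_u16_to_255(uint16_t *values, std::size_t n)
{
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Vector part: process 8 unsigned 16-bit lanes per iteration.
    const uint16x8_t limit = vdupq_n_u16(255);
    for (; i + 8 <= n; i += 8) {
        uint16x8_t v = vld1q_u16(values + i);
        vst1q_u16(values + i, vminq_u16(v, limit)); // per-lane unsigned min
    }
#endif
    // Scalar part: remaining elements (or all of them without NEON).
    for (; i < n; ++i) {
        if (values[i] > 255) {
            values[i] = 255;
        }
    }
}
```

Calling clamp_u16_to_255(values, 8) should behave like clamp8(values) above; for other lengths, the vector loop covers whole groups of 8 and the scalar loop finishes the rest.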