
ARM NEON aarch64: How to compare and update neon registers in optimized way?


I am trying to find a good way to compare NEON register values loaded from an "unsigned short" array. Because this is part of a large project, I can't share the actual code, so here is a similar, simplified example that shows the problem.


C++ Implementation:

unsigned short *values = new unsigned short[8];
for(int i=0; i<8; i++){
    if(values[i] > 255){
            values[i] = 255;
    }
}

Assembly Implementation:

MOV W3, #255
UMOV W2, V4.H[0]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[0], W2

UMOV W2, V4.H[1]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[1], W2

UMOV W2, V4.H[2]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[2], W2

UMOV W2, V4.H[3]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[3], W2

UMOV W2, V4.H[4]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[4], W2

UMOV W2, V4.H[5]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[5], W2

UMOV W2, V4.H[6]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[6], W2

UMOV W2, V4.H[7]
CMP W2, #0x00FF
CSEL W2,W3, W2, GT
MOV V4.H[7], W2

I know this is a poor assembly implementation for this scenario. Is it possible to perform this task with fewer instructions? I couldn't find much documentation on this kind of compare-and-update instruction.
Any good ideas would be highly appreciated. Thank you.


Solution

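  • As others pointed out, you can use UMIN (VMIN in 32-bit NEON), which takes the unsigned minimum of each lane, so all eight values are clamped in one instruction. Here is a sample implementation using NEON intrinsics that works for both 32-bit and 64-bit NEON: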

    #include <stdint.h>
    #include <arm_neon.h>
    
    void clamp8(uint16_t values[8])
    {
        uint16x8_t v = vld1q_u16(values);
        uint16x8_t x255 = vdupq_n_u16(255);
        uint16x8_t clamped = vminq_u16(v, x255);
        vst1q_u16(values, clamped);
    }
    

    Compiled for arm64, this produces the following NEON code:

    ldr q0, [x0]
    movi v1.2d, #0xff00ff00ff00ff
    umin v0.8h, v0.8h, v1.8h
    str q0, [x0]
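
If the real array is longer than 8 elements, the same idea can be wrapped in a loop. The following is only a sketch under a few assumptions: the function name `clamp_u16_to_255` and the `count` parameter are hypothetical (they are not part of the answer above), and the scalar loop is there to handle leftover elements and to keep the code compilable on targets without NEON.

```cpp
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original answer): clamps every element
// of values[0..count) to at most 255, in place.
void clamp_u16_to_255(uint16_t *values, size_t count)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    // Vector part: process 8 lanes of 16 bits per iteration.
    const uint16x8_t limit = vdupq_n_u16(255);
    for (; i + 8 <= count; i += 8) {
        uint16x8_t v = vld1q_u16(values + i);        // load 8 values
        vst1q_u16(values + i, vminq_u16(v, limit));  // per-lane unsigned min, store back
    }
#endif

    // Scalar part: remaining 0..7 elements, or the whole array when NEON
    // is not available at compile time.
    for (; i < count; ++i) {
        if (values[i] > 255) {
            values[i] = 255;
        }
    }
}
```

Calling `clamp_u16_to_255(values, 8)` should behave like `clamp8(values)` above. Since the values are already in a register in your assembly (`V4`), the hand-written equivalent would be a single `UMIN` on `V4.8H` against a register holding 255 in every lane.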