What does `vaddhn_high_s16` actually do?

There is a vaddhn_high_s16 intrinsic for arm64.

The official ARM documentation for this intrinsic is here. However, the description and pseudocode given there both confuse me.

Can anyone explain what vaddhn_high_s16 does using practical C/C++ code?

For example, assuming all datatypes are defined, the vmulq_f32 intrinsic can be explained with this implementation:

float32x4_t vmulq_f32(float32x4_t a, float32x4_t b)
{
    float32x4_t r;
    for (int i=0; i<4; i++)
    {
        r[i] = a[i] * b[i];
    }
    return r;
}

Solution

The documentation of the underlying addhn2 instruction in the ARMv8 Architecture Reference Manual helps clarify things. This is usually a good resource for questions about intrinsics.

The main purpose, of course, is to add 16-bit values and keep only the high 8 bits of each result. The addhn2 form writes the result to the top 8 bytes of a SIMD register, with the low 8 bytes remaining unchanged. Since C is pass-by-value and "modify in place" isn't easy to represent in a C function, the intrinsic has you pass the desired low bytes as an argument, which pass through into the low bytes of the return value; the high bytes of the return value contain the result of the addition.

So you could express it as:

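int8x16_t vaddhn_high_s16(int8x8_t r, int16x8_t a, int16x8_t b) {
    int8x16_t ret;
    // Low 8 lanes: passed through unchanged from r.
    for (int i = 0; i < 8; i++)
        ret[i] = r[i];
    // High 8 lanes: the high byte of each 16-bit sum a[i] + b[i].
    for (int i = 0; i < 8; i++)
        ret[i+8] = (int8_t)((a[i] + b[i]) >> 8);
    return ret;
}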
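
If you want to experiment with this on a machine without NEON, the same logic can be sketched with plain arrays. This is only an illustrative model, not the real intrinsic: the name addhn_high_s16_model and the array-based signature are made up for this example, and it assumes the usual two's-complement behaviour when a value above 127 is converted to int8_t. The sum is computed in uint16_t so the wraparound of the 16-bit lanes is explicit.

```c
#include <stdint.h>

/* Hypothetical scalar model of vaddhn_high_s16 (not part of arm_neon.h).
 * out[0..7]  = r[0..7], passed through unchanged
 * out[8..15] = high byte of the 16-bit wrapping sum a[i] + b[i]
 */
static void addhn_high_s16_model(int8_t out[16], const int8_t r[8],
                                 const int16_t a[8], const int16_t b[8])
{
    /* Low half: copy the caller-supplied bytes. */
    for (int i = 0; i < 8; i++)
        out[i] = r[i];

    /* High half: add modulo 2^16, then keep bits 15..8 of each sum. */
    for (int i = 0; i < 8; i++) {
        uint16_t sum = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
        /* Assumes conversion to int8_t wraps (true on mainstream compilers). */
        out[i + 8] = (int8_t)(uint8_t)(sum >> 8);
    }
}
```

For instance, with a[i] = 0x1234 and b[i] = 0x0101 the sum is 0x1335, so the corresponding output byte is 0x13; with a[i] = 0x7FFF and b[i] = 1 the sum is 0x8000 and the output byte is 0x80, i.e. -128 as an int8_t. In real NEON code the r argument typically comes from vaddhn_s16 applied to another pair of vectors, so that two narrowing additions fill one int8x16_t.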