c++unsigned intrinsics integer-overflow avx2

overflow instead of saturation on 16bit add AVX2

I want to add 2 unsigned vectors using AVX2

__m256i i1 = _mm256_loadu_si256((__m256i *) si1);
__m256i i2 = _mm256_loadu_si256((__m256i *) si2);

__m256i result = _mm256_adds_epu16(i2, i1);

however I need to have overflow instead of saturation that _mm256_adds_epu16 does to be identical with the non-vectorized code, is there any solution for that?

Solution

Use normal binary wrapping _mm256_add_epi16 instead of saturating adds.

Two's complement and unsigned addition/subtraction are the same binary operation, that's one of the reasons modern computers use two's complement. As the asm manual entry for vpaddw mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn't mention signedness at all, so is less helpful at clearing up this confusion.)

Compares like _mm_cmpgt_epi32 are sensitive to signedness, but math operations (and cmpeq) aren't.

The intrinsics names Intel chose might look like they're for signed integers specifically, but they always use epi or si for things that work equally on signed and unsigned elements. But no, epu implies a specifically unsigned thing, while epi can be specifically signed operations or can be things that work equally on signed or unsigned. Or things where signedness is irrelevant.

For example, _mm_and_si128 is pure bitwise. _mm_srli_epi32 is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that's _mm_srai_epi32 (shift right arithmetic by immediate). Shuffles like _mm_shuffle_epi32 just move data around in chunks.

Non-widening multiplication like _mm_mullo_epi16 and _mm_mullo_epi32 are also the same for signed or unsigned. Only the high-half _mm_mulhi_epu16 or widening multiplies _mm_mul_epu32 have unsigned forms as counterparts to their specifically signed epi16/32 forms.

That's also why 386 only added a scalar integer imul ecx, esi form, not also a mul ecx, esi, because only the FLAGS setting would differ, not the integer result. And SIMD operations don't even have FLAGS outputs.

The intrinsics guide unhelpfully describes _mm_mullo_epi16 as sign-extending and producing a 32-bit product, then truncating to the low 32-bit. The asm manual for pmullw also describes it as signed that way, it seems talking about it as the companion to signed pmulhw. (And has some bugs, like describing the AVX1 VPMULLW xmm1, xmm2, xmm3/m128 form as multiplying 32-bit dword elements, probably a copy/paste error from pmulld)

And sometimes Intel's naming scheme is limited, like _mm_maddubs_epi16 is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for pmaddubsw to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness so if they have to pick one, side, I guess it makes sense to name it for the output, with the signed saturation that can happen with some inputs, like for pmaddwd.