Does an instruction exist that gathers/extracts the first bit of an int[32]
and stores it into a int
?
I know about the intrinsic pext
but that is not really what I want.
I do have a code for that but I thought maybe there a designated instruction.
ints
array is zero besides the first bit. Ergo, no masking needed.void ints2bits(int &bits, int *ints) {
bits = (ints[0] << 0) + (ints[1] << 1) + ... + (ints[31] << 31);
}
Just tested harold suggestions. It works very well and I can attain nice speed up.
There is no single instruction that can even read that much data, but groups of 4 (8 with AVX2) can be processed quickly with the use of _mm_movemask_ps
. Ignore the fact that it claims to be a floating point instruction, it just gathers and appends the 4 top bits.
Of course moving the bottom bit into the top is easy with _mm_slli_epi32
.
So putting it together (not tested)
int res = 0;
for (int i = 0; i < 32; i += 4) {
__m128i x = _mm_load_si128((__m128i*)&ints[i]); // I assume it's aligned
x = _mm_slli_epi32(x, 31);
int bits = _mm_movemask_ps(_mm_castsi128_ps(x));
res += bits << i;
}
The extension to AVX2 is pretty obvious.
An other possible approach is shifting each lane by a variable amount (pre-AVX2 this requires multiplication) and then summing, first vertically of course, saving the horizontal sum for last. This is likely slower, and certainly more awkward.