The intrinsic

int mask = _mm256_movemask_epi8(__m256i s1)

creates a mask whose 32 bits correspond to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example), I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector in which the most significant bit of each byte contains the corresponding bit of the uint32_t mask.

What is the best way to do this?
Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only a __m256i mask, not a uint32_t. As such, in the resulting __m256i mask, I can ignore the bits other than the MSB of each byte.
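
To illustrate why only the MSB of each mask byte matters, here is a minimal sketch (not part of the original question). The function name blendv_uses_only_msb is hypothetical, and the target attribute assumes GCC or Clang. The mask bytes in the low half are 0x7F (every bit set except the MSB) and in the high half 0x80 (only the MSB set), so the blend takes the second operand only in the high 16 bytes.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: blends a (all 0x00) with b (all 0xFF) under a mask
   whose low 16 bytes are 0x7F and whose high 16 bytes are 0x80, then
   returns the movemask of the blended vector. */
__attribute__((target("avx2")))
static uint32_t blendv_uses_only_msb(void)
{
    const __m256i a = _mm256_setzero_si256();
    const __m256i b = _mm256_set1_epi8(-1);

    /* Low lane: MSB clear, all other bits set. High lane: only MSB set. */
    const __m256i m = _mm256_setr_epi64x(
        0x7F7F7F7F7F7F7F7FLL, 0x7F7F7F7F7F7F7F7FLL,
        (long long)0x8080808080808080ULL, (long long)0x8080808080808080ULL);

    /* VPBLENDVB picks the byte from b where the mask byte's MSB is 1. */
    const __m256i r = _mm256_blendv_epi8(a, b, m);

    return (uint32_t)_mm256_movemask_epi8(r); /* 0xFFFF0000 */
}
```

The seven low bits of each mask byte have no effect on the result, which is why an inverse of the movemask only has to get the MSBs right.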
Here is an alternative to LUT or pdep instructions that might be more efficient:
1. Copy the 32-bit mask to the low bytes of a ymm register and to bytes 16..19 of the same register. You could use a temporary array and _mm256_load_si256. Or you could move a single copy of the 32-bit mask to the low bytes of a ymm register, then broadcast it with VPBROADCASTD (_mm256_broadcastd_epi32) or other broadcast/shuffle instructions.
2. Rearrange the bytes of the register so that each of the low 8 bytes contains the low 8 bits of the mask, each of the next 8 bytes the next 8 bits, and so on. This can be done with VPSHUFB (_mm256_shuffle_epi8) with a control register containing '0' in the low 8 bytes, '1' in the next 8 bytes, etc.
3. Select the proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256).
4. Set the MSB of the appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8). Compare each byte to 0xFF. If you want each bit of the mask toggled, use VPAND in the previous step and compare to zero.

An additional flexibility of this approach is that you can choose a different control register for step #2 and a different mask for step #3 to shuffle the bits of your bit mask (for example, you could copy the mask to a ymm register in reversed order).
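
The steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation: the names inverse_movemask_epi8 and roundtrip are hypothetical, the broadcast uses _mm256_set1_epi32 (which compilers typically lower to a move plus VPBROADCASTD), the VPOR variant is chosen for step 3, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch: byte i of the result is 0xFF if bit i of mask is set, else 0x00. */
__attribute__((target("avx2")))
static __m256i inverse_movemask_epi8(uint32_t mask)
{
    /* Step 1: put a copy of the mask in every dword, so both 128-bit
       lanes hold it (VPSHUFB cannot move bytes across lanes). */
    __m256i v = _mm256_set1_epi32((int)mask);

    /* Step 2: replicate mask byte 0 into bytes 0..7, byte 1 into 8..15,
       byte 2 into 16..23 and byte 3 into 24..31. */
    const __m256i shuffle = _mm256_setr_epi64x(
        0x0000000000000000LL, 0x0101010101010101LL,
        0x0202020202020202LL, 0x0303030303030303LL);
    v = _mm256_shuffle_epi8(v, shuffle);

    /* Step 3: OR with a constant that has every bit set except bit (i % 8)
       in byte i, so only the selected bit decides the outcome. */
    const __m256i all_but_one = _mm256_set1_epi64x(0x7FBFDFEFF7FBFDFELL);
    v = _mm256_or_si256(v, all_but_one);

    /* Step 4: a byte equals 0xFF exactly when its selected bit was set. */
    return _mm256_cmpeq_epi8(v, _mm256_set1_epi8(-1));
}

/* Hypothetical check: movemask of the expanded vector gives the mask back. */
__attribute__((target("avx2")))
static uint32_t roundtrip(uint32_t mask)
{
    return (uint32_t)_mm256_movemask_epi8(inverse_movemask_epi8(mask));
}
```

The result can be passed directly as the third argument of _mm256_blendv_epi8. For the toggled variant, step 3 would instead AND with 0x8040201008040201 in each qword and step 4 would compare against zero.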