How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

The intrinsic:

int mask = _mm256_movemask_epi8(__m256i s1)

creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask.

What is the best way to do this?

Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only __m256i type mask instead of uint32_t. As such, in the resulting __m256i mask, I can ignore the bits other than the MSB of each byte.

Solution

Here is an alternative to LUT or pdep instructions that might be more efficient:

1. Copy your 32-bit mask to the low 4 bytes of some ymm register and to bytes 16..19 of the same register. You could use a temporary array and _mm256_load_si256. Or you could move a single copy of the 32-bit mask to the low bytes of some ymm register, then broadcast it with VPBROADCASTD (_mm256_broadcastd_epi32) or other broadcast/shuffle instructions. Both 128-bit lanes need their own copy because VPSHUFB in the next step only shuffles within a lane.
2. Rearrange the bytes of the register so that each of the low 8 bytes contains the low 8 bits of your mask, each of the next 8 bytes contains the next 8 bits, etc. This can be done with VPSHUFB (_mm256_shuffle_epi8) with a control register containing '0' in the low 8 bytes, '1' in the next 8 bytes, etc.
3. Select the proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256). With VPOR, set every bit except the one of interest (0xFE for the first byte of each group of 8, 0xFD for the second, and so on); with VPAND, keep only the bit of interest.
4. Set the MSB of the appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8), which sets the whole byte to 0xFF on a match. Compare each byte to 0xFF. If you want each bit of the mask toggled, use VPAND in the previous step and compare to zero.
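
The steps above could be sketched roughly as follows, using the VPOR variant. This is a minimal sketch, not a tuned implementation: the names inverse_movemask_epi8, movemask_roundtrip and AVX2_TARGET are made up for illustration, the broadcast uses _mm256_set1_epi32 (which compilers typically turn into VPBROADCASTD), and the target attribute is a GCC/Clang assumption so that the code builds without -mavx2. It still needs an AVX2-capable CPU to run.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Lets these functions use AVX2 without -mavx2. */
#if defined(__GNUC__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

/* Hypothetical helper: byte i of the result is 0xFF if bit i of mask is set,
   0x00 otherwise. */
AVX2_TARGET
static inline __m256i inverse_movemask_epi8(uint32_t mask)
{
    /* Step 1: put a copy of the mask in every dword, so both 128-bit
       lanes contain it. */
    __m256i v = _mm256_set1_epi32((int)mask);

    /* Step 2: VPSHUFB works per 128-bit lane. Bytes 0..7 get mask byte 0,
       bytes 8..15 get mask byte 1, bytes 16..23 byte 2, bytes 24..31 byte 3. */
    const __m256i shuf = _mm256_setr_epi64x(0x0000000000000000LL,
                                            0x0101010101010101LL,
                                            0x0202020202020202LL,
                                            0x0303030303030303LL);
    v = _mm256_shuffle_epi8(v, shuf);

    /* Step 3: OR byte j of each group of 8 with ~(1 << j), so every bit
       except the one being tested is forced to 1. */
    const __m256i bits = _mm256_set1_epi64x(0x7FBFDFEFF7FBFDFELL);
    v = _mm256_or_si256(v, bits);

    /* Step 4: a byte equals 0xFF exactly when its tested bit was set. */
    return _mm256_cmpeq_epi8(v, _mm256_set1_epi8(-1));
}

/* Hypothetical check: expanding a mask and collapsing it again with
   VPMOVMSKB should give back the same 32 bits. */
AVX2_TARGET
uint32_t movemask_roundtrip(uint32_t mask)
{
    return (uint32_t)_mm256_movemask_epi8(inverse_movemask_epi8(mask));
}
```

The vector returned by inverse_movemask_epi8 can be passed directly as the mask operand of _mm256_blendv_epi8, since that instruction only looks at the MSB of each byte.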

An additional benefit of this approach is flexibility: you could choose a different control register for step #2 and a different mask for step #3 to shuffle the bits of your bit mask (for example, you could copy this mask to the ymm register in reversed order).
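
As one illustration of that flexibility, here is a sketch of the reversed-order case, where byte i of the result reflects bit 31 - i of the mask. Only the two constants change relative to the straightforward version. The names inverse_movemask_reversed, movemask_reversed_roundtrip and AVX2_TARGET are hypothetical, and the target attribute again assumes GCC/Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Lets these functions use AVX2 without -mavx2. */
#if defined(__GNUC__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

/* Hypothetical helper: byte i of the result is 0xFF if bit (31 - i) of
   mask is set, 0x00 otherwise. */
AVX2_TARGET
static inline __m256i inverse_movemask_reversed(uint32_t mask)
{
    /* Every dword holds the mask, so each 128-bit lane can reach all
       four mask bytes. */
    __m256i v = _mm256_set1_epi32((int)mask);

    /* Different control register: the groups of 8 bytes now take mask
       bytes 3, 2, 1, 0 instead of 0, 1, 2, 3. */
    const __m256i shuf = _mm256_setr_epi64x(0x0303030303030303LL,
                                            0x0202020202020202LL,
                                            0x0101010101010101LL,
                                            0x0000000000000000LL);
    v = _mm256_shuffle_epi8(v, shuf);

    /* Different bit mask: byte j of each group tests bit (7 - j), so it
       is ORed with ~(0x80 >> j). */
    const __m256i bits =
        _mm256_set1_epi64x((long long)0xFEFDFBF7EFDFBF7FULL);
    v = _mm256_or_si256(v, bits);

    /* A byte equals 0xFF exactly when its tested bit was set. */
    return _mm256_cmpeq_epi8(v, _mm256_set1_epi8(-1));
}

/* Hypothetical check: collapsing the reversed vector with VPMOVMSKB
   should yield the bit-reversed mask. */
AVX2_TARGET
uint32_t movemask_reversed_roundtrip(uint32_t mask)
{
    return (uint32_t)_mm256_movemask_epi8(inverse_movemask_reversed(mask));
}
```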