I have eight 32-bit integers packed into each __m256i register. Now I need to compare the corresponding 32-bit values in two registers. I tried
__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);
which flags the equal pairs. That would be great, but I get an "illegal instruction" exception, likely because my processor doesn't support AVX-512 (this intrinsic needs AVX512F and AVX512VL).
I'm looking for an analogous intrinsic to quickly get the indexes of the equal pairs.
I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?
__m256i diff = _mm256_cmpeq_epi32(r1, r2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);
Yes. The cast intrinsics just reinterpret the bits in the YMM register, so this is 100% legal. And yes, the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
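To make the compare-then-movemask pattern concrete, here is a minimal sketch wrapped in a function. The name eq_mask_epi32 and the array-based interface are hypothetical, chosen only for illustration. The target attribute is a GCC/Clang extension assumed here so the file builds without -mavx2; compiling with -mavx2 and dropping the attribute should work just as well.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: returns an 8-bit mask where bit k is set
   when a[k] == b[k]. Needs a CPU with AVX2 at run time. */
__attribute__((target("avx2")))
static int eq_mask_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* vpcmpeqd: each 32-bit lane becomes all-ones if equal, else zero. */
    __m256i eq = _mm256_cmpeq_epi32(va, vb);

    /* The cast is free (no instruction); vmovmskps then collects the
       top bit of each 32-bit lane into the low 8 bits of an int. */
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

With a = {1,2,3,4,5,6,7,8} and b = {1,0,3,0,0,0,0,8}, lanes 0, 2 and 7 match, so the result should be 0x85.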
Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. For example, if you just want to test for any matches (i != 0) or all matches (i == 0xffffffff), you can avoid using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
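The any-match / all-match tests with the byte movemask could look like the sketch below. The function names are hypothetical, and the target attribute is again a GCC/Clang assumption that can be replaced by building with -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit mask with one bit per byte, so each
   equal 32-bit lane contributes four consecutive set bits. */
__attribute__((target("avx2")))
static uint32_t eq_bytemask_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);      /* vpcmpeqd */
    return (uint32_t)_mm256_movemask_epi8(eq);    /* vpmovmskb, stays in the integer domain */
}

/* Nonzero if at least one pair of elements is equal. */
static int any_equal(const int32_t a[8], const int32_t b[8])
{
    return eq_bytemask_epi32(a, b) != 0;
}

/* Nonzero if all eight pairs are equal. */
static int all_equal(const int32_t a[8], const int32_t b[8])
{
    return eq_bytemask_epi32(a, b) == 0xffffffffu;
}
```

Neither test cares that every lane's bit appears four times, so the repeated bits cost nothing here.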
But if that would cost you extra instructions, e.g. a right shift by 2 to scale the result of _mm_tzcnt_u32 down by 4 so you get the element index instead of the byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.