Search code examples
c++performanceoptimizationintrinsicsavx2

Analog of _mm256_cmp_epi32_mask for AVX2


I have 8 32-bit integers packed into __m256i registers. Now I need to compare corresponding 32-bit values in two registers. Tried

__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);

that flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.

Looking for an analogous intrinsic to quickly get indexes of the equal pairs.

Found a work-around (there is no _mm256_movemask_epi32); is the cast legal here?

__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);

Solution

  • Yes, cast intrinsics are just a reinterpret of the bits in the YMM registers, it's 100% legal and yes the asm you want the compiler to emit is vpcmpeqd / vmovmaskps.

    Or if you can deal with each bit being repeated 4 times, vpmovmskb also works, _mm256_movemask_epi8. e.g. if you just want to test for any matches (i != 0) or all-matches (i == 0xffffffff) you can avoid using a ps instruction on an integer result which might cost 1 extra cycle of bypass latency in the critical path.

    But if that would cost you extra instructions to e.g. scale by 4 after using _mm_tzcnt_u32 to find the element index instead of byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.