I plan to use _mm512_popcnt_epi64() to get an __m512i vector containing eight 64-bit values. I need to add those values in a pairwise fashion to get any of the following:
Is there a good way to do this on Zen4?
__m128i _mm512_cvtepi64_epi16(__m512i a); (vpmovqw) will narrow 64-bit elements to 16-bit. From there you can horizontally add pairs with _mm_madd_epi16(v, _mm_set1_epi16(0x0001)) (pmaddwd), or with shift / add / AND, or shift / zero-masked add.
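As a rough sketch of the vpmovqw + pmaddwd route (the function name pair_sums_narrow16 and the array-in / array-out interface are made up for illustration; in real code you'd pass the popcnt result straight through in registers). The target attribute is a GCC/Clang extension used here so the snippet builds without -march=znver4; drop it if you compile with the right -march flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: add adjacent pairs of eight 64-bit counts,
   producing four 32-bit sums. Assumes every count fits in a signed
   16-bit element, which is true for popcount results (at most 64). */
__attribute__((target("avx512f")))
void pair_sums_narrow16(const uint64_t counts[8], uint32_t sums[4])
{
    /* Stand-in for the vector that _mm512_popcnt_epi64 would produce. */
    __m512i v = _mm512_loadu_si512(counts);

    /* vpmovqw: truncate 8 x 64-bit to 8 x 16-bit in one XMM register. */
    __m128i w = _mm512_cvtepi64_epi16(v);

    /* pmaddwd with all-ones multipliers: w[2i]*1 + w[2i+1]*1 as 32-bit. */
    __m128i s = _mm_madd_epi16(w, _mm_set1_epi16(1));

    _mm_storeu_si128((__m128i *)sums, s);
}
```

After the first instruction everything is 128-bit, so only the narrowing itself pays the 512-bit cost on Zen 4.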
Narrowing to less than 512-bit as a first step is good for Zen4, since most 512-bit operations take extra cycles in the execution units (worse throughput and latency).
If you actually wanted the sums in a __m512i, you'd just shuffle within lanes and use a zero-masked vpaddq. For a __m256i result, you could start with vpmovqd to only narrow in half, setting up for _mm256_srli_epi64(v, 32) and _mm256_maskz_add_epi32(0x55, shifted, v).
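Something like the following, as a sketch only. The function names and the array interface are hypothetical, and for the 512-bit case I'm assuming you want each sum in the even qword with the odd qwords zeroed; vpunpckhqdq is just one possible in-lane shuffle. The target attribute is a GCC/Clang extension so this builds without -march flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 512-bit variant: sums land in qwords 0, 2, 4, 6;
   the odd qwords are zeroed by the zero-masking. */
__attribute__((target("avx512f")))
void pair_sums_zmm(const uint64_t counts[8], uint64_t out[8])
{
    __m512i v  = _mm512_loadu_si512(counts);
    /* In-lane shuffle (vpunpckhqdq): copy each odd qword down to the
       even position of its 128-bit lane. */
    __m512i hi = _mm512_unpackhi_epi64(v, v);
    /* Zero-masked vpaddq: keep only the even qwords. */
    __m512i s  = _mm512_maskz_add_epi64(0x55, v, hi);
    _mm512_storeu_si512(out, s);
}

/* Hypothetical 256-bit variant: four sums, each zero-extended to 64 bits. */
__attribute__((target("avx512f,avx512vl")))
void pair_sums_ymm(const uint64_t counts[8], uint64_t sums[4])
{
    __m512i v       = _mm512_loadu_si512(counts);
    /* vpmovqd: 8 x 64-bit -> 8 x 32-bit, so each qword holds a pair. */
    __m256i d       = _mm512_cvtepi64_epi32(v);
    /* Bring the odd dword of each pair down next to the even one. */
    __m256i shifted = _mm256_srli_epi64(d, 32);
    /* Zero-masked vpaddd: add in the even dwords, zero the odd ones. */
    __m256i s       = _mm256_maskz_add_epi32(0x55, shifted, d);
    _mm256_storeu_si256((__m256i *)sums, s);
}
```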
Mask register setup apparently sucks on Zen 4, with kmovb k, r32 costing 2 uops alone (https://uops.info), so if this isn't in a loop you might want to just use a vector constant for vpand. Or shift left then right, like srli( add(v, slli(v, 32)), 32). But once you have a mask in a mask register, using it is fine: vpaddd with zero-masking has 4/clock throughput on XMM/YMM registers, with 1 cycle latency for zero-masking (or 2 cycles for one of the inputs with merge-masking).
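For the one-off case where you'd rather not set up a mask register, the two mask-free alternatives might look roughly like this. The function names and array interface are made up for illustration, the input is assumed to be the vpmovqd-narrowed layout (two 32-bit counts per qword), and the target attribute is a GCC/Clang extension so it builds without -march flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical: shift / add / AND with a vector constant (vpand). */
__attribute__((target("avx512f")))
void pair_sums_and(const uint64_t counts[8], uint64_t sums[4])
{
    __m256i d = _mm512_cvtepi64_epi32(_mm512_loadu_si512(counts));
    /* Low dword of each qword becomes lo + hi; high dword is junk (hi). */
    __m256i t = _mm256_add_epi32(d, _mm256_srli_epi64(d, 32));
    /* Clear the junk high dword with a constant instead of a k mask. */
    __m256i s = _mm256_and_si256(t, _mm256_set1_epi64x(0xFFFFFFFFLL));
    _mm256_storeu_si256((__m256i *)sums, s);
}

/* Hypothetical: shift left, add, shift right; no constant needed. */
__attribute__((target("avx512f")))
void pair_sums_shift(const uint64_t counts[8], uint64_t sums[4])
{
    __m256i d = _mm512_cvtepi64_epi32(_mm512_loadu_si512(counts));
    /* High dword becomes hi + lo; the right shift then discards the
       low dword and zero-extends the sum. */
    __m256i t = _mm256_add_epi32(d, _mm256_slli_epi64(d, 32));
    __m256i s = _mm256_srli_epi64(t, 32);
    _mm256_storeu_si256((__m256i *)sums, s);
}
```

The shift version has a longer dependency chain (3 ops vs. 2 plus a constant load), so which one wins depends on whether you're latency-bound and whether the constant can stay in a register.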