Pairwise addition of 64-bit values in an __m512i?

I plan to use _mm512_popcnt_epi64() to get an __m512i vector containing eight 64-bit values. I need to add those values in a pairwise fashion to get any of the following:

an __m512i vector containing four 128-bit values
an __m256i vector containing four 64-bit values
an __m128i vector containing four 32-bit values

Is there a good way to do this on Zen4?

Solution

__m128i _mm512_cvtepi64_epi16( __m512i a); (vpmovqw) will narrow 64-bit elements to 16-bit. From there you can horizontally add pairs with _mm_madd_epi16(v, _mm_set1_epi16(0x0001)) (pmaddwd), or with shift / add / AND, or shift / zero-masked add.

Narrowing to less than 512-bit as a first step is good for Zen4, since most 512-bit operations take extra cycles in the execution units (worse throughput and latency).

If you actually wanted a __m512i you'd just shuffle within lanes for a zero-masked vpaddq, or a __m256i could start with vpmovqd to only narrow in half, setting up for _mm256_srli_epi64(v, 32) and _mm256_maskz_add_epi32(0x55, shifted, v)

Mask register setup apparently sucks on Zen 4, with kmovb k, r32 costing 2 uops alone (https://uops.info), so if this isn't in a loop you might want to just use a vector constant for vpand. Or shift left then right, like srli( add(v, slli(v, 32)), 32). But once you have a mask in a mask register, using it is fine: vpaddd with zero-masking is 4/clock throughput on XMM/YMM registers, with 1 cycle latency for zero-masking. (Or 2 cycles for one of the inputs in merge-masking).