This code comes from https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp.
std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo = _mm256_and_si256(vec, low_mask); \
        //                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ why do we need this?
        const __m256i hi = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();

        ITER ITER ITER ITER
        ITER ITER ITER ITER

        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }
    ... the rest is unrelated to the question
The code is used to replace the builtin_popcnt function, which counts the number of 1 bits in a given input. What bothers me are these two lines:
const __m256i lo = _mm256_and_si256(vec, low_mask); \
const __m256i hi = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
According to the Intel intrinsics guide https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=6392,305,6221,6389,6389,6221,6188,6769,6389,124,6050,6389&text=mm256_shuffle , the _mm256_shuffle_epi8 instruction only looks at the lower 4 bits of each packed byte of b:
__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        #             ^^^^^^^^ only looks at the lower 4 bits
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0
So if I'm not mistaken, you can just do
const __m256i lo = vec; \
const __m256i hi = _mm256_srli_epi16(vec, 4); \
I'm sort of new to AVX, so please tell me if there's anything wrong here.
[v]pshufb looks at the high bit to zero that output element, unfortunately. In the pseudocode you quoted:
IF b[i+7] == 1                         # if high bit set
    dst[i+7:i] := 0                    # zero that output element
ELSE
    ... the part you were looking at   # else index the source
The intrinsics guide only covers it in the pseudocode, not the text. As usual, the asm manual entry's description spells it out more clearly:
If the most significant bit (bit[7]) of each byte of the shuffle control mask is set, then constant zero is written in the result byte
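In C++ terms, each output byte of (v)pshufb behaves like the scalar model below. This is my own illustration of the quoted pseudocode; pshufb_byte is a made-up name, not a real intrinsic.

#include <cstdint>

// One output byte of pshufb within a 16-byte lane (illustrative model only).
std::uint8_t pshufb_byte(const std::uint8_t lane[16], std::uint8_t idx) {
    if (idx & 0x80)               // bit 7 set: the output byte is forced to zero
        return 0;
    return lane[idx & 0x0f];      // otherwise only the low 4 bits index the lane
}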
It's useful for some problems, but for pshufb as a nibble-LUT it does require 2 [v]pand instructions, including for the high nibbles, because x86 doesn't have a SIMD byte shift. The narrowest is psrlw with 16-bit elements, so the low byte of each pair would get its neighboring byte's low nibble (garbage) shifted into its high bits, unless your input data is known to always have those bit-positions clear.
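If you want to see it go wrong, here's a minimal sketch of my own (not from the linked repo; lut_popcnt_32bytes is a made-up name) that pushes one 32-byte block of 0xFF through the nibble-LUT step with and without the two vpand. Masked, it counts 256 bits; unmasked, every low-nibble index has bit 7 set, and so does half of the shifted high-nibble indices, so pshufb zeroes those lookups and you only get 64.

// Compile with -mavx2. Sketch only: demonstrates why the masks matter.
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

static std::uint64_t lut_popcnt_32bytes(const std::uint8_t* p, bool mask_nibbles) {
    const __m256i lookup = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    __m256i lo = vec;
    __m256i hi = _mm256_srli_epi16(vec, 4);
    if (mask_nibbles) {                        // the two vpand from the real code
        lo = _mm256_and_si256(lo, low_mask);
        hi = _mm256_and_si256(hi, low_mask);
    }
    const __m256i counts = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                           _mm256_shuffle_epi8(lookup, hi));
    const __m256i sums = _mm256_sad_epu8(counts, _mm256_setzero_si256());
    std::uint64_t q[4];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(q), sums);
    return q[0] + q[1] + q[2] + q[3];
}

int main() {
    std::uint8_t buf[32];
    for (std::uint8_t &b : buf) b = 0xFF;      // 32 * 8 = 256 set bits
    std::printf("with masks:    %llu\n",
                (unsigned long long)lut_popcnt_32bytes(buf, true));   // 256
    std::printf("without masks: %llu\n",
                (unsigned long long)lut_popcnt_32bytes(buf, false));  // 64, not 256
}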
AVX-512 VBMI (Ice Lake and newer) vpermb doesn't have this downside, but it's lane-crossing so it has 3c latency instead of 1 on CPUs that support it. Luckily it's still only 1 uop on Ice Lake, unlike vpermt2w and vpermt2b, which aren't single-uop even on Ice Lake (https://uops.info).
But it could be slower on any future CPUs that do AVX-512 by decoding into 2x 256-bit halves, like some future Intel efficiency cores. (Alder Lake E-cores have 128-bit wide execution units and already split 256-bit vectors into two halves; supporting AVX-512 with 4 uops per instruction would start to get silly, I guess. And unfortunately Intel didn't design a way to expose the new AVX-512 functionality at only 128 and 256-bit width, like masking, better shuffles, vpternlogd, etc.)
Zen 4 has efficient handling of 512-bit instructions: still single-uop, with at worst half the throughput of 256-bit ops, the same uop occupying an execution unit for 2 cycles.
So unlike Zen 1, where lane-crossing AVX1/2 shuffles like vpermq and vperm2f128 were several uops because the shuffle units were truly only 128 bits wide, Zen 4 has 1/clock throughput for vpermb zmm vs. 2/clock for vpermb ymm/xmm. The 512-bit version has 6-cycle latency, up from 4 cycles for ymm and 2 cycles for xmm. (https://uops.info/)
Using vpermb as a drop-in replacement for vpshufb, the LUT can still be broadcast-loaded from a 16-byte source, since it just repeats in each lane. Then you can leave the bits above the low 4 unzeroed, as long as indices 0, 16, 32, and 48 all read the same value, etc.
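For example, one 64-byte step of the lookup popcount could look roughly like this with vpermb, with no vpand at all. This is a sketch of my own under those assumptions, not code from the linked repo; it needs AVX-512 VBMI and BW (e.g. compile with -mavx512vbmi -mavx512bw), and the function name is made up.

#include <immintrin.h>
#include <cstdint>

std::uint64_t popcnt_vpermb_64bytes(const std::uint8_t* p) {
    // 16-byte nibble LUT broadcast into all four 128-bit lanes of a zmm.
    const __m512i lookup = _mm512_broadcast_i32x4(
        _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4));
    const __m512i vec = _mm512_loadu_si512(p);
    // vpermb uses index bits [5:0] and ignores [7:6]; bits [5:4] just pick one
    // of the four identical LUT copies, so neither nibble needs masking.
    const __m512i lo_cnt = _mm512_permutexvar_epi8(vec, lookup);
    const __m512i hi_cnt = _mm512_permutexvar_epi8(_mm512_srli_epi16(vec, 4), lookup);
    const __m512i sums = _mm512_sad_epu8(_mm512_add_epi8(lo_cnt, hi_cnt),
                                         _mm512_setzero_si512());
    return _mm512_reduce_add_epi64(sums);   // horizontal sum of the 8 qword sums
}

In a real loop you'd keep accumulating the per-byte counts like the AVX2 version does, and only do the psadbw / horizontal sum once per inner block.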
Or of course it opens up the possibility of a wider LUT, like for extremely efficient base64 encoding with vpmultishiftqb for parallel bitfield extraction. (https://github.com/aklomp/base64/blob/master/lib/arch/avx512/enc_reshuffle_translate.c or https://github.com/WojciechMula/base64simd)