I have a __mmask64 as a result of a few AVX512 operations:
__mmask64 mboth = _kand_mask64(lres, hres);
I would like to count the number of nibbles in this that have all bits set (0xF).
The simple solution is to do this:
uint64 imask = (uint64)mboth;
while (imask) {
    if ((imask & 0xf) == 0xf)
        ret++;
    imask = imask >> 4;
}
I wanted something better, but what I came up with doesn't feel elegant:
//outside the loop
__m512i b512_1s = _mm512_set1_epi32(0xffffffff);
__m512i b512_0s = _mm512_set1_epi32(0x00000000);
//then...
__m512i vboth = _mm512_mask_set1_epi8(b512_0s, mboth, 0xff);
__mmask16 bits = _mm512_cmpeq_epi32_mask(b512_1s, vboth);
ret += __builtin_popcount((unsigned int)bits);
The above puts a 0xff byte into the vector wherever a 1 bit exists in the mask, then gets a 1 bit in the bits mask where the blown-up 0xf nibbles are now found as 0xffffffff int32's.
I feel that two 512-bit operations are way overkill when the original data lives in a 64-bit number. This alternate is probably much worse; it's too many instructions and still operates on 128 bits:
//outside the loop
__m128i b128_1s = _mm_set1_epi32(0xffffffff);
//then...
uint64 maskl = mboth & 0x0f0f0f0f0f0f0f0f;
uint64 maskh = mboth & 0xf0f0f0f0f0f0f0f0;
uint64 mask128[2] = { (maskl << 4) | maskl, (maskh >> 4) | maskh };
__m128i bytes = _mm_cmpeq_epi8(b128_1s, *(__m128i*)mask128);
uint bits = _mm_movemask_epi8(bytes);
ret += __builtin_popcount(bits);
With just some scalar operations you can do this:
imask &= imask << 2;
imask &= imask << 1;
ret += std::popcount(imask & 0x8888888888888888);
The first two steps put, for every nibble, the horizontal AND of that nibble's bits into its most significant bit. The other bits of each nibble become garbage we don't care about, so we mask them out and popcount the result.
The shifts could go to the right (as in an earlier version of this answer) or they could be rotates, whichever works out best.
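If you want to check it against the naive loop from the question, here is a minimal self-contained test (my addition; the function names are just for this test, and it assumes a C++20 compiler for std::popcount):

#include <bit>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Reference: the per-nibble loop from the question.
static int count_nibbles_naive(uint64_t imask) {
    int ret = 0;
    while (imask) {
        if ((imask & 0xf) == 0xf)
            ret++;
        imask >>= 4;
    }
    return ret;
}

// The shift/AND/popcount trick from above.
static int count_nibbles_scalar(uint64_t imask) {
    imask &= imask << 2;
    imask &= imask << 1;
    return std::popcount(imask & 0x8888888888888888);
}

int main() {
    for (uint64_t x : {0ull, ~0ull, 0xF0F01234FFFF000Full, 0x0123456789ABCDEFull})
        assert(count_nibbles_naive(x) == count_nibbles_scalar(x));
}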
Clang makes efficient asm from this scalar version, with no wasted instructions other than xor-zeroing ahead of popcnt, which should go away with inlining since it can popcnt same,same even without planning ahead to have the result in EAX for the calling convention.
GCC does ok, but reorders the &= mask last so it's part of the critical-path latency, not in parallel with the shifts, despite our best efforts to make the source look like single asm operations to try to hand-hold it into making better asm.
MSVC is weird with this, turning it into right shifts, as well as doing the &= mask last like GCC.
#include <bit>      // std::popcount
#include <cstdint>

// Clang compiles this optimally, hopefully also when inlining
// GCC still does the & mask last, on the critical path
// MSVC mangles this, with two right shifts despite the source going left, and deoptimizes latency like GCC
int count_nibbles(uint64_t imask)
{
    uint64_t mask = 0x2222222222222222; // movabs, or hoisted out of a loop
    uint64_t shifted = imask << 1;      // LEA dst, [src+src] into a new reg
    shifted &= imask;                   // AND: b1&b0 at bit 1, b3&b2 at bit 3 of each nibble
    imask = shifted & mask;             // AND into the original reg, in parallel with the SHR below
    shifted >>= 2;                      // SHR: brings b3&b2 down to bit 1
    shifted &= imask;                   // AND: bit 1 survives only if the whole nibble was 0xF
    return std::popcount(shifted);      // POPCNT
}
This version also prevents clang from deoptimizing a shift or rotate into lea reg, [0 + reg*4], which is 8 bytes long and has 2-cycle latency on Alder Lake / Sapphire Rapids (https://uops.info/).
Godbolt for this and several other versions (including a portable version of chtz's ADD/ADC trick). Using asm("" : "+r"(imask)) at a certain point in the function can force GCC not to deoptimize the order of operations, but that could stop it optimizing this as part of a larger loop.
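As a sketch of what that could look like (the exact placement is my guess, not something pinned down above), putting the barrier on the masked value keeps GCC from re-associating the & mask to the end of the chain:

#include <bit>
#include <cstdint>

int count_nibbles_barrier(uint64_t imask)
{
    uint64_t mask = 0x2222222222222222;
    uint64_t shifted = imask << 1;
    shifted &= imask;
    imask = shifted & mask;
    asm("" : "+r"(imask));   // empty GNU extended asm: emits no instructions, but is opaque to the optimizer
    shifted >>= 2;
    shifted &= imask;
    return std::popcount(shifted);
}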
Writing this with multiple operations on the same source line doesn't hurt anything for Clang, and spelling it out one operation per statement as above still didn't stop GCC from screwing it up, but it does illustrate what the optimal asm should look like. You might prefer to compact it back up into fewer C statements.
GCC's reordering to group shift-and-AND together is useful in general for AArch64, where and x1, x0, x0, lsr 2 is possible. But even then, instruction-level parallelism would be possible while still only using 3 AND instructions, two of them with shifted operands; GCC/Clang/MSVC miss that optimization. AArch64 repeated-pattern immediates for bitwise instructions do allow 0x2222222222222222 or 0x8888888888888888, so no separate constant setup is needed.
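To make that concrete, here is one arrangement that fits that description, written as portable C++ (my sketch with a hypothetical function name; the asm comments show the shape a compiler could emit, not what current compilers actually produce):

#include <bit>
#include <cstdint>

// Hoped-for AArch64 codegen, roughly:
//   and x1, x0, x0, lsl #1            // a: adjacent-pair ANDs
//   and x2, x1, #0x2222222222222222   // b: bitmask immediate, no constant setup
//   and x0, x2, x1, lsr #2            // c: shifted operand brings b3&b2 down to bit 1
//   ...then a popcount
int count_nibbles_3and(uint64_t x)
{
    uint64_t a = x & (x << 1);            // b1&b0 at bit 1, b3&b2 at bit 3 of each nibble
    uint64_t b = a & 0x2222222222222222;  // keep b1&b0 of each nibble
    uint64_t c = b & (a >> 2);            // combine with b3&b2; bit 1 set iff the nibble was 0xF
    return std::popcount(c);
}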