HAKMEM Hamming Weight bithack has a bug, any way to save it?

;if A is a 9 bit quantity, B gets number of 1's (Schroeppel)
  IMUL A,[1001001001] ;4 copies
  AND A,[42104210421] ;every 4th bit
  IDIVI A,17 ;casting out 15.'s in hexadecimal

With 32-bit arithmetic, this function seems to need a 33rd bit to count the input bit in the 32s place (bit 5).

uint32_t i = 0b11101011;
uint32_t u = i * (uint32_t)01001001001;
uint32_t x = u & (uint32_t)042104210421;
uint32_t v = x % 017;
std::cout << "i: " << std::bitset<8>(i) << ", u: " << std::bitset<32>(u) <<
", x: " << std::bitset<32>(x) << ", v: " << v << std::endl;

Gives:

i: 11101011
u: 01011011101011011101011011101011
x: 00010001000000010001000000000001
v: 5

But:

uint64_t v = i;
uint64_t u = v * (uint64_t)01001001001;
uint64_t x = u & (uint64_t)042104210421;
v = x % 017;
std::cout << "i: " << std::bitset<8>(i) << ", u: " << std::bitset<33>(u) <<
", x: " << std::bitset<33>(x) << ", v: " << v << std::endl;

Gives:

i: 11101011
u: 101011011101011011101011011101011
x: 100010001000000010001000000000001
v: 6

Because of its very low instruction count (the idiv is expensive, but the number of instructions is what matters in my use case), I'd like to use this or a similar function. However, I don't quite understand how the modulus by 15 works.

I only need to handle inputs up to 7 bits wide (though 8 would be ideal). What would be the best way to fix this function?

Solution

In the following I am assuming 8-bit a. The original HAKMEM code was likely designed for a machine with a 36-bit word, common at the time of its creation.

The problem is that the code as-is misses bit 5 of a, which maps to bit 32 of the product and therefore isn't representable in a 32-bit word. At the same time, bit 8 of the product goes unused. So we can isolate bit 5 of a and move it to bit 8 of the product. Then we mask the lowest bit in every nibble and sum the nibbles with a multiplication rather than a division, so the sum winds up in the highest nibble. The resulting C code is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int reference_popc (uint32_t a)
{
    int res = 0;
    while (a) {
        a &= a - 1;
        res++;
    }
    return res;
}

// based on HAKMEM item 167
int hakmem_popc_byte (uint8_t a)
{
    int r;
    r = (((((uint32_t)a * 01001001001) | ((a & 0x20) << 3)) & 0x11111111) * 0x11111111) >> 28;
    return r;
}

int main (void)
{
    uint8_t a = 0;
    do {
        if (hakmem_popc_byte(a) != reference_popc (a)) {
            printf ("error @ %08x: res=%d  ref=%d\n", 
                    a, hakmem_popc_byte(a), reference_popc (a));
            return EXIT_FAILURE;
        }
        a = a + 1;
    } while (a);
    return EXIT_SUCCESS;
}
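
On the modulus the question asks about: because 16 leaves remainder 1 when divided by 15, a number and the sum of its hexadecimal digits leave the same remainder modulo 15 (the hexadecimal analogue of casting out nines). After masking, every nibble holds 0 or 1, so the digit sum is the bit count, and for at most 9 bits it stays below 15. Here is a minimal sketch of the original sequence, assuming a 64-bit intermediate is acceptable so that bit 32 survives. The names nibble_sum and hakmem_popc9_u64 are made up for illustration.

```c
#include <stdint.h>

// Hypothetical helper: sum of the hexadecimal digits (nibbles) of x.
int nibble_sum (uint64_t x)
{
    int s = 0;
    while (x) {
        s += (int)(x & 0xf);
        x >>= 4;
    }
    return s;
}

// Original HAKMEM sequence with a 64-bit intermediate (an assumption made
// here so that bit 32 is kept); valid for inputs of up to 9 bits.
int hakmem_popc9_u64 (uint32_t a)
{
    uint64_t u = (uint64_t)(a & 0x1ff) * 01001001001ULL;  // 4 copies
    uint64_t x = u & 042104210421ULL;                     // every 4th bit
    // 16 == 1 (mod 15), so x % 15 == nibble_sum(x) % 15; the sum is at
    // most 9 here, so the remainder is the bit count itself.
    return (int)(x % 017);
}
```

This keeps the division, so it mainly serves to show why the remainder works; the 32-bit versions in this answer replace it with a multiply and a shift.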

After looking some more at the bit pattern produced by the initial multiplication, I observed that we can do better than the quick fix above. For an 8-bit a, the initial multiply always leaves bits 8, 17, and 26 of the product at zero. To avoid hitting any of these while selecting every fourth bit by masking, we can use the mask 0x88888888. Each selected nibble then holds either 0 or 8, so the extracted data has to be shifted down by 3 before summing to avoid overflow in the most significant nibble. The resulting code is:

// based on HAKMEM item 167
int hakmem_popc_byte (uint8_t a)
{
    int r;
    r = (((((uint32_t)a * 01001001001) & 0x88888888) >> 3) * 0x11111111) >> 28;
    return r;
}
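
The statements about which input bits each mask picks up can be checked mechanically. The sketch below counts how many copies of each input bit survive the 32-bit multiply and the mask. The helper names copies_under_mask and mask_covers_byte are made up for illustration and are not part of the HAKMEM code.

```c
#include <stdint.h>

// Hypothetical helper: how many copies of input bit k (0..7) survive the
// 32-bit multiply by 01001001001 followed by masking with `mask`.
int copies_under_mask (int k, uint32_t mask)
{
    uint32_t x = (((uint32_t)1 << k) * 01001001001) & mask;
    int n = 0;
    while (x) {        // clear the lowest set bit until none remain
        x &= x - 1;
        n++;
    }
    return n;
}

// Returns 1 if every bit of an 8-bit input is counted exactly once.
int mask_covers_byte (uint32_t mask)
{
    for (int k = 0; k < 8; k++) {
        if (copies_under_mask (k, mask) != 1) {
            return 0;
        }
    }
    return 1;
}
```

With these helpers, mask_covers_byte (0x11111111) returns 0, because the only copy of bit 5 that the mask would select sits at bit 32, while mask_covers_byte (0x88888888) returns 1.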