Efficiently compute the hashCode for a BitSet-like implementation of Set<Integer>

I wonder, how to efficiently compute the hashCode for a BitSet-like implementation of Set<Integer>.

The BitSet#hashCode is obviously fast to compute, rather stupid(*) and incompatible with Set#hashCode().

A fast compatible implementation could go like

int hashCode() {
    int result = 0;
    for (int i=0; i<bits.length; ++i) {
        long word = bits[i];
        result += 64 * i * Long.bitCount(word) + weightedBitCount(word);
    }
    return result;
}

if there was an efficient implementation of

int weightedBitCount(long word) { // naive implementation
    int result = 0;
    for (int i=0; i<64; ++i) {
        if ((word & (1L << i)) != 0) {
            result += i;
        }
    }
    return result;
}

In case most bits are unset, the naive implementation could be improved by testing word==0 or using Long.highestOneBit or alike, but these tricks don't help much and are detrimental in other cases.

Is there a clever trick to significantly speed it up in general?

I guess, some batch computation over multiple words could be more efficient. Computing Set#size at the same time would be a nice bonus.

A note concerning premature optimization: Yes, I know. I'm mostly curious (and it can be useful for things like Project Euler).

(*) There are many bits which gets completely ignored (they get shifted out in the multiplication).

Solution

This is what I did:

int weightedBitCount(long word) {
       return (Long.bitCount(word & 0xFFFF_FFFF_0000_0000L) << 5)
            + (Long.bitCount(word & 0xFFFF_0000_FFFF_0000L) << 4)
            + (Long.bitCount(word & 0xFF00_FF00_FF00_FF00L) << 3)
            + (Long.bitCount(word & 0xF0F0_F0F0_F0F0_F0F0L) << 2)
            + (Long.bitCount(word & 0xCCCC_CCCC_CCCC_CCCCL) << 1)
            + (Long.bitCount(word & 0xAAAA_AAAA_AAAA_AAAAL) << 0);
}

It's pretty simple: With a single bit set, e.g., the bit 10, word looks like 0x0000_0000_0000_0400L and only the masks 0xFF00_FF00_FF00_FF00L and 0xCCCC_CCCC_CCCC_CCCCL produce a bit count of 1, so we get

(0 << 5) + (0 << 4) + (1 << 3) + (0 << 2) + (1 << 1) + (0 << 5) = 10

It needs some 6*4 instructions (maybe 6 cycles on modern Intel) per 64 bits, so it's not really slow, but it's still too slow when compared with the bulk bitset operations which need a single instruction (per 64 bits).

So I'm playing with some batch computation over multiple words.