Search code examples
c++gccassemblybitset

How to speed up bit testing


I'm pondering at how to speed up bit testing in the following routine:

void histSubtractFromBits(uint64* cursor, uint16* hist){
    //traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
    std::bitset<64> a(*cursor);
    std::bitset<64> b(*(cursor+1));
    std::bitset<64> c(*(cursor+2));
    std::bitset<64> d(*(cursor+3));
    for(int bit = 0; bit < 64; bit++){
        hist[bit] -= a.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+64] -= b.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+128] -= c.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+192] -= d.test(bit);
    }
}

The actual gcc implementation does a range-check for the bit argument, then &-s with a bitmask. I could do it without the bitsets and with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).

I'm not really familiar with the x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.

1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?

2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?


Solution

  • As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:

    uint64_t a = *cursor;
    for(int bit = 0; a != 0; bit++, a >>= 1) {
        hist[bit] -= (a & 1);
    }
    

    Alternatively, if you expect only very few bits to be set and are happy with gcc specific stuff, you could use __builtin_ffsll

    uint64_t a = *cursor;
    int next;
    for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
        bit += next;
        hist[bit - 1] -= 1;
        a >>= next;
    }
    

    The idea should be fine, but no warranty for the actual code :)

    Update: code using vector extensions:

    typedef short v8hi __attribute__ ((vector_size (16)));
    
    static v8hi table[256];
    
    void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
    {
        uint8_t* cursor_tmp = (uint8_t*)cursor;
        v8hi* hist_tmp = (v8hi*)hist;
        for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
        {
            *hist_tmp -= table[*cursor_tmp];
        }
    }
    
    void setup_table()
    {
        for(int i = 0; i < 256; i++)
        {
            for(int j = 0; j < 8; j++)
            {
                table[i][j] = (i >> j) & 1;
            }
        }
    }
    

    This will be compiled to SSE instructions if available, for example I get:

            leaq    32(%rdi), %rdx
            .p2align 4,,10
            .p2align 3
    .L2:
            movzbl  (%rdi), %eax
            addq    $1, %rdi
            movdqa  (%rsi), %xmm0
            salq    $4, %rax
            psubw   table(%rax), %xmm0
            movdqa  %xmm0, (%rsi)
            addq    $16, %rsi
            cmpq    %rdx, %rdi
            jne     .L2
    

    Of course this approach relies on the table being in cache.