Search code examples
optimizationintegerbit-manipulationmathematical-optimizationunsigned

Which is the fastest way to find the last N bits of an integer?


Which algorithm is fastest for returning the last n bits in an unsigned integer?

1.

return num & ((1 << bits) - 1)

2.

return num % (1 << bits)

3.

let shift = num.bitWidth - bits
return (num << shift) >> shift

(where bitWidth is the width of the integer, in bits)

Or is there another, faster algorithm?


Solution

  • This is going to depend heavily on what compiler you have, what the optimization settings are, and what size of integers you're working with.

    My hypothesis going into this section was that the answer would be "the compiler will be smart enough to optimize all of these in a way that's better than whatever you'd choose to write." And in some sense, that's correct. Consider the following three pieces of code:

    #include <stdint.h>
    #include <limits.h>
    
    uint32_t lastBitsOf_v1(uint32_t number, uint32_t howManyBits) {
        return number & ((1 << howManyBits) - 1);
    }
    
    uint32_t lastBitsOf_v2(uint32_t number, uint32_t howManyBits) {
        return number % (1 << howManyBits);
    }
    
    uint32_t lastBitsOf_v3(uint32_t number, uint32_t howManyBits) {
        uint32_t shift = sizeof(number) * CHAR_BIT - howManyBits;
        return (number << shift) >> shift;
    }
    

    Over at the godbolt compiler explorer with optimization turned up to -Ofast with -march=native enabled, we get this code generated for the three functions:

    lastBitsOf_v1(unsigned int, unsigned int):
            bzhi    eax, edi, esi
            ret
    
    lastBitsOf_v2(unsigned int, unsigned int):
            bzhi    eax, edi, esi
            ret
    
    lastBitsOf_v3(unsigned int, unsigned int):
            mov     eax, 32
            sub     eax, esi
            shlx    edi, edi, eax
            shrx    eax, edi, eax
            ret
    

    Notice that the compiler recognized what you were trying to do with the first two versions of this function and completely rewrote the code to use the bzhi x86 instruction. This instruction copies the lower bits of one register into another. In other words, the compiler was able to generate a single assembly instruction! On the other hand, the compiler didn't recognize what the last version was trying to do, so it actually generated the code as written and actually did the shifts and subtraction.

    But that's not the end of the story. Imagine that the number of bits to extract is known in advance. For example, suppose we want the lower 13 bits. Now, watch what happens with this code:

    #include <stdint.h>
    #include <limits.h>
    
    uint32_t lastBitsOf_v1(uint32_t number) {
        return number & ((1 << 13) - 1);
    }
    
    uint32_t lastBitsOf_v2(uint32_t number) {
        return number % (1 << 13);
    }
    
    uint32_t lastBitsOf_v3(uint32_t number) {
        return (number << 19) >> 19;
    }
    

    These are literally the same functions, just with the bit amount hardcoded. Now look at what gets generated:

    lastBitsOf_v1(unsigned int):
            mov     eax, edi
            and     eax, 8191
            ret
    lastBitsOf_v2(unsigned int):
            mov     eax, edi
            and     eax, 8191
            ret
    lastBitsOf_v3(unsigned int):
            mov     eax, edi
            and     eax, 8191
            ret
    

    All three versions get compiled to the exact same code. The compiler saw what we're doing in each case and replaced it with this much simpler code that's basically the first version.

    After seeing all of this, what should you do? My recommendation would be the following:

    1. Unless this code is an absolute performance bottleneck - as in, you've measured your code's runtime and you're absolutely certain that the code for extracting the low bits of numbers is what's actually slowing you down - I wouldn't worry too much about this at all. Pick the most readable code that you can. I personally find option (1) the cleanest, but that's just me.

    2. If you absolutely must get every ounce of performance out of this that you can, rather than taking my word for it, I'd recommend tinkering around with different versions of the code and seeing what assembly gets generated in each case and running some performance experiments. After all, if something like this is really important, you'd want to see it for yourself!

    Hope this helps!