Search code examples
bit-manipulationintrinsicsinteger-arithmetic

How to unset N right-most set bits


There is a relatively well-known trick for unsetting a single right-most bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?

Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.

Maybe there is some hardware support for it?


Solution

  • For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen3 has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
    Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.

    I'm not aware of other ISAs having a similar bit-packing hardware operation.

    (For the much simpler problem of simply the low n bits, x & (-1ULL<<n) if you don't need to handle n=64. Otherwise maybe bzhi against -1 / andn.)


    pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

    But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

    uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
        return _pdep_u64(-1ULL << n, a);
    }
    

    -1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.

    On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.

    GCC and clang both compile it the obvious way, as written:

    # GCC10.2 -O3 -march=skylake
    unset_first_n_bits_bmi2(unsigned long, int):
            mov     rax, -1
            shlx    rax, rax, rsi
            pdep    rax, rax, rdi
            ret
    

    (SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0. uops.info reports Alder Lake has 3 cycle latency for shlx, but that doesn't make much sense and is inconsistent with InstLatx64 measurements so I'm pretty sure SHLX/SHRX latency is still 1 cycle.)

    So this has 3 cycle latency from a->output (just pdep)
    and 4 cycle latency from n->output (shlx, pdep).

    And is only 3 uops for the front-end.


    A semi-related BMI2 trick:

    pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

    Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.


    Without fast PDEP: perhaps binary search for the right popcount

    @Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

    Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).

    Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.

    High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).

    Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.