c++performance bit-manipulation c++20 cpu-architecture

The most efficient way to test if a positive integer is 2^n (i.e. 1, 2, 4, 8, etc.) in C++20?

A handy method to verify if a positive integer n is a power of two (like 1, 2, 4, 8, etc.) is to use the following test for having no more than 1 bit set:

bool test = n & (n - 1) == 0;

This operation can be very efficient because it only involves subtraction, a bitwise AND and a conditional branch on the Zero Flag (ZF). If this expression is evaluated to true, then the number n is indeed a power of two.

Another method uses the std::popcount (population count) function, which is part of the C++20 standard library, for the test:

bool test = std::popcount(n) == 1; // (Since C++20)

This function counts the number of set bits (1s) in. If the hardware supports a popcount instruction (POPCNT), this function can be very fast.

In C++, you generally “pay for what you use”. For this test there is no use for counting.

What is the better method, in terms of CPU efficiency?

Solution

You know your number is positive (which excludes zero) so you can indeed just use n & (n-1) == 0 without checking n != 0. That's your most efficient option, potentially more efficient than C++20 std::has_single_bit

std::has_single_bit needs to rule out the no-bits-set case. For that, it can be slightly more efficient on modern x86 to do popcount(n) == 1 if the compiler can assume support for a hardware popcnt instruction, which is why std::has_single_bit is often defined that way in C++ standard libraries.

But since you know your number is non-zero, the bithack is most efficient. Especially if compiling for a target where the compiler can't assume a hardware popcount (like x86 without -march=x86-64-v2 or newer), or AArch64 before ARMv9.x where scalar popcount requires copying to vector regs and back. RISC-V only has hardware popcount in an uncommon extension, not baseline.

On x86, it can be as cheap as

  lea   eax, [rdi-1]
  test  eax, edi
   # boolean condition in ZF
  jz  or setz  or cmovz  or whatever

And similar on AArch64 with sub and tst. And pretty much any other modern RISC can subtract 1 while putting the result into a separate register, then AND those together cheaply.

And if you're compiling for -march=x86-64-v3 or later (BMI2 + AVX2 + FMA = Haswell feature set), the compiler can use BMI1 blsr eax, edi to clear ("reset") the lowest set bit and set FLAGS accordingly, with ZF set according to the output being zero. CF is set if the input was zero so some branch conditions can check that n!=0. But unfortunately conditions like jbe are CF=1 or ZF=1, ja is CF=0 and ZF=0. There isn't a single FLAGS condition that checks for ZF=1 and CF=0 which would let std::has_single_bit compile to just blsr plus a single branch, cmov, or setcc. ja is taken if the input had multiple bits set, not-taken for power-of-2 or zero.

Unlike test, blsr can't macro-fuse with a later jcc so it doesn't save any uops vs. lea/test if you're branching on it. It is better on Intel if the compiler is using it branchlessly, like for setnz or cmovnz. blsr is a single uop on Intel, and on AMD Zen 4 and later. 2 uops on Zen 3 and earlier (https://uops.info/)

Non-popcount way to exclude zero

For use-cases where you can't assume a non-zero input and don't have cheap hardware popcount, there's an alternate bithack that's 3 operations instead of two: (n - 1) < (n ^ (n - 1))

The right hand side is what x86 BMI1 blsmsk computes, but if we have blsmsk we have popcnt.
n-1 is a common subexpression so we only need to compute it once. For example RISC-V; AArch64 could be similar with cmp/bltu

 addi  x1, x0, -1             # n-1
 xor   x2, x1, x0             # n ^ (n-1)
 bltu  x1, x2, power_of_two   # branch on a comparison

For zero, n-1 is UINT_MAX, and n ^ anything is a no-op, so both sides are equal.

For a power of two, n-1 sets all the bits below where the set bit was, and XOR sets that bit again. So it's larger than n, and also larger than n-1.

For a non-power-of-two, n ^ (n-1) is still just a mask up to and including the lowest set bit, with the high bits cancelled (the ones that n-1 didn't flip). So it's smaller than n-1.

https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2 also suggests v && !(v & (v - 1)); but I don't think that's better since logical && has to check both sides for non-zero.