I need to get the index of the 1-bit in a 32-bit number that always has exactly one bit set. What's the fastest way in C++ or asm?
For example:
input: `0x00000001`, `0x10000000`
output: `0`, `28`
In C++20, `#include <bit>` and use `std::countr_zero(x)` (cppreference). Compile with options that let or encourage compilers to use `tzcnt`, such as `-march=x86-64-v3`.

For earlier C++, and for what's efficient in asm, see the rest of this answer.
In GNU C (`#ifdef __GNUC__`), use `__builtin_ctz(unsigned)` to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs with no native instruction, it will call a GCC helper function.)
Leading vs. trailing is relative to printing order, MSB-first: 8-bit binary `00000010` has 6 leading zeros and one trailing zero. (When zero-extended to 32 bits, it has 24 + 6 = 30 leading zeros.)
For 64-bit integers, use `__builtin_ctzll(unsigned long long)`. It's unfortunate that the GNU C bitscan builtins don't take fixed-width types (especially the leading-zeros versions), but `unsigned` is always 32-bit in GNU C for x86 (although not for AVR or MSP430), and `unsigned long long` is always `uint64_t` on all GNU C targets I'm aware of.
On x86, it will compile to `bsf` or `tzcnt` depending on tuning + target options. `tzcnt` is a single uop with 3-cycle latency on modern Intel, and only 2 uops with 2-cycle latency on AMD (perhaps a bit-reverse to feed an `lzcnt` uop?). See https://agner.org/optimize/ and https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++ — about the same cost as `x * 1234567`. (On Intel CPUs, `bsf`/`tzcnt` has the same cost as `imul r, r, imm` in front-end uops, back-end port, and latency.)
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks in case it might run as `bsf`.
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like `_mm_tzcnt_32` from `immintrin.h` (Intel intrinsics guide). Or you might need to include `intrin.h` (MSVC) or `x86intrin.h` for non-SIMD intrinsics.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC also has `_BitScanForward` / `_BitScanReverse` for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
TZCNT decodes as BSF on CPUs without BMI1 because its machine-code encoding is `rep bsf`. They give identical results for non-zero inputs, so compilers can and do always just use `tzcnt`, because that's much faster on AMD. (They're the same speed on Intel, so no downside. And on Skylake and later, `tzcnt` has no false output dependency; BSF does, because it leaves its output unmodified for input=0.)
(The situation is less convenient for `bsr` vs. `lzcnt`: `bsr` returns the bit-index, `lzcnt` returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM, so the compiler can use `lzcnt`.)
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So `31 - lzcnt = bsr` is the same in this case as `bsf = tzcnt`. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
Related: `ffs()`, which returns a 1-based index and has to do extra work to account for the possibility of the input being 0. Compilers do recognize `ffs()` and inline it like a builtin (as they do for `memcpy` or `sqrt`), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.