I need to get the index of the 1-bit in a 32-bit number that always has exactly one bit set. What's the fastest way in C++ or asm?
For example:
input: `0x00000001`, `0x10000000`
output: `0`, `28`
In C++20, `#include <bit>` and use `std::countr_zero(x)` (cppreference). Compile with options that let or encourage compilers to use `tzcnt`, such as `-march=x86-64-v3`.

For earlier C++, and for what's efficient in asm, see the rest of this answer.
In GNU C (`#ifdef __GNUC__`), use `__builtin_ctz(unsigned)` to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs with no native instruction, it will call a GCC helper function.)
Leading vs. trailing is relative to printing order, MSB-first: 8-bit binary `00000010` has 6 leading zeros and one trailing zero. (When zero-extended to 32 bits, it has 24 + 6 = 30 leading zeros.)
For 64-bit integers, use `__builtin_ctzll(unsigned long long)`. It's unfortunate that the GNU C bitscan builtins don't take fixed-width types (especially the leading-zeros versions), but `unsigned` is always 32-bit in GNU C for x86 (although not for AVR or MSP430), and `unsigned long long` is always `uint64_t` on all GNU C targets I'm aware of.
On x86, it will compile to `bsf` or `tzcnt` depending on tuning + target options. `tzcnt` is a single uop with 3-cycle latency on modern Intel, and only 2 uops with 2-cycle latency on AMD (perhaps a bit-reverse to feed an `lzcnt` uop?). See https://agner.org/optimize/ and https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++ — about the same cost as `x * 1234567`. (On Intel CPUs, `bsf`/`tzcnt` has the same cost as `imul r, r, imm` in front-end uops, back-end port, and latency.)
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks in case it might run as `bsf`.
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like `_mm_tzcnt_32` from `immintrin.h` (Intel intrinsics guide). Or you might need to include `intrin.h` (MSVC) or `x86intrin.h` for non-SIMD intrinsics.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC also has `_BitScanForward` / `_BitScanReverse` for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
TZCNT decodes as BSF on CPUs without BMI1 because its machine-code encoding is `rep bsf`. They give identical results for non-zero inputs, so compilers can and do always just use `tzcnt`, because that's much faster on AMD. (They're the same speed on Intel, so no downside. And on Skylake and later, `tzcnt` has no false output dependency; BSF does, because it leaves its output unmodified for input=0.)
(The situation is less convenient for `bsr` vs. `lzcnt`: `bsr` returns the bit-index, `lzcnt` returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM, so the compiler can use `lzcnt`.)
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So `31 - lzcnt = bsr` is the same in this case as `bsf = tzcnt`. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
Related: `ffs()`, which returns a 1-based index and has to do extra work to account for the possibility of the input being 0. Compilers do recognize `ffs()` and inline it like a builtin (as they do for `memcpy` or `sqrt`), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.