It it safe to assume that x64 builds can use TZCNT without checking its support through cpu flags?
No, certainly not! x86-64 was new in late 2003 (AMD K8), with only the legacy bsf
and bsr
bit-scan instructions, and none of the rest of BMI1.
The first Intel CPU to support BMI1 was Haswell in 2013. (Also introducing BMI2.)
The first AMD CPU to support BMI1 was Piledriver in 2012.
AMD ABM (Advanced Bit Manipulation) in K10 and later AMD CPUs only added popcnt
and lzcnt
, not tzcnt
.
Wikipedia Bit Manipulation Instruction Sets: Supporting CPUs. Note that Celeron/Pentium branded CPUs don't decode VEX prefixes, so they have AVX and BMI1/BMI2 disabled because BMI1 and 2 each include some VEX-coded instructions like andn
and blsr
. This sucks; BMI1/2 are most useful when compilers can use it everywhere throughout an executable for more efficient variable-count shifts, and peepholes, so still selling new CPUs without BMI1/2 is not getting us closer to being able to treat them as baseline like we do for P6 cmov
in 32-bit mode.
Since you mention tzcnt
specifically, its machine-code encoding is rep bsf
so older CPUs will execute it as BSF. This produces the same result as tzcnt
if the input is non-zero. i.e. tzcnt
"works" on all x86 CPUs (since 386) when the input is non-zero.
But when it is zero, tzcnt
would produce the operand-size (e.g. 64), but bsf
leaves the destination register unmodified. tzcnt
sets FLAGS based on the result, bsf
based on the input. AMD documents the dst-unmodified behaviour in their ISA reference manual. Intel only documents it as "undefined value" but implements the same behaviour as AMD, at least in existing CPUs.
(This is why bsf
/ bsr
have an output dependency on all CPUs, like add
. Unfortunately tzcnt
/ lzcnt
also have a false dependency on Intel Sandybridge-family before Skylake: Why does breaking the "output dependency" of LZCNT matter?. And why popcnt
does on SnB-family before Cannon / Ice Lake, because it shares the same execution unit.)
tzcnt
is significantly faster on AMD, so compilers tuning for "generic" or AMD CPUs will often use tzcnt
instead of bsf
without checking for CPU features.
e.g. for GNU C __builtin_ctz
. That intrinsic has undefined behaviour for input=0 so it's allowed to just use bsf
without checking for 0. And thus also allowed to use tzcnt
because the result in that case is not guaranteed by anything.
Why does TZCNT work for my Sandy Bridge processor?
No such backward / forward compat exists for lzcnt
. Having it decode as rep bsr
with the meaningless rep
prefix ignored would give you 31 - lzcnt(x)
, the bit-index. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/
One handy trick is ctz( x | 0x80000000 )
because OR is cheap1, and guarantees there's always a non-zero bit for bsf
to find. It doesn't change the result for any non-zero x
because it's the last bit bsf
will look at. This trick also works for __builtin_clz(x|1)
/ bsr
, where it's even better because or reg, imm8
is even shorter than imm32
.
Footnote 1: or reg, imm32
works for a 32-bit constant; bts reg,63
is less cheap on some CPUs to implement x|(1ULL<<63)
for a 64-bit input.