I have a program that makes heavy use of the intrinsic `_BitScanForward` / `_BitScanForward64` (a.k.a. count trailing zeros, TZCNT, CTZ). I would like to not use the intrinsic function but instead have the compiler emit the corresponding CPU instruction (available on Haswell and later). When using gcc or clang (where the intrinsic is called `__builtin_ctz`), I can achieve this by specifying either `-march=haswell` or `-mbmi2` as compiler flags.
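For illustration, this is the kind of wrapper I mean (the function name is mine):

```c++
// Sketch of the gcc/clang side. With BMI enabled (e.g. -march=haswell),
// __builtin_ctz compiles to a tzcnt instruction; without those flags,
// gcc falls back to rep bsf.
unsigned ctz(unsigned x) {
    return __builtin_ctz(x); // result is undefined for x == 0
}
```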
The documentation of `_BitScanForward` only specifies that the intrinsic is available on all architectures ("x86, ARM, x64, ARM64" or "x64, ARM64"), but I don't just want it to be available; I want to ensure it is compiled to the CPU instruction instead of the intrinsic function. I also checked `/Oi`, but that doesn't explain it either.
I also searched the web, but there are curiously few matches for my question; most results just explain how to use intrinsics, e.g. this question and this question.
Am I overthinking this, and will MSVC create code that magically uses the CPU instruction if the CPU supports it? Are any flags required? How can I ensure that the CPU instruction is used when available?
UPDATE
Here is what it looks like with Godbolt. Please be nice, my assembly reading skills are pretty basic.
GCC uses `tzcnt` with haswell/bmi2, otherwise it resorts to `rep bsf`.
MSVC uses `bsf` without `rep`.
I also found this useful answer, which states that `bsf` is not the same as `tzcnt`. However, MSVC doesn't appear to check for input == 0. This adds the question: why does `bsf` work for MSVC?
UPDATE
Okay, this was easy: I actually call `_BitScanForward` for MSVC. Doh!
UPDATE
So I added a bit of unnecessary confusion here. Ideally I would like to use an intrinsic `__tzcnt`, but that doesn't exist in MSVC, so I resorted to `_BitScanForward` plus an extra check to account for `0` input.
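For reference, a minimal sketch of such a wrapper (the name `ctz32` is mine); it mimics `tzcnt` semantics by returning the operand width for a zero input:

```c++
#include <intrin.h>

// tzcnt-like wrapper over _BitScanForward. _BitScanForward leaves the
// index undefined when the input is 0, so that case is handled
// explicitly, matching tzcnt's "return operand width" behavior.
inline unsigned ctz32(unsigned long value) {
    unsigned long index;
    if (_BitScanForward(&index, value))
        return static_cast<unsigned>(index);
    return 32; // tzcnt returns 32 for a zero 32-bit input
}
```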
However, MSVC supports LZCNT, where I have a similar issue (though it is used less in my code).
A slightly updated question would be: how does MSVC deal with LZCNT (instead of TZCNT)?
Answer: see here. Specifically: "On Intel processors that don't support the `lzcnt` instruction, the instruction byte encoding is executed as `bsr` (bit scan reverse). If code portability is a concern, consider use of the `_BitScanReverse` intrinsic instead."
The article suggests resorting to `bsr` if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they suggest identifying the CPU manually (via `__cpuid`) and then calling either `bsr` or `lzcnt`.
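Per those docs, the runtime check looks roughly like this (a sketch; the helper name `has_lzcnt` is mine):

```c++
#include <intrin.h>

// CPUID with InfoType 0x80000001: bit 5 of ECX (regs[2]) reports
// LZCNT support, as described in the MSVC __lzcnt documentation.
inline bool has_lzcnt() {
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid(regs, 0x80000001);
    return (regs[2] & (1 << 5)) != 0;
}
```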
In short, MSVC has no support for targeting specific CPU microarchitectures (beyond x86/x64/ARM).
There's no way to target Haswell specifically, but there is a way to assume AVX2 availability (`/arch:AVX2`), which also assumes `tzcnt` availability and the other BMI1 and BMI2 instructions, along with FMA3. And there is a way to tune for a set of architectures (the `/favor` option; unfortunately only three broad groups are available: common Intel, common AMD, and Intel Atom).
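As a concrete compile line (the source file name is made up):

```
cl /O2 /arch:AVX2 /favor:INTEL64 program.cpp
```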
There's no reason why `_BitScanReverse` wouldn't generate `lzcnt` under `/arch:AVX2`, other than a missed optimization opportunity. For example, the `_mm_set1_*` intrinsics do generate different code under different `/arch` options. Even explicitly named intrinsics are not always compiled to the instruction they imply: `_mm_load_si128` may be a no-op (fused with a subsequent operation that uses a memory operand instead of a register), and `_mm_extract_pi16` with a zero argument may be just a `movd`.
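A small illustration of the load fusion (compiler- and flag-dependent, so treat it as a sketch):

```c++
#include <emmintrin.h>

// With optimizations on, the explicit load usually emits no movdqa of
// its own; it folds into the memory operand of the add (e.g. a single
// "vpaddd xmm0, xmm1, [rcx]" under /arch:AVX2).
__m128i add_from_memory(const __m128i* p, __m128i x) {
    __m128i v = _mm_load_si128(p); // may not produce its own instruction
    return _mm_add_epi32(v, x);    // load fused into this operation
}
```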
I've created a Developer Community issue about this missed optimization. For now, you can use the `<bit>` functions, which do runtime detection by default but use `lzcnt` / `tzcnt` unconditionally with `/arch:AVX2`.
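For example, with C++20 (`/std:c++20` or later):

```c++
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t x = 0x8;
    // countr_zero has tzcnt semantics (returns 32 for x == 0),
    // countl_zero has lzcnt semantics.
    std::printf("%d %d\n", std::countr_zero(x), std::countl_zero(x)); // 3 28
}
```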
(There's virtually no way to avoid using AVX2 and BMI instructions with `/arch:AVX2` in an optimized build, as there's always some room for auto-vectorization, so `_BitScanReverse` cannot be "portable to legacy processors" with `/arch:AVX2` and above anyway.)