Tags: c++, visual-c++, x86, compiler-optimization, intrinsics

How do I specify Haswell as the target CPU/architecture for MSVC (Visual Studio)?


I have a program that makes heavy use of the intrinsics _BitScanForward / _BitScanForward64 (a.k.a. count trailing zeros, TZCNT, CTZ). I would like to not rely on the intrinsic's generic code but instead use the corresponding CPU instruction (available on Haswell and later).

When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags.
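
For illustration, here is a minimal version of what I compile on the gcc/clang side (the wrapper name is just a placeholder):

    #include <cstdint>

    // __builtin_ctz is undefined for an input of 0, hence the explicit check.
    unsigned trailing_zeros(std::uint32_t x)
    {
        return x == 0 ? 32u : static_cast<unsigned>(__builtin_ctz(x));
    }

With -O2 -march=haswell this should compile to a single tzcnt; without BMI enabled, gcc/clang fall back to a bsf-based sequence.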

The documentation of _BitScanForward only specifies that the intrinsic is available on the architectures "x86, ARM, x64, ARM64" or "x64, ARM64". But I don't just want it to be available; I want to ensure it is compiled down to the CPU instruction rather than a call to an intrinsic function. I also checked /Oi, but that doesn't answer the question either.

I also searched the web, but there are curiously few matches for my question; most just explain how to use intrinsics, e.g. this question and this question.

Am I overthinking this, and will MSVC magically create code that uses the CPU instruction if the CPU supports it? Are any flags required? How can I ensure that the CPU instruction is used when available?

UPDATE

Here is what it looks like with Godbolt. Please be nice; my assembly-reading skills are pretty basic.

GCC uses tzcnt with haswell/bmi2 and otherwise resorts to rep bsf; MSVC uses bsf without rep.

I also found this useful answer, which states that:

  • "Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
  • It explains (as I already knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0

This raises another question: why does bsf work for MSVC?

UPDATE

Okay, this was easy: I actually call _BitScanForward for MSVC. Doh!

UPDATE

So I added a bit of unnecessary confusion here. Ideally I would like to use a __tzcnt intrinsic, but that doesn't exist in MSVC, so I resorted to _BitScanForward plus an extra check to account for an input of 0.
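
For clarity, this is roughly what my MSVC fallback looks like (just a sketch; the wrapper name is a placeholder):

    #include <intrin.h>
    #include <cstdint>

    // Trailing-zero count built on _BitScanForward. The intrinsic returns 0
    // when the input is 0 (the index is then undefined), so that case is
    // handled explicitly.
    unsigned trailing_zeros(std::uint32_t x)
    {
        unsigned long index;
        return _BitScanForward(&index, x) ? static_cast<unsigned>(index) : 32u;
    }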

However, MSVC does support LZCNT (via the __lzcnt intrinsic), and I have a similar issue there (though it is used less in my code).

A slightly updated question would be: how does MSVC deal with LZCNT (instead of TZCNT)?

Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."

The article suggests falling back to bsr (via _BitScanReverse) if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they expect you to identify the CPU manually (e.g. via __cpuid) and then call either bsr or lzcnt.
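
For what it's worth, the manual detection they hint at would look roughly like this (a sketch; the bit position comes from the CPUID documentation, not from the linked article):

    #include <intrin.h>

    // CPUID leaf 0x80000001: ECX bit 5 reports LZCNT (ABM) support.
    bool cpu_has_lzcnt()
    {
        int regs[4] = {};              // EAX, EBX, ECX, EDX
        __cpuid(regs, 0x80000000);     // highest supported extended leaf
        if (static_cast<unsigned>(regs[0]) < 0x80000001u)
            return false;
        __cpuid(regs, 0x80000001);
        return (regs[2] & (1 << 5)) != 0;
    }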

In short, MSVC has no support for targeting specific CPU micro-architectures (beyond the basic x86/x64/ARM choice).


Solution

  • There's no way to target Haswell specifically, but there is a way to assume AVX2 availability (/arch:AVX2), which also implies availability of tzcnt and the other BMI1 and BMI2 instructions, along with FMA3. And there's a way to tune for a broad set of architectures (the /favor option; unfortunately only three broad groups are available: common Intel, common AMD, Intel Atom).

    There's no reason why _BitScanForward couldn't generate tzcnt under /arch:AVX2; that it doesn't is just a missed optimization opportunity. For example, the _mm_set1_* intrinsics do generate different code under different /arch options. Even intrinsics that explicitly name an instruction are not always compiled to the instruction they imply: _mm_load_si128 may be a no-op (fused with a subsequent operation that uses a memory operand instead of a register), and _mm_extract_pi16 with a zero argument may compile to just a movd.

    I've created a Developer Community issue about this missed optimization. For now, you can use the <bit> functions, which do runtime detection by default but use lzcnt / tzcnt unconditionally with /arch:AVX2 (see the sketch after this answer).

    (There's virtually no way to avoid using AVX2 and BMI instructions with /arch:AVX2 in an optimized build, as there's always some room for auto-vectorization, so _BitScanReverse cannot be "portable to legacy processors" with /arch:AVX2 and above anyway)
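
    A minimal sketch of the <bit> route mentioned above (C++20, so /std:c++20 or later; the function names are just placeholders):

        #include <bit>
        #include <cstdint>

        // std::countr_zero / std::countl_zero are well defined for an input of 0
        // (they return the bit width), so no manual zero check is needed.
        int trailing_zeros(std::uint32_t x) { return std::countr_zero(x); }
        int leading_zeros(std::uint32_t x)  { return std::countl_zero(x); }

    Built with /O2 /arch:AVX2 these should compile straight to tzcnt / lzcnt; without /arch:AVX2, MSVC's implementation falls back to the runtime dispatch described above.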