Tags: visual-c++, x86-64, avx2, clang-cl

Are there compatibility issues with clang-cl and /arch:AVX2?


I'm using Windows 10, Visual Studio 2019, Platform: x64, and I have the following test program in a single-file Visual Studio solution:

#include <iostream>
#include <intrin.h>
using namespace std;

int main() {
    unsigned __int64 mask = 0x0fffffffffffffff; //1152921504606846975;
    unsigned long index;

    _BitScanReverse64(&index, mask); // highest set bit of mask is bit 59
    if (index != 59) {
        cout << "Fails!" << endl;
        return EXIT_FAILURE;
    }
    else {
        cout << "Success!" << endl;
        return EXIT_SUCCESS;
    }
}

In my project's property pages I've set 'Enable Enhanced Instruction Set' to 'Advanced Vector Extensions 2 (/arch:AVX2)'. When compiling with MSVC (setting 'Platform Toolset' to 'Visual Studio 2019 (v142)') the code returns EXIT_SUCCESS, but when compiling with clang-cl (setting 'Platform Toolset' to 'LLVM (clang-cl)') I get EXIT_FAILURE. When debugging the clang-cl run, the value of index is 4 when it should be 59. This suggests to me that clang-cl is reading the bits in the opposite direction from MSVC.

This isn't the case when I set 'Enable Enhanced Instruction Set' to 'Not Set'. In this scenario, both MSVC and clang-cl return EXIT_SUCCESS.

All of the DLLs loaded and shown in the Debug Output window come from C:\Windows\System32 in all cases.

Does anyone understand this behavior? I would appreciate any insight here.

EDIT: I failed to mention earlier: I'm compiling on an Intel Core i7-3930K CPU @ 3.20GHz.


Solution

  • Getting 4 instead of 59 sounds like clang implemented _BitScanReverse64 as 63 - lzcnt. Actual bsr is slow on AMD, so yes, there are reasons why a compiler would want to compile a BSR intrinsic to a different instruction.

    But then you ran the executable on a computer that doesn't actually support BMI, so lzcnt decoded as rep bsr = bsr, giving the bit-index of the highest set bit instead of the leading-zero count.
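
    To make the arithmetic concrete, here's a sketch (using C++20 <bit>; my addition, not from the original post) of what each lowering computes for this mask:

    #include <bit>
    #include <cstdint>

    constexpr std::uint64_t mask = 0x0fffffffffffffff; // highest set bit: 59, leading zeros: 4

    // What clang intends: index = 63 - lzcnt(mask) = 63 - 4 = 59
    static_assert(63 - std::countl_zero(mask) == 59);

    // What ran on the non-BMI CPU: the lzcnt opcode decoded as bsr, which
    // returns the bit-index (59) instead of the leading-zero count, so
    // index = 63 - 59 = 4, exactly the wrong value observed.
    static_assert(63 - 59 == 4);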

    AFAIK, all CPUs that have AVX2 also have BMI. If your CPU doesn't have that, you shouldn't expect executables built with /arch:AVX2 to run correctly on it. And in this case the failure mode wasn't an illegal instruction; it was lzcnt running as bsr.

    MSVC doesn't generally optimize intrinsics, apparently including this case, so it just uses bsr directly.


    Update: the i7-3930K is Sandy Bridge-E. It doesn't have AVX2, so that explains your results.

    clang-cl doesn't error when you tell it to build an AVX2 executable on a non-AVX2 computer. The use-case for that would be compiling on one machine to create an executable to run on different machines.

    It also doesn't add CPUID-checking code to your executable for you. If you want that, write it yourself. This is C++; it doesn't hold your hand.
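
    If you do want a guard, here's a minimal sketch using the __cpuid/__cpuidex intrinsics from <intrin.h> (available in both MSVC and clang-cl). The bit positions are the documented CPUID leaf-7 feature bits; a production check should also verify OSXSAVE/XGETBV before trusting AVX2:

    #include <intrin.h>
    #include <iostream>

    // CPUID leaf 7, subleaf 0, EBX: bit 3 = BMI1, bit 5 = AVX2, bit 8 = BMI2.
    bool cpu_has_avx2_and_bmi() {
        int regs[4]; // EAX, EBX, ECX, EDX
        __cpuid(regs, 0);
        if (regs[0] < 7) return false;   // CPU too old to even report leaf 7
        __cpuidex(regs, 7, 0);
        bool bmi1 = regs[1] & (1 << 3);
        bool avx2 = regs[1] & (1 << 5);
        bool bmi2 = regs[1] & (1 << 8);
        return avx2 && bmi1 && bmi2;
    }

    int main() {
        if (!cpu_has_avx2_and_bmi()) {
            std::cerr << "This build needs AVX2/BMI, which this CPU lacks.\n";
            return 1;
        }
        // ... only now dispatch into code built with /arch:AVX2 ...
    }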


    Target CPU options

    MSVC-style /arch options are much more limited than normal GCC/clang style. There aren't any for different levels of SSE like SSE4.1; it jumps straight to AVX.

    Also, /arch:AVX2 apparently implies BMI1/2, even though those are different instruction sets with different CPUID feature bits. In kernel code, for example, you might want the integer BMI instructions but not SIMD instructions that touch XMM/YMM registers.

    clang -O3 -mavx2 would not also enable -mbmi. You normally would want that, but if you failed to also enable BMI then clang would have been stuck using bsr (which is actually better for Intel CPUs than 63 - lzcnt). I think MSVC's /arch:AVX2 is something like -march=haswell, if it also enables FMA instructions.
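
    As an aside (my addition, not part of the original answer): if C++20 is available, std::bit_width / std::countl_zero express this operation portably, letting the compiler choose bsr or lzcnt based on whatever target flags are actually in effect:

    #include <bit>
    #include <cstdint>

    // Portable highest-set-bit index; like _BitScanReverse64, only
    // meaningful for nonzero x. bit_index_of_msb is a hypothetical
    // helper name, not a standard function.
    constexpr unsigned long bit_index_of_msb(std::uint64_t x) {
        return static_cast<unsigned long>(std::bit_width(x) - 1);
    }

    static_assert(bit_index_of_msb(0x0fffffffffffffff) == 59);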

    And nothing in MSVC has any support for making binaries optimized to run on the computer you build them on. That makes sense: it's designed for a closed-source binary-distribution model of software development.

    But GCC and clang have -march=native to enable all the instruction sets your computer supports, and, just as importantly, to set tuning options appropriate for your computer: e.g. don't worry about making code that would be slow on an AMD CPU or on older Intel; just make asm that's good for your CPU.

    TL;DR: CPU selection options in clang-cl are very coarse, lumping non-SIMD extensions in with some level of AVX. That's why /arch:AVX2 enabled the integer BMI extensions, while clang -mavx2 would not have.