Tags: c, gcc, bit-manipulation, intrinsics, instruction-set

How to enable intrinsic functions from the preprocessor


I can find the n-th set bit of a 16-bit value with a lookup table, but that can't be done for a 32-bit value without breaking it up and using several LUTs. This is discussed in "How to efficiently find the n-th set bit?", which suggests something similar to this:

#include <immintrin.h>
//...
unsigned i = 0x6B5;                 // 11010110101
unsigned n = 4;                     //    ^
unsigned j = _pdep_u32(1u << n, i);
int bitnum = __builtin_ctz(j);      // result is bit 7

I have profiled this suggested loop method

j = i;
for (k = 0; k < n; k++)
    j &= j - 1;
return (__builtin_ctz(j));

against the two variants of the bit-twiddling code in an answer from Nominal Animal. The branching variant of the twiddle was the fastest of the three code versions, though not by much. However, in my actual code, the loop snippet above using __builtin_ctz was faster. There are other answers, such as the one from fuz, which suggest what I found too: the bit-twiddles take about as long as the loop method.
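
For reference, a self-contained version of the loop method might look like the sketch below. The function name and the -1 return for "no such bit" are illustrative choices of mine, not part of the linked answers; the guard is there because __builtin_ctz(0) is undefined.

```c
#include <stdint.h>

/* Index of the n-th (0-based) set bit of value, or -1 if value has
 * n or fewer set bits. Name and -1 convention are illustrative only. */
int nth_set_bit_loop(uint32_t value, unsigned n)
{
    uint32_t j = value;
    for (unsigned k = 0; k < n && j != 0; k++)
        j &= j - 1;               /* clear the lowest set bit */
    /* __builtin_ctz(0) is undefined, so handle an empty result explicitly */
    return j ? __builtin_ctz(j) : -1;
}
```

With the value from the first snippet, nth_set_bit_loop(0x6B5, 4) gives 7.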

So I now want to try using _pdep_u32, but I am unable to get it recognised. I read in the GCC documentation (gcc.gnu.org):

The following built-in functions are available when -mbmi2 is used.
...
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
...

However I am using an online compiler and don't have access to the options.

Is there a way to enable options with the preprocessor?

There is plenty of material about controlling the preprocessor from command line options, but I can't find how to do it the other way round.

Ultimately I want to use the 64-bit version _pdep_u64.


Solution

  • Other than inline asm, gcc/clang will never emit an asm instruction that isn't enabled via a command line option, function attribute, or pragma.

    If you can't use GCC target options on the command line like a normal person (e.g. -march=native to enable everything supported by the compile host, and tune for that machine), you can instead use a pragma to set target-specific options. It doesn't work very well, though; it seems #pragma GCC target("arch=skylake") breaks GCC's immintrin.h (if you put the pragma before the include), apparently trying to compile AVX512 functions without AVX512 enabled. The GCC docs do show examples of using "arch=core2" as a function attribute, or even "sse4.1,arch=core2".

    You can set target("bmi2,tune=skylake") though, without breaking anything (before or after #include <immintrin.h>). Unlike arch=, tune= doesn't enable ISA extensions; it just affects code-gen choices.

    #include <immintrin.h>
    // #pragma GCC target ("arch=haswell")  // also broken, disables BMI2??
    
    #pragma GCC target ("bmi2,tune=skylake") // this can be after immintrin.h
    
    int foo(unsigned x) {
      // unsigned i = 0x6B5;                 // 11010110101
      unsigned i = x;
      unsigned n = 4;                        //    ^
      unsigned j = _pdep_u32(1u << n, i);
      int bitnum = __builtin_ctz(j);     // result is bit 7
      return bitnum;
    }
    
    int constprop() {  // check that GCC can still "understand" the intrinsic
      return foo(0x6B5);
    }
    

    This compiles cleanly on Godbolt with only -O3, no -m options:

    foo:
            mov     r8d, edi
            mov     edi, 16
            pdep    edi, edi, r8d
            bsf     eax, edi
            ret
    
    constprop:
            mov     eax, 7
            ret
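
The function-attribute route mentioned at the top can be sketched like this, assuming GCC or clang targeting x86. It enables BMI2 for one function only instead of for everything after a pragma. The function name and the -1 "not found" convention are hypothetical, added for illustration.

```c
#include <immintrin.h>

/* Only this function is compiled with BMI2 enabled; the rest of the
 * file keeps whatever target the command line selected. */
__attribute__((target("bmi2")))
int nth_set_bit_pdep32(unsigned value, unsigned n)
{
    if (n >= 32)
        return -1;                      /* 1u << n would be undefined */
    unsigned j = _pdep_u32(1u << n, value);
    return j ? __builtin_ctz(j) : -1;   /* j == 0: value has n or fewer set bits */
}
```

Callers don't need the attribute themselves, but the compiler won't inline this function into a caller that lacks BMI2, and it still faults on a CPU without the instruction.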
    

    If you ever have trouble getting immintrin.h to define the right wrapper, you can look in GCC's headers to find out how they're defined. e.g. _pdep_u32 is a wrapper for __builtin_ia32_pdep_si (single integer) and u64 is a wrapper for __builtin_ia32_pdep_di (double integer).

    These __builtin functions are always defined by GCC even with no headers, but these ia32 builtins can only be used in functions with compatible target options. There isn't a fallback like with __builtin_popcount for targets that don't support the instruction.
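
As a rough illustration of that point, the builtin can be called with no header at all, as long as the enclosing function has a compatible target. This is a sketch assuming GCC or clang on x86; the wrapper name is made up, and the _pdep_u32 spelling remains the more portable one since other compilers don't provide the __builtin_ia32 names.

```c
/* No #include <immintrin.h>: the compiler itself knows this builtin.
 * The target attribute is what makes the call legal here. */
__attribute__((target("bmi2")))
unsigned pdep32_no_header(unsigned src, unsigned mask)
{
    return __builtin_ia32_pdep_si(src, mask);
}
```

Dropping the attribute (with no -mbmi2 on the command line) should turn this into a compile-time error rather than a silent fallback.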

    But it seems that for BMI2 instructions at least, _pdep_u32 and _pdep_u64 get defined by GCC's immintrin.h regardless of command line options (which would define a __BMI2__ CPP macro). Perhaps the reason is this use case: single functions with a target attribute, or a block of functions with a pragma.

    The _pdep_u64 version requires BMI2 + x86-64 while the 32-bit version only requires BMI2 (i.e. is available in 32-bit mode as well). pdep is a scalar integer instruction so 64-bit operand-size requires 64-bit mode.
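
For the 64-bit version the question ultimately wants, a minimal sketch could look like the following, assuming GCC or clang building for x86-64. The function name and the -1 convention are again hypothetical. The #if guard reflects that _pdep_u64 isn't available in 32-bit builds.

```c
#include <immintrin.h>
#include <stdint.h>

#if defined(__x86_64__)
/* 64-bit variant: needs BMI2 and a 64-bit build. */
__attribute__((target("bmi2")))
int nth_set_bit_pdep64(uint64_t value, unsigned n)
{
    if (n >= 64)
        return -1;                        /* shift count would be out of range */
    uint64_t j = _pdep_u64(UINT64_C(1) << n, value);
    return j ? __builtin_ctzll(j) : -1;   /* note ctzll for 64-bit operands */
}
#endif
```

Note the switch to __builtin_ctzll; plain __builtin_ctz takes an unsigned int and would silently drop the upper 32 bits.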


    Aside: MSVC compiled the intrinsic without fuss

    Yeah, MSVC and GCC/clang have different philosophies about intrinsics. MSVC lets you use whatever, under the assumption that you'll do runtime dispatching to make sure the execution never reaches an intrinsic that isn't supported on this machine.

    This might be related to MSVC essentially not optimizing intrinsics at all, often not even doing simple constant-propagation through them. I guess that's because it's trying to make sure the corresponding asm instruction never executes when the intrinsic isn't reached in the C++ abstract machine, so it takes the slow and safe approach.

    but the program crashed (I suppose _pdep_u32 is not supported by my cpu as suggested in the linked question).

    Yup, not all CPUs have BMI2. Nothing you can do with GCC will change that.
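
What you can do is check at run time and fall back to the loop on CPUs without BMI2. Below is a minimal sketch of that kind of dispatch, assuming GCC or clang and their __builtin_cpu_supports builtin. All function names are hypothetical, and the -1 "not found" return is my own convention.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* BMI2 path: the only function compiled with pdep enabled. */
__attribute__((target("bmi2")))
static int nth_set_bit_bmi2(uint32_t value, unsigned n)
{
    uint32_t j = _pdep_u32(UINT32_C(1) << n, value);
    return j ? __builtin_ctz(j) : -1;
}
#endif

/* Portable path: clear the n lowest set bits, then count trailing zeros. */
static int nth_set_bit_portable(uint32_t value, unsigned n)
{
    for (unsigned k = 0; k < n && value != 0; k++)
        value &= value - 1;
    return value ? __builtin_ctz(value) : -1;
}

/* Entry point: picks a path at run time, so pdep is never executed
 * on a CPU that lacks it. */
int nth_set_bit(uint32_t value, unsigned n)
{
    if (n >= 32)
        return -1;
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("bmi2"))
        return nth_set_bit_bmi2(value, n);
#endif
    return nth_set_bit_portable(value, n);
}
```

Checking the CPU on every call has a cost of its own, so a hot loop would usually hoist the check or dispatch once through a function pointer.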