Tags: assembly, x86-64, avx2, bmi

Do all CPUs that support AVX2 also support BMI2 or popcnt?


From here, I learned that AVX support doesn't imply BMI1 support. So how about AVX2: do all CPUs that support AVX2 also support BMI2? Further, does AVX2 support imply popcnt support?

I searched all over Google and could not find a definitive answer. The closest thing I found is "Does AVX support imply BMI1 support?".


Solution

  • You should check for all the CPU features you actually depend on, just in case of future weird CPUs or VMs, or (unlikely) features disabled due to CPU bugs and microcode updates. But if you're wondering whether to write two AVX2 versions of your function, one with and one without BMI1/2 instructions: no, unless the difference is with/without pdep/pext. Checking for BMI2 as well won't stop any real CPUs from running your AVX2 version.

    All real hardware with AVX2 has also had BMI2

    AMD Zen 2 and earlier have unusably slow pdep/pext, so you'll want to check for those CPU models instead of availability of BMI2 if you're doing CPU detection to set up function pointers, for functions that use either instruction inside loops. Other BMI2 instructions are fine if supported.

    Almost all AVX2 hardware has FMA as well, but not quite (see Footnote 1).

    BMI1/2 and FMA3 are part of the -march=x86-64-v3 feature level (essentially Haswell, but without TSX, AES-NI, rdrand and some other stuff): https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels

    MSVC's /arch:AVX2 is like GCC/Clang -march=x86-64-v3, also enabling FMA3 and BMI1/2.


    It's fairly likely all future CPUs will have both AVX2+BMI2, or neither, at least in commercially-relevant mainstream CPUs, although fast pdep and pext do need a significant number of transistors for an execution unit separate from anything needed for any other instruction (a bitwise version of AVX-512 vpcompressb/vpexpandb), or else slow microcode.

    AVX2 and BMI2 have separate feature bits, so an emulator or VM could disable BMI2 while leaving AVX2 enabled; it's a good idea to check both. (Also check that the OS has enabled AVX: run xgetbv after using CPUID to check that xgetbv is supported.) An emulator might even fault if you try to run BMI2 instructions. (A VM can't do that: unlike for SSE/AVX/AVX-512, there's no control-register bit that will make the CPU hardware fault on BMI2 instructions it normally supports.)
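As a rough illustration of checking both feature bits plus OS support, here is a minimal sketch in C. The function names are made up for this example. It assumes GCC or Clang on x86 for the actual query (`__get_cpuid` / `__get_cpuid_count` from `<cpuid.h>` and inline-asm `xgetbv`); on other targets the query function simply returns false. The bit positions used are CPUID leaf 1 ECX bit 27 (OSXSAVE) and bit 28 (AVX), leaf 7 subleaf 0 EBX bit 5 (AVX2) and bit 8 (BMI2), and XCR0 bits 1 and 2 (SSE and AVX state).

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Decode raw register values; kept separate so the logic is testable anywhere. */
static bool os_enabled_avx(uint32_t leaf1_ecx, uint64_t xcr0)
{
    bool osxsave = (leaf1_ecx >> 27) & 1;   /* OS uses XSAVE, so xgetbv is usable */
    bool avx     = (leaf1_ecx >> 28) & 1;   /* CPU supports AVX */
    /* XCR0 bits 1 and 2: OS saves/restores XMM and YMM state */
    return osxsave && avx && (xcr0 & 0x6) == 0x6;
}

static bool avx2_and_bmi2(uint32_t leaf7_ebx)
{
    bool avx2 = (leaf7_ebx >> 5) & 1;
    bool bmi2 = (leaf7_ebx >> 8) & 1;
    return avx2 && bmi2;
}

/* Hypothetical helper: true if the AVX2+BMI2 code path may be used. */
bool can_use_avx2_bmi2(void)
{
#ifdef HAVE_X86_CPUID
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    uint32_t leaf1_ecx = ecx;
    if (!((leaf1_ecx >> 27) & 1))
        return false;                       /* xgetbv would fault without OSXSAVE */

    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    uint64_t xcr0 = ((uint64_t)hi << 32) | lo;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return os_enabled_avx(leaf1_ecx, xcr0) && avx2_and_bmi2(ebx);
#else
    return false;                           /* not x86, or unknown compiler */
#endif
}
```

A dispatcher would call `can_use_avx2_bmi2()` once at startup and pick a function pointer based on the result.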

    You don't need a separate AVX2-without-BMI2 version of your functions, unless you wanted to use pdep/pext inside a loop. If someone sets up a weird emulator or VM that stops your code from using its AVX2 functions because it lacks BMI2, that's their problem, and is unlikely to happen by accident.

    CPUs so far

    • Intel Haswell: introduced AVX2 and BMI2. (Also Intel's first BMI1 CPU).
    • Intel Gracemont (Alder Lake E-cores): AVX2 and BMI2. First low-power Silvermont-family core with AVX1 or BMI1.
    • AMD Excavator: AMD's first AVX2 CPU was also their first BMI2 CPU. (With horribly slow microcoded pdep / pext)
    • AMD Zen 3: the first AMD with usable pdep / pext (same as Intel, 1 uop with 3c latency, 1c throughput).
    • VIA Nano C QuadCore C4650 (Isaiah) from 2015: AVX2 + BMI2. (Notably without FMA3; see Footnote 1.) I think this was VIA's first AVX2 CPU.
    • ZHAOXIN KaiXian ZX-C+ C4580: AVX2 + BMI2 (slow pdep / pext, but maybe not as bad as AMD? InstLatx64 doesn't say what inputs they tested with, and this might just be a very special case like 0). Based on VIA Nano C.
    • Centaur CNS: AVX512, AVX2, BMI2 (fast pdep/pext)

    Unusably slow pdep / pext on AMD Zen 2 and earlier

    AMD CPUs before Zen 3 (so Excavator, Zen 1, and Zen 2) have disastrously slow pdep and pext where the number of uops depends on the data, e.g. https://uops.info/ measured 64-bit pext at 133 uops on Zen 1 and Zen 2, with a throughput of one per 52 cycles.
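Checking for the CPU model rather than just the BMI2 bit could look something like the sketch below. It decodes raw CPUID register values (leaf 0 vendor string, leaf 1 EAX signature, leaf 7 EBX) passed in by the caller. The function names are invented for this example, and it relies on family numbers that aren't stated above: my understanding is that Excavator is family 15h, Zen 1 / Zen+ / Zen 2 are family 17h, and Zen 3 starts at family 19h, so double-check those before relying on this. It only handles AMD; other vendors with slow pdep/pext (such as the Zhaoxin part listed earlier) would need their own checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX: "Auth" "enti" "cAMD". */
static bool vendor_is_amd(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    return ebx == 0x68747541u && edx == 0x69746e65u && ecx == 0x444d4163u;
}

/* Display family from CPUID leaf 1 EAX: base family, plus extended family
   when the base family field is 0xF. */
static unsigned display_family(uint32_t leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xF;
    unsigned ext  = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Hypothetical policy: use pdep/pext in loops only if BMI2 is present and the
   CPU is not an AMD family older than 19h (assumed to be Zen 3). */
static bool pdep_pext_is_fast(uint32_t vend_ebx, uint32_t vend_ecx,
                              uint32_t vend_edx, uint32_t leaf1_eax,
                              uint32_t leaf7_ebx)
{
    if (!((leaf7_ebx >> 8) & 1))
        return false;                       /* no BMI2 at all */
    if (vendor_is_amd(vend_ebx, vend_ecx, vend_edx) &&
        display_family(leaf1_eax) < 0x19)
        return false;                       /* microcoded pdep/pext */
    return true;
}
```

Code that only uses other BMI2 instructions (bzhi, shlx, mulx, ...) can ignore this and just test the feature bit.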

    All other BMI/BMI2 instructions are fast on CPUs that support them, at most 2 uops for stuff like blsr on AMD before Zen 4, or single-uop on Intel.

    See also What is a fast fallback algorithm which emulates PDEP and PEXT in software? re: options for fallbacks. If you were using it with a constant mask as a way to avoid some shift/OR work, just don't unless you also make a version tuned for AVX2-without-fast-pdep for such CPUs, or if you don't care much about non-current CPUs. (e.g. you know what cloud servers you'll run on.)
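For reference, the simplest possible software fallback is a bit-at-a-time loop over the set bits of the mask, sketched below in portable C. The names `pext_soft` and `pdep_soft` are made up here. This is only the naive version (one iteration per set mask bit), not one of the faster approaches from the linked question, and for a constant mask a few hand-written shift/AND/OR steps will usually beat it.

```c
#include <stdint.h>

/* Gather the bits of src selected by mask into the low bits of the result
   (same semantics as the pext instruction). */
uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   /* isolate lowest set mask bit */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                       /* clear that mask bit */
    }
    return result;
}

/* Scatter the low bits of src to the positions selected by mask
   (same semantics as the pdep instruction). */
uint64_t pdep_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t in_bit = 1; mask != 0; in_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & in_bit)
            result |= lowest;
        mask &= mask - 1;
    }
    return result;
}
```

For example, `pext_soft(0x12345678, 0xFF00)` extracts the second byte and gives `0x56`, and `pdep_soft(0x56, 0xFF00)` puts it back as `0x5600`.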


    AVX1 implies popcnt

    AVX1 implies SSE4.2, and SSE4.2 at least de-facto implies popcnt.

    popcnt does have its own feature bit, so a CPU can have popcnt without SSE4.2 support, but in practice the reverse (SSE4.2 without popcnt) hasn't happened. And enough software assumes that SSE4.2 implies popcnt that if a CPU violated that assumption, it would be the CPU's fault, not the software's. It's not really a plausible situation; popcnt is cheap to implement compared to SSE4.2 string instructions.
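If you'd rather verify the implication than assume it, the relevant bits are all in CPUID leaf 1 ECX: bit 20 (SSE4.2), bit 23 (popcnt) and bit 28 (AVX). A minimal sketch, with invented function names, that takes the raw ECX value from the caller:

```c
#include <stdbool.h>
#include <stdint.h>

static bool has_sse42(uint32_t leaf1_ecx)  { return (leaf1_ecx >> 20) & 1; }
static bool has_popcnt(uint32_t leaf1_ecx) { return (leaf1_ecx >> 23) & 1; }
static bool has_avx(uint32_t leaf1_ecx)    { return (leaf1_ecx >> 28) & 1; }

/* True if this CPU (or emulator/VM) breaks the de-facto rule that
   AVX or SSE4.2 support comes with popcnt. Not expected on real hardware. */
static bool violates_popcnt_assumption(uint32_t leaf1_ecx)
{
    bool implies_popcnt = has_avx(leaf1_ecx) || has_sse42(leaf1_ecx);
    return implies_popcnt && !has_popcnt(leaf1_ecx);
}
```

Checking the popcnt bit directly costs nothing extra once you've run CPUID, so there's little reason to rely on the implication in dispatch code.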


    Footnote 1: Mysticial commented

    The VIA Isaiah C4650 has AVX2 but not FMA3. Breaks a lot of programs that assume FMA3 in the presence of AVX2

    Btw, I spoke to one of the VIA architects at Hot Chips about it. And he was pissed that they allowed that to happen. IIRC, he hinted that they should've either turned off the CPUID for AVX2 or microcoded the FMA.