Disabling AVX2 in CPU for testing purposes

I've got an application that requires AVX2 to work correctly. It checks at startup whether the CPU supports AVX2 instructions. I would like to verify that this check works correctly, but I only have a CPU that has AVX2. Is there a way to temporarily turn it off for testing purposes? Or to somehow emulate a different CPU?

Solution

Yes, use an "emulation" (or dynamic recompilation) layer like Intel's Software Development Emulator (SDE), or maybe QEMU.

SDE is closed-source freeware, and very handy both for testing AVX512 code on old CPUs and for simulating old CPUs to check that you don't accidentally execute instructions that are too new.

Example: I happened to have a binary that unconditionally uses an AVX2 vpmovzxwq load instruction (for a function I was testing). It runs fine natively on my Skylake CPU, but SDE has a -snb option that emulates Sandybridge both in what CPUID reports and by actually checking every executed instruction.

 $ sde64 -snb -- ./mask
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (SANDYBRIDGE): 0x401005: vpmovzxwq ymm2, qword ptr [rip+0xff2]
Image: /tmp/mask+0x5 (in multi-region image, region# 1)
Instruction bytes are: c4 e2 7d 34 15 f2 0f 00 00
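
If you want to run that kind of check from a test script, something like the following might work. It is only a sketch: it assumes sde64 is on your PATH, it takes the -snb flag and the SDE-ERROR marker from the run above, and the function names are made up for illustration. Both output streams are captured because I'm not assuming which one SDE writes the error to.

```python
import subprocess


def find_sde_errors(output):
    """Return the lines of SDE output that contain the SDE-ERROR marker."""
    return [line.strip() for line in output.splitlines() if "SDE-ERROR" in line]


def run_under_sde(binary, chip_flag="-snb", sde="sde64"):
    """Run `binary` under SDE with the given chip flag (hypothetical helper).

    Returns the SDE-ERROR lines, so an empty list means SDE reported
    no instruction that is invalid for the emulated chip.
    """
    proc = subprocess.run(
        [sde, chip_flag, "--", binary],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout and scan both
        text=True,
        check=False,  # a non-zero exit status is expected when SDE aborts
    )
    return find_sde_errors(proc.stdout)
```

Calling run_under_sde("./mask") on the binary above should return the one SDE-ERROR line naming vpmovzxwq. For a program whose startup check works, you'd instead expect an empty list plus your own "AVX2 required" message.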

There are options to emulate CPUs as old as -quark, -p4 (SSE2), or Core 2 Merom (-mrm), to as new as IceLake-Server (-icx) or Tremont (-tnt). (And Xeon Phi CPUs like KNL and KNM.)

It runs pretty quickly, using dynamic recompilation (JIT) so code using only instructions that are supported natively can run at basically native speed, I think.

It also has instrumentation options (like -mix to dump the instruction mix), and options to control the JIT more closely. I think you could maybe get it to not report AVX2 in CPUID, but still let AVX2 instructions run without faulting.

Or probably emulate a CPU that supports AVX2 but not FMA (there is a real CPU like this from Via, unfortunately). Or combinations that no real CPU has, like AVX2 but not popcnt, or BMI1/BMI2 but not AVX. But I haven't looked into how to do that.

The basic options listed by sde -help only let you select specific Intel CPUs, check for potentially-slow SSE/AVX transitions (from missing or incorrect vzeroupper usage), and a few other things.

One important test-case that SDE is missing is AVX+FMA without AVX2 (AMD Piledriver / Steamroller, i.e. most AMD FX-series CPUs). It's easy to forget and use an AVX2 shuffle in code that's supposed to be AVX1+FMA3, and some compilers (like MSVC) won't catch this at compile time the way gcc -march=bdver2 would. (Bulldozer only has AVX + FMA4, not FMA3, because Intel changed their plans after it was too late for AMD to redesign.)

If you just want CPUID to not report the presence of AVX2 (and FMA?) so your code uses its AVX1 or non-AVX versions of functions, you can do that with most VMs, but FMA and AVX2 instructions won't fault.
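
To confirm from inside the guest that the feature really is hidden, you can look at the flags the kernel reports. This is a minimal sketch assuming an x86 Linux guest, where /proc/cpuinfo has a "flags" line listing names like avx, avx2 and fma. The function names are just for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line.

    Only a key that is exactly 'flags' matches, so other keys that merely
    contain the word (e.g. 'vmx flags' on newer kernels) are ignored.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read and parse the flags of the machine (or VM guest) we run on."""
    with open(path) as f:
        return cpu_flags(f.read())
```

In a guest with AVX2 masked, "avx2" in read_cpu_flags() should be False while "avx" is still present. Remember this only tells you what is reported, not what would fault.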

For AVX instructions to run without faulting, the OS has to set a bit in a control register (the AVX state bit in XCR0, written with XSETBV). So this works like a promise by the OS that it will correctly save/restore the new architectural state of the YMM upper halves. Disabling AVX in CPUID will give you a VM instance where AVX instructions fault, because the guest OS never sets that bit. (At least 256-bit instructions? I haven't tried this to see if 128-bit AVX instructions can still execute in this state on HW that supports AVX, but probably they fault, too.)
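
That OS-enable bit is also why a correct startup check looks at more than the AVX2 feature bit. Here is a sketch of just the decision logic, written in Python so it can be fed made-up register values. A real check would get these values from the CPUID and XGETBV instructions in native code. The bit positions are the architectural ones, but the function and constant names are mine.

```python
# CPUID leaf 1, ECX
OSXSAVE = 1 << 27  # OS has enabled XSAVE/XGETBV
AVX = 1 << 28      # CPU supports AVX
# CPUID leaf 7, subleaf 0, EBX
AVX2 = 1 << 5      # CPU supports AVX2
# XCR0 (read with XGETBV, ECX=0)
XCR0_SSE = 1 << 1  # OS saves/restores XMM state
XCR0_AVX = 1 << 2  # OS saves/restores YMM upper halves


def os_enabled_avx(leaf1_ecx, xcr0):
    """True if the OS promised to save/restore YMM state."""
    if not (leaf1_ecx & OSXSAVE):
        return False  # XGETBV isn't usable, so xcr0 can't be trusted
    needed = XCR0_SSE | XCR0_AVX
    return (xcr0 & needed) == needed


def avx2_usable(leaf1_ecx, leaf7_ebx, xcr0):
    """True only if the CPU reports AVX and AVX2 and the OS enabled AVX state."""
    return (
        bool(leaf1_ecx & AVX)
        and os_enabled_avx(leaf1_ecx, xcr0)
        and bool(leaf7_ebx & AVX2)
    )
```

Keeping the decision separate from the code that reads the registers also means you can unit-test it with values for a Sandybridge-like CPU (AVX set, AVX2 clear) without needing one.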

This mechanism for getting instructions to fault (not telling a guest OS in a VM about them so it doesn't set a control-register bit) only works for extensions that introduced new register state the OS needs to save/restore on context-switch: SSE1, AVX1, AVX-512, and AMX.

Disabling just AVX2 and leaving AVX enabled in a VM will let instructions like AVX2 vpermps run, even though CPUID doesn't report AVX2. That's the difference between using a hardware-accelerated VM vs. software emulation like SDE, or QEMU without KVM.

The upside is that everything runs at full performance, so a guest VM with some CPUID features disabled is useful for benchmarking your AVX1 or SSE4 code-paths. Actual older CPUs may behave differently, though: Sandybridge, for example, was slow with misaligned 256-bit loads, with a much bigger penalty than Haswell and later.

VMs will also work for testing that your program executes correctly on its fallback code-paths; they just can't catch stray instructions that would fault on actual older CPUs. SDE or QEMU will also work for that correctness-testing use-case, so it's a matter of workflow convenience.