Why do MWAIT Power Management hints cause premature wakeups?

For university I'm currently experimenting with the MONITOR/MWAIT instruction pair. Specifically, I want to measure how much energy the CPU uses in different scenarios and have already programmed a relatively well working test setup. As part of the setup, I have all the cores enter MWAIT and then use a NMI to wake them up again after a specified time. So far everything was working fine, but now I wanted to test how the Power Management hints affect the power consumption.

Unfortunately, every hint apart from 0 seems to cause MWAIT to not wait for the NMI, but to wake up on its' own after 3-4 ms. As far as I understand the documentation, the Power Management hints should not have any impact on when execution is continued after the MWAIT, so this is quite strange. And since I still haven't made any progress even after spending a few hours on this problem, I thought maybe someone here has some idea what is going on!

Here is how I use MONITOR/MWAIT in my code:

volatile int dummy;

void do_mwait() {
    asm volatile("monitor;" ::"a"(&dummy), "c"(0), "d"(0));
    asm volatile("mwait;" ::"a"(0x10), "c"(0));
}

This is obviously just a small excerpt of the Linux kernel module I've written, but should contain all the important points. dummy is a variable that is never used outside of what you can see here. It only exists so that I have a valid address to pass to monitor. do_mwait() is the function that gets executed on every core available while I do my measurements. As I said, just exchanging the 0x10 in the second line of do_mwait() with 0 makes it work the way I expect.

Because the behaviour and supported features of MONITOR/MWAIT depend on the specific CPU model, here are all the relevant (I think) parts of cpuid on my test machine. As far as I see, all necessary features should be supported:

CPU 0:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = 0x6 (6)
      model           = 0xc (12)
      stepping id     = 0x3 (3)
      extended family = 0x0 (0)
      extended model  = 0x3 (3)
      (family synth)  = 0x6 (6)
      (model synth)   = 0x3c (60)
      (simple synth)  = Intel Core (unknown type) (Haswell C0) {Haswell}, 22nm
   ...
   feature information (1/ecx):
      ...
      MONITOR/MWAIT                           = true
      ...
   ...
   MONITOR/MWAIT (5):
      smallest monitor-line size (bytes)       = 0x40 (64)
      largest monitor-line size (bytes)        = 0x40 (64)
      enum of Monitor-MWAIT exts supported     = true
      supports intrs as break-event for MWAIT  = true
      number of C0 sub C-states using MWAIT    = 0x0 (0)
      number of C1 sub C-states using MWAIT    = 0x2 (2)
      number of C2 sub C-states using MWAIT    = 0x1 (1)
      number of C3 sub C-states using MWAIT    = 0x2 (2)
      number of C4 sub C-states using MWAIT    = 0x4 (4)
      number of C5 sub C-states using MWAIT    = 0x0 (0)
      number of C6 sub C-states using MWAIT    = 0x0 (0)
      number of C7 sub C-states using MWAIT    = 0x0 (0)
   ...
   brand = "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"
   ...

I hope this is enough context. Please tell me if I need to share additional information. And thanks in advance for any input, even if it's just an (educated) guess!

Solution

From Intel's manual (https://www.felixcloutier.com/x86/mwait)

The following cause the processor to exit the implementation-dependent-optimized state: a store to the address range armed by the MONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent-optimized state.

...

Implementation-specific conditions may result in an interrupt causing the processor to exit the implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.

This might be an example of either of those.

Obviously that's not very satisfying, and it would be nice if I knew where to look to find any microarchitectural explanation of why that might only happen with deeper sleep states.

Like perhaps in C1 state (EAX=0), enough is still powered on to check the interrupt mask, but in deeper sleep states even some of the internal interrupt controller stuff is powered down?

You'd hope that they could check without powering up the whole core and resuming execution, but maybe that's something a microcode update enabled as a workaround for some design problem that was discovered later. Intel's Haswell errata list (https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf) does mention some C-state related problems where "the BIOS can contain a workaround", which actually means the BIOS can include a microcode update that changes CPU behaviour. Intel often includes ways for microcode to disable certain optimizations or features in case problems are found, and perhaps the only way to fix some lockup in an odd corner case was to always have the core wake from C2 or deeper on interrupts, even when they're supposed to be masked.

That's pure guesswork as to the cause, but Intel does clearly document that what you're seeing is possible.

Other possibilities include SMI (system-management-mode interrupts), but hopefully your system doesn't fire those regularly.