Other than certain normal specified conditions where interrupts are not delivered to the virtual processor (cli, if=0, etc), are all instructions in the guest actually interruptible?
That is to say, when an incoming hardware interrupt is given to the LAPIC then to the processor, supposedly some internal magic happens to translate that to a virtual interrupt to the guest (using virtual APIC, no exiting). When that happens, does the currently executing instruction immediately serialize the OOO stream and jump to the vector like a typical interrupt delivery, or does VT-x's virtual interrupt delivery cause some other kind of resolution to take place?
The context is that it's often very valuable to stress test race conditions and synchronization primitives using an emulator. One would hope your emulator would be capable of receiving an interrupt at any instruction, to trigger "interesting behavior".
This leads to the question, does VT-x virtualization provide the same instruction level granularity of interrupts such that "interesting behavior" would be triggered the same as a pure instruction emulator?
The Intel SDM does note that virtual interrupts are delivered on instruction boundaries, but there's still some questions as to whether or not all boundaries normally valid on the chip are always still checking for interrupts in VT-x mode.
I don't see any reason why being inside the guest would be special wrt. what happens when an external interrupt arrives. (Assuming we're talking about the guest code running natively on the pipeline, not the root emulating and/or deciding to delay re-entry to the guest in anticipation of another interrupt. See comments.)
They're not going to effectively block interrupts for multiple instructions; that would hurt interrupt latency. Since there aren't special "sync points" where interrupts can only be delivered there, the pipeline needs to be able to handle interrupts between arbitrary instructions. Out-of-order exec always potentially has a lot going on, so you can't count on waiting for any specific state before handling an interrupt; that could take too long. If the gap between one pair of instructions is ok to deliver an interrupt, why not any other?
CPUs don't rename the privilege level, so yes they'd roll back to the retirement state, discard all in-flight instructions in the back-end, and then figure out what to do based on the current state. See also When an interrupt occurs, what happens to instructions in the pipeline?
This completely untested guess is based on my understanding of CPU architecture. If there was a measurable effect on interrupt latency, that might be a real thing.
In practice, regardless of VT-X, some instruction-boundaries will probably be impossible to interrupt unless single-stepping.
Retirement bandwidth is 3 per clock (Nehalem), 4 per clock per logical thread (Haswell), or even higher in Skylake. Retirement from the out-of-order core is usually bursty, because it happens in-order (to support precise exceptions), this is why we have a ROB separate from the Reservation Station.
It's very common for one instruction to block retirement of later independent instructions for a while, and then for a burst of retirement along with that instruction. e.g. a cache-miss load, or the end of a long dep chain right before some independent instructions.
So for some functions or blocks of code, it's likely that every single time they run, an xor
-zeroing instruction for example always retires in the same cycle as the instruction before. That means the CPU is never in a state where the xor-zeroing instruction is the oldest non-retired instruction, and thus gap between it and the insn before can never be the place where an interrupt appears.
If you have 2 instructions closely following each other, e.g. one coming in a cycle after the CPU returns to user-space from an earlier instruction, you could end up with front-end effects at 64-byte I-cache boundaries disturbing the usual pattern of cheap independent instructions like nop
or xor
-zeroing always retiring in the same cycle as an earlier higher-latency instruction, but there might still be cases that were non-disturbable where fetch and 5-wide decode, and 4-wide issue/rename will reliably get instructions into the pipe together without opportunity for having the slow one finish before the fast one after it is ready to retire.
As I said, this isn't specific to VT-x.