x86, cpu-architecture, intel, micro-architecture

Are any instructions affected by IA32_UARCH_MISC_CTL[DOITM] in existing CPUs?


In the document titled Data Operand Independent Timing Instruction Set Architecture (ISA) Guidance, Intel introduces a new IA32_UARCH_MISC_CTL MSR where setting bit 0 enables the "Data Operand Independent Timing Mode" (DOITM). This MSR is available on Intel Core code-named Ice Lake, Atom code-named Gracemont, and newer CPUs (e.g. Alder Lake at the time of writing). Processors before Ice Lake and Gracemont behave as if DOIT mode is always enabled.
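
For reference, here is a minimal sketch (my own, not from Intel's documents) of how the MSR could be inspected from user space on Linux via the /dev/cpu/N/msr interface; the register address 0x1b01 is taken from the Linux kernel's msr-index.h and should be double-checked against the SDM. It assumes the msr module is loaded and the program runs as root:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_UARCH_MISC_CTL 0x1b01ULL   /* assumed address, matches Linux msr-index.h */
#define DOITM_BIT               (1ULL << 0)

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr (is the msr module loaded?)");
        return 1;
    }

    /* the msr character device addresses registers by file offset */
    if (pread(fd, &val, sizeof(val), MSR_IA32_UARCH_MISC_CTL) != (ssize_t)sizeof(val)) {
        perror("rdmsr");  /* fails if the CPU does not implement this MSR */
        return 1;
    }

    printf("IA32_UARCH_MISC_CTL = %#llx, DOITM is %s\n",
           (unsigned long long)val, (val & DOITM_BIT) ? "enabled" : "disabled");
    close(fd);
    return 0;
}
```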

On Intel Core CPUs, integer instructions that do not involve complex microcode are usually understood to have fixed latency with respect to data operands, but not address operands (integer division appears to run with fixed latency since Ice Lake, and is microcoded before that [uops.info]). Floating-point instructions may behave differently when subnormal inputs or results are involved. This seems natural, as the CPU backend needs to know ahead of time when results become available in order to schedule dependent operations.
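
To illustrate the floating-point exception to this rule, the following sketch (mine, not from any Intel material) times a multiply dependency chain with a normal versus a subnormal multiplier; on most Intel cores the subnormal chain is dramatically slower because producing subnormal results typically triggers microcode assists. It assumes x86-64, GCC or Clang, a fixed-rate TSC, and a build without -ffast-math (which would enable flush-to-zero):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

__attribute__((noinline))
static uint64_t time_fp_chain(float mul)
{
    volatile float sink;
    float acc = 1.0f;

    uint64_t start = __rdtsc();
    for (int i = 0; i < 100000; i++)
        acc = acc * mul + 1.0f;   /* acc feeds the next iteration (latency chain) */
    uint64_t end = __rdtsc();

    sink = acc;                   /* keep the result alive */
    (void)sink;
    return end - start;
}

int main(void)
{
    /* 0.5f keeps every product normal; 1e-42f is subnormal, so every product is subnormal */
    printf("normal multiplier:    %llu cycles\n",
           (unsigned long long)time_fp_chain(0.5f));
    printf("subnormal multiplier: %llu cycles\n",
           (unsigned long long)time_fp_chain(1e-42f));
    return 0;
}
```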

In the DOIT document Intel appears to be envisioning further CPU optimizations where some instructions may have shorter latencies depending on their data operands. The exact nature of these optimizations is not disclosed, except for the confusing phrase:

for example, enabling data operand independent timing might disable data-dependent prefetching

which is hard to interpret as applying to data operands, but not address operands.

A specific example of an instruction that could meaningfully have different latency depending on a data operand is the IMUL instruction: when multiplying by an operand that is known at register renaming time to be zero (because it was previously zeroed at the rename stage by the xor same, same idiom), the result can be resolved to zero at the rename stage as well, for zero execution latency instead of three or four cycles. A similar technique could be applied to many basic single-cycle ALU operations (e.g. resolving the result of an ADD/OR/XOR with a renamed-to-zero operand to the second operand).
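
A hypothetical experiment to check for such a shortcut could look like the following sketch (mine, not from any Intel material): it times a dependency chain of IMULs whose multiplier register was either zeroed with the xor idiom or loaded with a non-zero value. If a rename-stage multiply-by-known-zero optimization existed, the first chain would run much faster; on current CPUs both should cost roughly three cycles per IMUL. Assumes x86-64 and GCC/Clang extended asm:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define CHAIN_LEN 10000

__attribute__((noinline))
static uint64_t time_imul_chain(int zero_multiplier)
{
    long acc = 1, mul;
    uint64_t start, end;

    if (zero_multiplier)
        asm volatile("xor %0, %0" : "=r"(mul));   /* the xor same,same zeroing idiom */
    else
        asm volatile("mov $3, %0" : "=r"(mul));   /* ordinary non-zero value */

    start = __rdtsc();
    for (int i = 0; i < CHAIN_LEN; i++)
        /* acc is both source and destination, so IMUL latency accumulates */
        asm volatile("imul %1, %0" : "+r"(acc) : "r"(mul));
    end = __rdtsc();

    return end - start;
}

int main(void)
{
    printf("multiplier zeroed by xor: %llu cycles\n",
           (unsigned long long)time_imul_chain(1));
    printf("non-zero multiplier:      %llu cycles\n",
           (unsigned long long)time_imul_chain(0));
    return 0;
}
```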

In DOIT mode some, but not all, instructions are guaranteed to have fixed latency with respect to data operands, just like on pre-Ice Lake CPUs. Such instructions are enumerated in the accompanying document titled Data Operand Independent Timing Instructions. Somewhat confusingly, the list includes instructions like LDDQU and POP that do not have any data operands.

Does toggling the DOIT mode bit actually change anything on today's CPUs? Are there any instructions among those listed in the second document that already behave differently depending on the content of a data operand (or whether it is renamed to the zero register)?


Solution

  • There are no instructions on current Intel CPUs where the DOIT mode bit affects execution latency with respect to data operands.

    In a message to LKML, Intel employee Dave Hansen provides extra background on this feature:

    The execution latency of the DOIT instructions[1] does not depend on the value of data operands on all currently-supported Intel processors. This includes all processors that enumerate DOITM support. There are no plans for any processors where this behavior would change, despite the DOITM architecture theoretically allowing it.

    So, what's the point of DOITM in the first place? Fixed execution latency does not mean that programs as a whole will have constant overall latency. DOITM currently controls features that do not affect execution latency but may, for instance, affect overall program latency through side effects of prefetching on the cache. Even with fixed instruction execution latency, these side effects can matter, especially to the paranoid.

    Today, the only such feature Intel calls out explicitly is the data-dependent prefetching mentioned in the quote above.

    Essentially, that means that DOIT mode, contrary to its name and documentation, currently affects memory dependencies via address operands but has no effect on data operands. However, the latency of memory dependencies was never expected to be fixed in the first place (and is not fixed in practice), so Intel's messaging seems really confusing here.

    And what matters for optimizations that open side channels, like data-dependent prefetching (DDP), is not the timing, but whether they happen at all. What matters to an attacker in the DDP case is the observable effect on the cache.