x86 cpu-cache microbenchmark cpuid micro-architecture

Will CPUID serialize speculative data caching?

I found descriptions of speculative data caching in several instruction entries in Intel's manual, Vol. 2.

For example, the LFENCE entry says:

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the LFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an LFENCE instruction.

I also read online that speculative caching can move data from a farther cache level to a closer one.

I want to know whether CPUID, the strongest serializing instruction, prevents speculative caching across the barrier.

I've already searched the CPUID entry in Intel Vol. 2 and the "serializing instructions" section in Intel Vol. 3, but neither says anything about speculative data caching.

Solution

LFENCE is already strong enough (in practice at least) to stop the CPU from actually looking at load instructions after it, but the CPU is free to speculatively load for other reasons.

Stopping that would require some kind of lookahead past the barrier to find out what addresses to disable HW prefetch for. That's not practical at all. CPUID or other serializing instructions aren't any stronger than LFENCE for stopping load prefetches.
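For context, here is a minimal sketch of the kind of fenced timing sequence this question is usually about. It assumes an x86 target and a GCC or Clang toolchain that provides `<x86intrin.h>`; the function name `timed_load` is made up for illustration. The fences keep the load instruction itself from executing before the first timestamp or after the second, but nothing here stops a hardware prefetcher from having already pulled the line into cache.

```c
#include <stdint.h>
#include <x86intrin.h>  /* _mm_lfence, __rdtsc (GCC/Clang, x86 only) */

/* Hypothetical helper: time one load in TSC reference ticks.
 * LFENCE orders execution of the instructions around it; it does not
 * order or block hardware prefetch of the cache line at p. */
static inline uint64_t timed_load(const volatile char *p) {
    _mm_lfence();              /* earlier instructions complete first */
    uint64_t t0 = __rdtsc();
    _mm_lfence();              /* the load cannot start before rdtsc */
    char v = *p;               /* the access being timed */
    (void)v;
    _mm_lfence();              /* the load completes before the 2nd rdtsc */
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    return t1 - t0;
}
```

Replacing the fences with CPUID would make the sequence slower and noisier, and per the reasoning above it would not change whether the line was prefetched beforehand.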

The CPU is always allowed to speculatively fetch from memory in WB and WT regions / pages. Intel's optimization manual documents some details of the hardware prefetchers in some of their CPU models, so in practice you could avoid doing things before CPUID that are likely to trigger such prefetches.

(WC is weakly-ordered uncacheable+write-combining, but speculative fetch is also allowed there on paper. In real life that probably only happens in the shadow of a branch mispredict, not HW prefetch. It's not normally cacheable like WB and WT.)

If you're microbenchmarking a real CPU, the trick to some kinds of microbenchmarks is to find an access pattern that won't trigger HW prefetching, or to disable the HW prefetchers.
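One common way to get such an access pattern is pointer chasing through a random cycle, so each address depends on the previous load and there is no stride to detect. The sketch below is only an illustration under stated assumptions: a 64-byte cache line, and names (`node`, `build_random_cycle`, `chase`) invented here. It uses Sattolo's algorithm so that the permutation is a single cycle that visits every node. An adjacent-line prefetcher could still fetch a neighbouring line, so treat this as reducing prefetch effects rather than guaranteeing none.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One node per cache line; the 64-byte line size is an assumption. */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node;

/* Small xorshift PRNG so the sketch is deterministic for a given seed. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Link nodes[0..n-1] into one random cycle (Sattolo's algorithm). */
static void build_random_cycle(node *nodes, size_t n, uint64_t seed) {
    assert(n >= 2 && seed != 0);
    size_t *perm = malloc(n * sizeof *perm);
    assert(perm != NULL);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);  /* 0 <= j < i */
        size_t t = perm[i];
        perm[i] = perm[j];
        perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[perm[i]];
    free(perm);
}

/* Follow the chain; each load's address depends on the previous load. */
static node *chase(node *start, size_t steps) {
    node *p = start;
    for (size_t i = 0; i < steps; i++)
        p = p->next;
    return p;  /* returning p keeps the loads from being optimized away */
}
```

To use it, allocate `n` nodes (ideally line-aligned, for example with `aligned_alloc(64, n * sizeof(node))`), call `build_random_cycle`, then time `chase` over many steps and divide by the step count. Because the cycle has length `n`, chasing exactly `n` steps returns to the starting node.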

Maybe in theory you could have an x86 CPU that looked ahead in the instruction stream for load/store instructions and speculatively prefetched for them, separate from actually executing them (which Intel's definition of LFENCE would block). I don't think anything would stop it from doing that across CPUID either.

Probably nobody will design such a CPU, because:

- It's not worth the transistors / power. Starting prefetch as soon as regular out-of-order execution can get to it is already good enough. And except for absolute / RIP-relative addresses or direct jumps, you'd need register values from the OoO core to get a useful prefetch address.
- Looking past LFENCE / CPUID is perverse; they're rare enough that defeating speculative "execution" of loads past them is part of the point, in the age of Spectre.