x86, cpu-architecture, branch-prediction

Is there automatic L1i cache prefetching on x86?


I looked at the wiki article on the branch target predictor, and it's somewhat confusing.

I thought the branch target predictor comes into play when a CPU decides which instruction(s) to fetch next (into the CPU pipeline to execute).

But the article mentions some points like this:

Instruction cache fetches block of instructions

Instructions in block are scanned to identify branches

So, does the instruction cache (L1i, I imagine) prefetch instructions based on some branch target prediction data?

Or is the article describing something other than x86, or am I misunderstanding something?


Solution

  • On Itanium (an Intel architecture, but not x86), there was L1i prefetching, and in fact there were performance monitoring events for it: L1I_PREFETCH_MISS_RATIO, L1I_PREFETCHES, L2_INST_PREFETCHES, ... However, I'm not seeing any L1I prefetch events for Haswell or Skylake. There are ITLB events, but none for L1I prefetching. If L1I prefetching were going on, I would expect performance monitoring events that measure it, for use by tools like VTune.

    You didn't say which microarchitecture you're asking about, but I think the lack of such performance monitoring events for Haswell and Skylake strongly implies that Intel x86-64 CPUs in general do no separate I-cache prefetching. The only L1i fills are the ones actually triggered by the fetch stage, using addresses generated by branch prediction.

    There is significant buffering between fetch and decode in recent x86 CPUs, and between decode and rename/allocate into the back end. (See Kanter's Haswell writeup and the Skylake page on WikiChip.) So the fetch stage, and the front end in general, runs far enough ahead of execution to serve a similar purpose to the L1d HW prefetchers for load/store data, but driven by branch prediction instead of sequential access patterns.

    Much of the hardware prefetch logic in Intel CPUs is in the L2 cache, which is unified code/data, and it does look for sequential access patterns. L2 hit latency is low enough that an L1i miss which hits in L2 is not a big deal, given the buffering in the pipeline.
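
To make the "fetch runs ahead of execution" point concrete, here is a toy model, not a description of any real Intel front end. Everything in it is an assumption chosen for illustration: the fixed 4-byte instruction size (real x86 instructions are variable-length), the 64-byte line size, the fetch and execute widths, the queue depths, and the hypothetical one-entry BTB. It also assumes every prediction is correct and ignores cache-miss latency. It only shows that a deeper buffer between fetch and execute makes the fetch stage request a branch target's cache line earlier relative to when execution needs it.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache-line size
INSN_BYTES = 4   # simplification: fixed-size instructions (real x86 is variable-length)


def predicted_fetch_stream(start, btb, count):
    """Return `count` fetch addresses, following a toy BTB.

    `btb` maps a branch's address to its predicted target; any address
    not in the BTB is assumed to fall through to the next instruction.
    """
    stream = []
    addr = start
    for _ in range(count):
        stream.append(addr)
        addr = btb.get(addr, addr + INSN_BYTES)
    return stream


def simulate(stream, queue_depth, fetch_width=2, exec_width=1):
    """Return {line: (cycle fetch first requested it, cycle it was first executed)}."""
    queue = deque()            # buffer between the fetch stage and the back end
    requested, executed = {}, {}
    next_fetch = 0
    done = 0
    cycle = 0
    while done < len(stream):
        # Fetch stage: follows the predicted path while the buffer has room.
        for _ in range(fetch_width):
            if next_fetch < len(stream) and len(queue) < queue_depth:
                addr = stream[next_fetch]
                requested.setdefault(addr // LINE_BYTES, cycle)
                queue.append(addr)
                next_fetch += 1
        # Back end: drains the buffer more slowly than fetch can fill it.
        for _ in range(exec_width):
            if queue:
                addr = queue.popleft()
                executed.setdefault(addr // LINE_BYTES, cycle)
                done += 1
        cycle += 1
    return {line: (requested[line], executed[line]) for line in requested}


# Hypothetical program: a predicted-taken branch at 0x1010 jumping to 0x2000.
btb = {0x1010: 0x2000}
stream = predicted_fetch_stream(0x1000, btb, 20)
target_line = 0x2000 // LINE_BYTES

for depth in (1, 8):
    req, exe = simulate(stream, queue_depth=depth)[target_line]
    print(f"queue depth {depth}: target line requested {exe - req} cycles before first use")
```

With a one-entry buffer the target line is requested in the same cycle it is needed. With the deeper buffer, the request goes out a few cycles earlier, which is the prefetch-like effect described above. The cycle counts themselves mean nothing outside this toy model.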