x86 intel cpu-architecture branch-prediction spectre

The inner workings of Spectre (v2)

I have done some reading about Spectre v2 and obviously you get the non technical explanations. Peter Cordes has a more in-depth explanation but it doesn't fully address a few details. Note: I have never performed a Spectre v2 attack so I do not have hands on experience. I have only read up about about the theory.

My understanding of Spectre v2 is that you make an indirect branch mispredict for instance if (input < data.size). If the Indirect Target Array (which I'm not too sure of the details of -- i.e. why it is separate from the BTB structure) -- which is rechecked at decode for RIPs of indirect branches -- does not contain a prediction then it will insert the new jump RIP (branch execution will eventually insert the target RIP of the branch), but for now it does not know the target RIP of the jump so any form of static prediction will not work. My understanding is it is always going to predict not taken for new indirect branches and when Port 6 eventually works out the jump target RIP and prediction it will roll back using the BOB and update the ITA with the correct jump address and then update the local and global branch history registers and the saturating counters accordingly.

The hacker needs to train the saturating counters to always predict taken which, I imagine, they do by running the if(input < data.size) multiple times in a loop where input is set to something that is indeed less than data.size (catching errors accordingly) and on the final iteration of the loop, make input more than data.size (1000 for instance); the indirect branch will be predicted taken and it will jump to the body of the if statement where the cache load takes place.

The if statement contains secret = data[1000] (A particular memory address (data[1000]) that contains secret data is targeted for loading from memory to cache) then this will be allocated to the load buffer speculatively. The preceding indirect branch is still in the branch execution unit and waiting to complete.

I believe the premise is that the load needs to be executed (assigned a line fill buffer) before the load buffers are flushed on the misprediction. If it has been assigned a line fill buffer already then nothing can be done. It makes sense that there isn't a mechanism to cancel a line fill buffer allocation because the line fill buffer would have to pend before storing to the cache after returning it to the load buffer. This could cause line fill buffers to become saturated because instead of deallocating when required (keeping it in there for speed of other loads to the same address but deallocating when the there are no other available line buffers). It would not be able to deallocate until it receives some signal that a flush is not going to occur, meaning it has to halt for the previous branch to execute instead of immediately making the line fill buffer available for the stores of the other logical core. This signalling mechanism could be difficult to implement and perhaps it didn't cross their minds (pre-Spectre thinking) and it would also introduce delay in the event that branch execution takes enough time for hanging line fill buffers to cause a performance impact i.e. if data.size is purposefully flushed from the cache (CLFLUSH) before the final iteration of the loop meaning branch execution could take up to 100 cycles.

I hope my thinking is correct but I'm not 100% sure. If anyone has anything to add or correct then please do.

Solution

Sometimes the term "BTB" is used collectively to refer to all of the buffers used by the branch prediction unit. However, there are actually multiple buffers all of which are used in every cycle to make target and direction predictions. In particular, the BTB is used to make predictions for direct branches, the ITB (indirect target buffer) is used to make predictions for indirect branches except for returns, and the RSB is used to make predictions for returns. The ITB is also called the IBTB or Indirect Target Array. All of these terms are used by different vendors and researchers. Typically, the BTB is used to make initial predictions for all kinds of branch instructions when the other buffers miss. But later the predictor learns more about the branches and the other buffers come into play. If multiple dynamic instances of the same indirect branch have all the same target, then the BTB might also be used instead of the ITB. The ITB is much more accurate when the same branch has multiple targets and is designed specifically to deal with such branches. See: Branch prediction and the performance of interpreters — Don't trust folklore. The first Intel processor that implemented separate BTB and ITB structures is the Pentium M. All later Intel Core processors have dedicated ITBs.

The Spectre V1 exploit is based on training the BTB using an attacker program so that when the victim executes a branch that aliases the same BTB entry, the processor is tricked into speculatively executing instructions (called the gadget) to leak information. The Spectre V2 exploit is similar but is based on training the ITB instead. The crucial difference here is that in V1, the processor mispredicts the direction of the branch, while in V2, the processor mispredicts the target of the branch (and, in case of a conditional indirect branch, the direction as well because we want it to be taken). In programs that are interpreted, JIT-compiled, or make use of dynamic polymorphism, there can be many indirect branches (other than returns). A particular indirect branch may never be intended to go to some location, but by mistraining the predictor, it can be made to jump anywhere we want. It is exactly for this reason why V2 is very powerful; no matter where the gadget is and no matter what the intentional control flows of the program are, you can pick one of the indirect branches and make it speculatively jump to the gadget.

Note that typically the linear address of the target of a static direct branch remains the same throughout the lifetime of the program. There is only one situation where this may not be the case: dynamic code modification. So at least in theory, a Spectre exploit can be developed based on target misprediction of direct branches.

Regarding reclamation of LFBs, I don't really understand what you're saying. When a load request that missed the L1D receives the data into the LFB, the data is immediately forwarded to the bypass interconnect of the pipeline. There needs to be a way to determine which load uop has requested this data. The data returned must be tagged with the uop ID of the load. The sources of the uops in the RS that are waiting for the data are represented as the uop IDs of the loads. In addition, the ROB entry that holds the load uop needs to be marked as completed so that it can be retired and, in pre-SnB, the returned data needs to be written into the ROB. If, on pipeline flush, an outstanding load request in an LFB is not cancelled, and if the load uop ID got reused for some other uop, when the data arrives, it might be incorrectly forwarded to whatever new uops are currently in the pipeline, thereby corrupting the microarchitectural state. So there needs to be a way to ensure that this does not happen under any circumstances. It's very possible to cancel outstanding load requests and speculative RFOs on a pipeline flush by simple marking all of the valid LFB entries as "cancelled", just so that the data is not returned to the pipeline. However, the data might still be fetched and filled in into one or more levels of caches. Requests in the LFB are identified by line-aligned physical addresses. There can be other possible designs.

I've decided to run an experiment to determine exactly when the LFBs get deallocated on Haswell. Here is how it works:

Outer Loop (10K iterations):

Inner Loop (100 iterations):
10 load instructions to different cache lines most of which miss the L2.
LFENCE.
A sequence of IMULs to delay the resolution of the jump by 18 cycles.
Jump to inner.

3 load instructions to different cache lines.
LFENCE.
Jump to outer.

For this to work, hyperthreading and both L1 prefetchers need to be turned off to ensure that we own all of the 10 LFBs of the L1.

The LFENCE instructions ensure that we don't run out of LFBs when executing on a correctly predicted path. The key idea here is that the inner jump will be mispredicted once per outer iteration, so up to 10 loads of the inner iteration that are on the mispredicted path can be allocated in the LFBs. Note that the LFENCE prevents loads from later iterations to be allocated. After a few cycles, the inner branch will be resolved and a misprediction occurs. The pipeline is cleared and the frontend is resteered to fetch and execute the load instructions in the outer loop.

There are two possible outcomes:

The LFBs that have been allocated for the loads on the mispredicted path are immediately released as part of the pipeline clear operation and made available for other loads. In this case, there will be no stalls due to LFB unavailability (counted using L1D_PEND_MISS.FB_FULL).
The LFBs are released only when the loads get serviced irrespective of whether they were on a mispredicted path.

When there are three loads in the outer loop after the inner jump, the measured value of L1D_PEND_MISS.FB_FULL is about equal to the number of outer iterations. That's one request per outer loop iteration. This means that when the three loads on the correct path get issued to the L1D, the loads from the mispredcited path are still occupying the 8 LFB entries, resulting in an FB full event for the third load. This suggests that loads in the LFBs only get deallcoated when the load actually completes.

If I put less than two loads in the outer loop, there will be basically no FB full events. There is one thing I noticed: for every additional load in the outer loop beyond three loads, the L1D_PEND_MISS.FB_FULL gets increased by about 20K instead of the expected 10K. I think what's happening is that when a load request of a load uop gets issued to the L1D for the first time and the all LFBs are in use, it gets rejected. Then when an LFB becomes available, two loads pending in the load buffer are sent to the L1D, one will be allocated in the LFB and the other will get rejected. So we get two LFB full events per additional load. However, when there are three loads in the outer loop, only the third one would be waiting for an LFB, so we get one event per outer loop iteration. Essentially, the load buffer cannot distinguish between having one LFB available or two LFBs; it only gets to know that at least one LFB is free and so it attempts to send two load requests at the same time since there are two load ports.