Tags: cpu-architecture, branch-prediction

TAGE prediction accuracy improves with loop over larger array?


The code snippet below iterates through a 1D array (N is the size of the array).

for (i = 0; i < N; i++)   // outer loop over the rows
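
For reference, a self-contained version of the kind of loop I'm simulating (the array setup and the summing body are my guess; only the loop header is shown above):

    #include <stddef.h>

    long sum_array(const int *A, size_t N)
    {
        long sum = 0;
        for (size_t i = 0; i < N; i++)   /* the loop branch whose accuracy is measured */
            sum += A[i];
        return sum;
    }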

When I run this piece of code on a processor simulator to measure TAGE accuracy, I notice that as the size of the array (N) increases, the TAGE accuracy increases.

What is the reason for this?


Solution

  • Loop branches typically mispredict only on the last iteration, when execution falls through out of the loop instead of jumping back to the top. (For fairly obvious reasons: the predictor quickly learns that the branch is almost always taken, and predicts it that way.)

    The more iterations your loops run, the more correctly-predicted taken branches you have for the same number of not-taken special cases that mispredict.
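
    To put rough numbers on it (a back-of-the-envelope model, assuming the loop exit is the only misprediction):

        one pass over a loop of N iterations:
            N-1 taken branches, all predicted correctly
            1 fall-through at the exit, mispredicted
            accuracy ≈ (N-1)/N

        N = 10    →  ~90%   accuracy on the loop branch
        N = 1000  →  ~99.9% accuracy

    So simply growing N pushes the measured accuracy toward 100% without the predictor itself getting any better.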


    Fun fact: on modern Intel CPUs (like Haswell / Skylake), the IT-TAGE branch predictor can "learn" a pattern of up to about 22 iterations, correctly predicting the loop exit. Given a very long outer loop that gives the CPU time to learn the pattern, an inner loop that runs only 22 or fewer iterations tends to have even its loop-exit branch predicted correctly. So there's a significant dropoff in performance (and instruction throughput) when the inner-loop trip count grows past that point, if the loop body is fairly simple.
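
    A minimal microbenchmark sketch of the kind of loop that exposes this threshold (my own construction, not from the question; build it twice with different INNER values and compare branch misses):

        #include <stdio.h>

        #ifndef INNER
        #define INNER 20                /* try values below and above ~22 */
        #endif

        int main(void)
        {
            volatile int keep = 1;      /* volatile read keeps the inner loop from being optimized away */
            long total = 0;
            /* long outer loop: gives the predictor time to learn the inner loop's pattern */
            for (long outer = 0; outer < 10000000L; outer++) {
                /* inner-loop exit branch: predictable while INNER stays at or below ~22 */
                for (int inner = 0; inner < INNER; inner++)
                    total += keep;
            }
            printf("%ld\n", total);
            return 0;
        }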

    But it probably takes quite a few outer-loop iterations to train the predictor with that much history. I was testing with about 10 million outer-loop iterations, to average out noise and per-process startup overhead, using perf stat on real hardware under Linux. So the startup / learning phase was negligible.
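
    For reference, the sort of measurement I mean, assuming the sketch above is saved as loops.c (the file and binary names are placeholders):

        gcc -O1 -DINNER=20 loops.c -o loops
        perf stat -e branches,branch-misses ./loops

    With INNER=20 the branch-misses count should stay tiny; rebuilt with -DINNER=40, it should jump to roughly one miss per outer-loop iteration.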

    With older, simpler branch predictors (before TAGE), I think some CPUs did implement loop-pattern prediction, with a counter to predict the exit of inner loops that ran a constant number of iterations every time they were reached. https://danluu.com/branch-prediction/ says the same: that "modern CPUs" "often" have such predictors.
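
    As a sketch of that idea (a toy model of my own, not any specific CPU's design): the predictor remembers how many times the branch was taken in the previous run of the loop, and predicts not-taken once the current run reaches that count.

        /* toy counter-based loop predictor for a single branch */
        struct loop_pred {
            int takens_last_run;   /* taken outcomes seen in the previous run of the loop */
            int takens_this_run;   /* taken outcomes seen so far in the current run */
        };

        /* 1 = predict taken (keep looping), 0 = predict not-taken (loop exit) */
        int predict(const struct loop_pred *p)
        {
            return p->takens_this_run < p->takens_last_run;
        }

        /* feed the branch's actual outcome back into the predictor */
        void update(struct loop_pred *p, int taken)
        {
            if (taken) {
                p->takens_this_run++;
            } else {                   /* loop exited: remember this run's trip count */
                p->takens_last_run = p->takens_this_run;
                p->takens_this_run = 0;
            }
        }

    Once the loop has completed one full run, every later run with the same trip count predicts correctly, including the exit.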