I have two arrays: A
with N_A
random integers and B
with N_B
random integers between 0
and (N_A - 1)
. I use the numbers in B
as indices into A
in the following loop:
for(i = 0; i < N_B; i++) {
sum += A[B[i]];
}
Experimenting on an Intel i7-3770, N_A
= 256 million, N_B
= 64 million, this loop takes only .62 seconds, which corresponds to a memory access latency of about 9 nanoseconds.
As this latency is too small, I was wondering if the hardware prefetcher is playing a role. Can someone offer an explanation?
The CPU charges ahead in the instruction stream and will juggle multiple outstanding loads at once. The stream looks like this:
load b[0]
load a[b[0]]
add
loop code
load b[1]
load a[b[1]]
add
loop code
load b[1]
load a[b[1]]
add
loop code
...
The iterations are only serialized by the loop code, which runs quickly. All loads can run concurrently. Concurrency is just limited by how many loads the CPU can handle.
I suspect you wanted to benchmark random, unpredictable, serialized memory loads. This is actually pretty hard on a modern CPU. Try to introduce an unbreakable dependency chain:
int lastLoad = 0;
for(i = 0; i < N_B; i++) {
var load = A[B[i] + (lastLoad & 1)]; //be sure to make A one element bigger
sum += load;
lastLoad = load;
}
This requires the last load to be executed until the address of the next load can be computed.