Tags: performance, memory-management, benchmarking, cpu-architecture, cpu-cache

Understand a microbenchmark for Cache/RAM access latency


In this picture: [plot of memory access time versus stride, one curve per array size]

I don't really understand this plot. It shows the cost of reading and writing arrays of different sizes with different strides; each color is a different array size. I understand why the time increases with the stride, but I don't know why it decreases again. For example, for L (the array length) = 64MB, why does the time go down again after stride = 256K?

The code is described in this paper: www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps

Thank you.


Solution

  • The paper you posted uses microbenchmarks to work out the details of the Cray T3D's memory system. It reaches the following conclusions:

    • The cache line / block size is 32B
    • The cache is direct mapped
    • There is no L2 cache (just an L1)
    • The page size is probably 8KB.

    The next interesting remark is that they surround the innermost loop of the experiment with another loop that repeats the measurement. This repeat loop is what enables caching.

    Their code is the following:

    for (arraySize = 4KB; arraySize < 8MB; arraySize *= 2)
       for (stride = 1; stride <= arraySize / 2; stride *= 2)
          for (i = 0; i < arraySize; i += stride)
             for (k = 0; k < repeats; k++)
                MEMORY_OPERATION_ON(A[i])
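
    For concreteness, here is a minimal, runnable C sketch of that loop nest. It is my own reconstruction, not the paper's code: the byte-sized elements, the increment as the memory operation, the repeats value and the clock()-based timing are all assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t maxSize = 8 * 1024 * 1024;  /* 8MB, as in the pseudocode */
        const int repeats = 100;                 /* assumed repeat count */
        volatile char *A = malloc(maxSize);      /* volatile keeps the accesses alive under -O2 */
        if (!A) return 1;

        for (size_t arraySize = 4 * 1024; arraySize < maxSize; arraySize *= 2) {
            for (size_t stride = 1; stride <= arraySize / 2; stride *= 2) {
                clock_t t0 = clock();
                for (size_t i = 0; i < arraySize; i += stride)
                    for (int k = 0; k < repeats; k++)
                        A[i]++;                  /* the MEMORY_OPERATION_ON(A[i]) */
                clock_t t1 = clock();
                size_t accesses = (arraySize / stride) * repeats;
                printf("size=%8zu stride=%8zu ns/access=%.2f\n",
                       arraySize, stride,
                       1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / accesses);
            }
        }
        free((void *)A);
        return 0;
    }

    Plotting ns/access against stride, one curve per arraySize, should reproduce the general shape of the plot in the question.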
    

    Let's take arraySize = 4MB and stride = 1MB. You will access A[0], A[1M], A[2M] and A[3M], each of them repeats times. That's only 4 addresses, which are cached easily.

    So my theory is that for larger strides the number of distinct addresses being read / written / updated is smaller. This leads to two effects (see the sketch after this list):

    • You have fewer cache misses because you keep reusing the data stored at a few addresses
    • You have fewer TLB misses because you touch fewer pages
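
    To make both effects concrete, here is a small sketch that counts, for the questioner's 64MB curve, how many distinct addresses and distinct pages each stride touches, assuming the 8KB page size inferred above:

    #include <stdio.h>

    #define PAGE_SIZE (8 * 1024)   /* page size inferred by the paper */

    int main(void)
    {
        size_t arraySize = 64 * 1024 * 1024;   /* the 64MB curve */
        for (size_t stride = 1024; stride <= arraySize / 2; stride *= 2) {
            size_t addresses = arraySize / stride;
            /* Below one page per step every page is visited; above it,
               each access lands on its own page. */
            size_t pages = stride >= PAGE_SIZE ? addresses
                                               : arraySize / PAGE_SIZE;
            printf("stride=%9zu addresses=%7zu pages=%6zu\n",
                   stride, addresses, pages);
        }
        return 0;
    }

    Past stride = 256K the loop touches only 256 distinct addresses on 256 distinct pages, a footprint small enough for the cache and TLB to absorb, which lines up with where the 64MB curve turns back down.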

    I think this is what explains the latency drops at larger strides.

    For small strides the latency is low because consecutive accesses often fall in the same cache block, and prefetching is also more effective.
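
    As a rough illustration, with the 32B block size found above, a stride of s bytes means a new cache line is fetched only once every 32/s accesses (for s up to the line size), so the per-access miss cost grows with the stride:

    #include <stdio.h>

    #define LINE_SIZE 32   /* cache block size found by the paper */

    int main(void)
    {
        /* One cold miss per line; the remaining accesses to it are hits. */
        for (int stride = 1; stride <= LINE_SIZE; stride *= 2)
            printf("stride=%2d -> 1 miss per %2d accesses\n",
                   stride, LINE_SIZE / stride);
        return 0;
    }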

    For medium strides you touch enough distinct addresses to make caching ineffective, and because of the stride size you often take TLB misses, since you jump across many pages. This is what produces the higher latency in the middle of the curve.
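
    The middle of the curve can also be read as a TLB-reach problem: a TLB covers entries * pageSize bytes at a time. The entry count below is a hypothetical placeholder (the paper does not report one for the T3D), but it shows the shape of the comparison:

    #include <stdio.h>

    #define PAGE_SIZE   (8 * 1024)   /* page size inferred by the paper */
    #define TLB_ENTRIES 64           /* hypothetical count, for illustration */

    int main(void)
    {
        size_t reach = (size_t)TLB_ENTRIES * PAGE_SIZE;
        printf("TLB reach = %zuKB\n", reach / 1024);   /* 512KB here */

        /* A 64MB array walked with a 64KB stride touches 1024 distinct
           pages, far more than the TLB holds, so page walks dominate. */
        size_t arraySize = 64 * 1024 * 1024, stride = 64 * 1024;
        printf("pages touched = %zu\n", arraySize / stride);
        return 0;
    }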

    My opinion is that this microbenchmark should have been run on a fixed array size (8, 32 or 64MB), whatever the stride, large enough to reduce caching effects. The larger the stride, the higher the chance of hitting the same few addresses over and over.