I wrote a program to measure the cache and cache line size of my computer, but I got a result that I can't explain. Could anyone help me explain it?
Here is my program. access_array() traverses the array with different step sizes, and I measure the execution time for each step size.
// Program to calculate the L1 cache line size; compile with g++ -O1
#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;

#define ARRAY_SIZE (256 * 1024) // arbitrary array size; must be a power of 2 so the bitmask below works as a modulo

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = (idx + steps) & (ARRAY_SIZE - 1); // with % instead, the latency is too high to see the gap
    }
}
int main(int argc, char** argv)
{
    double cpu_us_used;
    struct timeval start, end;

    for (int step = 1; step <= ARRAY_SIZE; step *= 2)
    {
        char* arr = new char[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            arr[i] = 0;
        }

        gettimeofday(&start, NULL); // get start clock
        access_array(arr, step);
        gettimeofday(&end, NULL);   // get end clock

        cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        cout << step << " , " << cpu_us_used << endl;

        delete[] arr;
    }
    return 0;
}
Result
My question is:
From step = 64 to 512, I can't explain why the execution time is almost the same, and why there is linear growth from 1K to 4K.
Here are my assumptions.
For step = 1, every 64 iterations cause 1 cache line miss. After 32K iterations, the L1 cache is full, so we have an L1 collision & capacity miss every 64 iterations.
For step = 64, every iteration causes 1 cache line miss. After 512 iterations, the L1 cache is full, so we have an L1 collision & capacity miss every iteration.
As a result, there is a gap between step = 32 and step = 64.
By observing the first gap, I conclude that the L1 cache line size is 64 bytes.
For step = 512, every iteration causes 1 cache line miss. After 64 iterations, sets 0, 8, 16, 24, 32, 40, 48, 56 of the L1 cache are full, so we have an L1 collision miss every iteration.
For step = 4K, every iteration causes 1 cache line miss. After 8 iterations, set 0 of the L1 cache is full, so we have an L1 collision miss every iteration.
For the cases from 128 to 4K, all of them suffer L1 collision misses; the difference is that with a larger step, the collision misses start earlier.
The only idea I can come up with is that there are other mechanisms (maybe paging, the TLB, etc.) that impact the execution time.
Here is the cache size & CPU info of my workstation. By the way, I have run this program on my PC as well and got similar results.
Platform : Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 15728640
LEVEL3_CACHE_ASSOC 20
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
Finally, I found the answer.
I tried setting the array size to 16 KiB and smaller, but it was still slow at step = 4 KiB.
On the other hand, I changed the outer loop so that the step increases by 1 each iteration instead of doubling, and it still slowed down at step = 4 KiB.
Code
#define ARRAY_SIZE (4200)

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = idx + steps;
        if (idx >= ARRAY_SIZE)
            idx = 0;
    }
}
for (int step = 4090; step <= 4100; step++)
{
    char* arr = new char[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        arr[i] = 0;
    }

    gettimeofday(&start, NULL); // get start clock
    access_array(arr, step);
    gettimeofday(&end, NULL);   // get end clock

    cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    cout << step << " , " << cpu_us_used << endl;

    delete[] arr;
}
Result
4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , **51818**
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149
As a result, I suspected it was not related to any cache / TLB / prefetch mechanism.
After more googling on the relationship between performance and the magic number 4K, I found the 4K aliasing problem on Intel platforms, which slows down loads.
This occurs when a load is issued shortly after a store and their memory addresses are offset by a multiple of 4 KiB (i.e., their low 12 address bits match). When the load is processed in the pipeline, its address partially matches the earlier store (the full address is not available at this point), so the pipeline tries to forward the store's result and avoid doing the load (store forwarding). Later, when the load's address is fully resolved, it no longer matches the store, so the load has to be re-issued from a later point in the pipeline. This has a 5-cycle penalty in the normal case, but can be worse in certain situations, such as unaligned loads that span two cache lines.