I'm playing with some chunks of code that intentionally tries to demonstrate ideas about CPU Cache, and how to get benefit of aligned data and so on.
From this article http://igoro.com/archive/gallery-of-processor-cache-effects/ which I find very interesting I'm trying to reproduce the behavior of these two loops (example 1):
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
Note the factor 16 in the second loop, as ints are 4 bytes long in my machine ( Intel i5), every element in the second loop will be in a different cache line, so theoretically and according to the author, these two loops will take almost the same time to execute, because the speed here is governed by the memory access, not by the integer multiplication instruction.
So I've tried to reproduce this behavior, with two different compilers, MSVC 19.31.31105.0 and GNU g++ 9.4.0 and the result is the the same on both cases, the second loop takes about 4% of the time with respect to the first loop. Here is the implementation I've used to reproduce it
#include <cstdlib>
#include <chrono>
#include <vector>
#include <string>
#define SIZE 64 * 1024 * 1024
#define K 16
//Comment this to use C array instead of std vector
//#define USE_C_ARRAY
int main() {
#if defined(USE_C_ARRAY)
int* arr = new int[SIZE];
std::cout << "Testing with C style array\n";
#else
std::vector<int> arr(SIZE);
std::cout << "Using std::vector\n";
#endif
std::cout << "Step: " << K << " Elements \n";
std::cout << "Array Elements: " << SIZE << "\n";
std::cout << "Array Size: " << SIZE/(1024*1024) << " MBs \n";
std::cout << "Int Size: " << sizeof(int) << "\n";
auto show_duration = [&](auto title, auto duration){
std::cout << "[" << title << "] Duration: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
" (milisecs) \n";
};
auto start = std::chrono::steady_clock::now();
// Loop 1
for (int i = 0; i < SIZE; i++) arr[i]++;
show_duration("FullArray",
(std::chrono::steady_clock::now() - start));
start = std::chrono::steady_clock::now();
// Loop 1
for (int i = 0; i < SIZE; i += K) arr[i]++;
show_duration(std::string("Array Step ") + std::to_string(K),
(std::chrono::steady_clock::now() - start));
#if defined(USE_C_ARRAY)
delete[] arr;
#endif
return EXIT_SUCCESS;
}
To compile it with g++:
g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3
Note 1: I'm using -g3 to explicitly show that I'm avoiding any compiler optimization just to see the raw behavior.
Note 2: The difference in the assembly code generated for the two loops is the second argument of the loop counter add function, which adds 1 or K to i
add DWORD PTR [rbp-4], 16 # When using a 16 Step
Link to test in in Compiler Explorer: https://godbolt.org/z/eqWdce35z
It will be great if you have any suggestions or ideas about what I might be missing, thank you very much in advance!
Your mistake is not using any optimization. The overhead of the non-optimized code masks any cache effects.
When I don't use any optimization, I get the following:
$ g++ -O0 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 105 (milisecs)
[Array Step 16] Duration: 23 (milisecs)
If I use an optimization, even the most conservative optimization flag, -O1
, suddenly the result changes:
$ g++ -O1 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 26 (milisecs)
[Array Step 16] Duration: 22 (milisecs)
(you can see that run time of step 16 has not changes much, because it was already bottlenecked by memory access, while run time of step 1 access improved dramatically, since it was limited by executing instructions, not memory bandwidth)
Further raising optimization levels did not change results for me.
You can compare the inner loops for different optimization levels. With -O0
:
vector_step_1(std::vector<int, std::allocator<int> >&):
push rbp
mov rbp, rsp
sub rsp, 32
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-4], 0
.L7:
cmp DWORD PTR [rbp-4], 67108863
jg .L8
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
mov rax, QWORD PTR [rbp-24]
mov rsi, rdx
mov rdi, rax
call std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov rdx, rax
mov ecx, DWORD PTR [rdx]
mov eax, ecx
add eax, eax
add eax, ecx
mov DWORD PTR [rdx], eax
add DWORD PTR [rbp-4], 1
jmp .L7
.L8:
nop
leave
ret
You can see it makes a whole function call on every iteration.
With -O1
:
vector_step_1(std::vector<int, std::allocator<int> >&):
mov eax, 0
.L5:
mov rdx, rax
add rdx, QWORD PTR [rdi]
mov ecx, DWORD PTR [rdx]
lea ecx, [rcx+rcx*2]
mov DWORD PTR [rdx], ecx
add rax, 4
cmp rax, 268435456
jne .L5
ret
Now, there are just 8 instructions in the loop. At -O2
it becomes 6 instructions per loop.