Couldn't Reproduce non-friendly C++ cache code

I'm playing with some chunks of code that intentionally tries to demonstrate ideas about CPU Cache, and how to get benefit of aligned data and so on.

From this article http://igoro.com/archive/gallery-of-processor-cache-effects/ which I find very interesting I'm trying to reproduce the behavior of these two loops (example 1):

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

Note the factor 16 in the second loop, as ints are 4 bytes long in my machine ( Intel i5), every element in the second loop will be in a different cache line, so theoretically and according to the author, these two loops will take almost the same time to execute, because the speed here is governed by the memory access, not by the integer multiplication instruction.

So I've tried to reproduce this behavior, with two different compilers, MSVC 19.31.31105.0 and GNU g++ 9.4.0 and the result is the the same on both cases, the second loop takes about 4% of the time with respect to the first loop. Here is the implementation I've used to reproduce it

#include <cstdlib>
#include <chrono>
#include <vector>
#include <string>

#define SIZE 64 * 1024 * 1024
#define K 16

//Comment this to use C array instead of std vector
//#define USE_C_ARRAY 

int main() {

#if defined(USE_C_ARRAY)
    int* arr = new int[SIZE];
    std::cout << "Testing with C style array\n";
#else 
    std::vector<int> arr(SIZE);
    std::cout << "Using std::vector\n";
#endif

    std::cout << "Step:           " << K << " Elements \n";
    std::cout << "Array Elements: " << SIZE << "\n";
    std::cout << "Array Size:     " << SIZE/(1024*1024) << " MBs \n";
    std::cout << "Int Size:       " << sizeof(int) << "\n";

    auto show_duration = [&](auto title, auto duration){ 
        std::cout << "[" << title << "] Duration: " << 
            std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
            " (milisecs) \n";
    };

    auto start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i++) arr[i]++;
    show_duration("FullArray", 
                (std::chrono::steady_clock::now() - start));


    start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i += K) arr[i]++;

    show_duration(std::string("Array Step ") + std::to_string(K), 
                (std::chrono::steady_clock::now() - start));

#if defined(USE_C_ARRAY)
    delete[] arr;
#endif
    return EXIT_SUCCESS;
}

To compile it with g++:

 g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3

Note 1: I'm using -g3 to explicitly show that I'm avoiding any compiler optimization just to see the raw behavior.

Note 2: The difference in the assembly code generated for the two loops is the second argument of the loop counter add function, which adds 1 or K to i

add     DWORD PTR [rbp-4], 16 # When using a 16 Step

Link to test in in Compiler Explorer: https://godbolt.org/z/eqWdce35z

It will be great if you have any suggestions or ideas about what I might be missing, thank you very much in advance!

Solution

Your mistake is not using any optimization. The overhead of the non-optimized code masks any cache effects.

When I don't use any optimization, I get the following:

$ g++ -O0 cache.cpp -o cache
$ ./cache 
Using std::vector
Step:           16 Elements 
Array Elements: 67108864
Array Size:     64 MBs 
Int Size:       4
[FullArray] Duration: 105 (milisecs) 
[Array Step 16] Duration: 23 (milisecs)

If I use an optimization, even the most conservative optimization flag, -O1, suddenly the result changes:

$ g++ -O1 cache.cpp -o cache
$ ./cache 
Using std::vector
Step:           16 Elements 
Array Elements: 67108864
Array Size:     64 MBs 
Int Size:       4
[FullArray] Duration: 26 (milisecs) 
[Array Step 16] Duration: 22 (milisecs)

(you can see that run time of step 16 has not changes much, because it was already bottlenecked by memory access, while run time of step 1 access improved dramatically, since it was limited by executing instructions, not memory bandwidth)

Further raising optimization levels did not change results for me.

You can compare the inner loops for different optimization levels. With -O0:

vector_step_1(std::vector<int, std::allocator<int> >&):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     QWORD PTR [rbp-24], rdi
        mov     DWORD PTR [rbp-4], 0
.L7:
        cmp     DWORD PTR [rbp-4], 67108863
        jg      .L8
        mov     eax, DWORD PTR [rbp-4]
        movsx   rdx, eax
        mov     rax, QWORD PTR [rbp-24]
        mov     rsi, rdx
        mov     rdi, rax
        call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
        mov     rdx, rax
        mov     ecx, DWORD PTR [rdx]
        mov     eax, ecx
        add     eax, eax
        add     eax, ecx
        mov     DWORD PTR [rdx], eax
        add     DWORD PTR [rbp-4], 1
        jmp     .L7
.L8:
        nop
        leave
        ret

You can see it makes a whole function call on every iteration.

With -O1:

vector_step_1(std::vector<int, std::allocator<int> >&):
        mov     eax, 0
.L5:
        mov     rdx, rax
        add     rdx, QWORD PTR [rdi]
        mov     ecx, DWORD PTR [rdx]
        lea     ecx, [rcx+rcx*2]
        mov     DWORD PTR [rdx], ecx
        add     rax, 4
        cmp     rax, 268435456
        jne     .L5
        ret

Now, there are just 8 instructions in the loop. At -O2 it becomes 6 instructions per loop.