Tags: c++, cpu-cache

Couldn't reproduce cache-unfriendly C++ code behavior


I'm playing with some chunks of code that intentionally try to demonstrate CPU cache effects, how to benefit from aligned data, and so on.

From this article, http://igoro.com/archive/gallery-of-processor-cache-effects/, which I find very interesting, I'm trying to reproduce the behavior of these two loops (Example 1):

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

Note the step of 16 in the second loop: since ints are 4 bytes on my machine (Intel i5), every element touched by the second loop lies on a different cache line. So, theoretically and according to the author, these two loops should take almost the same time to execute, because the speed here is governed by memory access, not by the integer multiplication instruction.
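
For reference, the step of 16 assumes a 64-byte cache line (16 × 4-byte ints), which is what an i5 should have. If you prefer to verify the line size rather than assume it, a minimal sketch along these lines can report it; it assumes Linux/glibc for sysconf, and the C++17 interference-size constant is only defined by newer standard libraries:

#include <iostream>
#include <new>       // std::hardware_destructive_interference_size (C++17, if available)
#include <unistd.h>  // sysconf, _SC_LEVEL1_DCACHE_LINESIZE (glibc)

int main() {
#ifdef __cpp_lib_hardware_interference_size
    // Offset the standard library recommends between objects to avoid false
    // sharing; on mainstream x86 it usually equals the cache line size.
    std::cout << "interference size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
    // glibc reports the L1 data cache line size directly.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::cout << "L1d line size:     " << line << " bytes\n";
    std::cout << "ints per line:     " << line / static_cast<long>(sizeof(int)) << "\n";
}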

So I've tried to reproduce this behavior with two different compilers, MSVC 19.31.31105.0 and GNU g++ 9.4.0, and the result is the same in both cases: the second loop takes only about 4% of the time of the first loop. Here is the implementation I've used to reproduce it:

#include <cstdlib>
#include <chrono>
#include <iostream>
#include <vector>
#include <string>

#define SIZE (64 * 1024 * 1024)
#define K 16

// Uncomment this to use a C-style array instead of std::vector
//#define USE_C_ARRAY

int main() {

#if defined(USE_C_ARRAY)
    int* arr = new int[SIZE];
    std::cout << "Testing with C style array\n";
#else 
    std::vector<int> arr(SIZE);
    std::cout << "Using std::vector\n";
#endif

    std::cout << "Step:           " << K << " Elements \n";
    std::cout << "Array Elements: " << SIZE << "\n";
    std::cout << "Array Size:     " << SIZE/(1024*1024) << " MBs \n";
    std::cout << "Int Size:       " << sizeof(int) << "\n";

    auto show_duration = [&](auto title, auto duration){ 
        std::cout << "[" << title << "] Duration: " << 
            std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
            " (milisecs) \n";
    };

    auto start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i++) arr[i]++;
    show_duration("FullArray", 
                (std::chrono::steady_clock::now() - start));


    start = std::chrono::steady_clock::now();
    // Loop 2
    for (int i = 0; i < SIZE; i += K) arr[i]++;

    show_duration(std::string("Array Step ") + std::to_string(K), 
                (std::chrono::steady_clock::now() - start));

#if defined(USE_C_ARRAY)
    delete[] arr;
#endif
    return EXIT_SUCCESS;
}

To compile it with g++:

 g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3                                                             

Note 1: I'm compiling with -g3 and no optimization flag, to make explicit that I'm avoiding any compiler optimization and just looking at the raw behavior.

Note 2: The only difference in the assembly code generated for the two loops is the second operand of the add instruction that increments the loop counter, adding either 1 or K to i:

add     DWORD PTR [rbp-4], 16 # When using a 16 Step 

Link to the test in Compiler Explorer: https://godbolt.org/z/eqWdce35z

It would be great if you have any suggestions or ideas about what I might be missing. Thank you very much in advance!


Solution

  • Your mistake is not using any optimization. The overhead of the non-optimized code masks any cache effects.

    When I don't use any optimization, I get the following:

    $ g++ -O0 cache.cpp -o cache
    $ ./cache 
    Using std::vector
    Step:           16 Elements 
    Array Elements: 67108864
    Array Size:     64 MBs 
    Int Size:       4
    [FullArray] Duration: 105 (milisecs) 
    [Array Step 16] Duration: 23 (milisecs) 
    

    If I enable optimization, even with the most conservative optimization flag, -O1, the result suddenly changes:

    $ g++ -O1 cache.cpp -o cache
    $ ./cache 
    Using std::vector
    Step:           16 Elements 
    Array Elements: 67108864
    Array Size:     64 MBs 
    Int Size:       4
    [FullArray] Duration: 26 (milisecs) 
    [Array Step 16] Duration: 22 (milisecs)
    

    (You can see that the run time of the step-16 loop has not changed much, because it was already bottlenecked by memory access, while the run time of the step-1 loop improved dramatically, since it was limited by instruction execution, not by memory bandwidth.)
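
    (As a rough sanity check of the memory-bandwidth claim: the full-array loop reads 64 MiB and writes it back, so roughly 128 MiB of traffic. Doing that in about 26 ms works out to roughly 5 GB/s, a plausible single-core streaming rate, whereas the same traffic in 105 ms would be only about 1.2 GB/s, nowhere near saturating memory.)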

    Raising the optimization level further did not change the results for me.
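
    If you raise the optimization level even further and worry that the compiler might drop the stores altogether (the array is never read after the loops), one common trick is to feed the data through an empty inline-asm statement so the optimizer has to treat it as used. A minimal sketch of such a helper (not part of the original code; GCC/Clang only, MSVC needs a different mechanism):

    // Hypothetical helper: the empty asm statement may read `value` and
    // clobbers memory, so the compiler cannot discard the preceding stores.
    template <class T>
    inline void do_not_optimize(T const& value) {
        asm volatile("" : : "r,m"(value) : "memory");
    }

    // Usage after each timed loop, for example:
    //     do_not_optimize(arr[0]);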

    You can compare the inner loops for different optimization levels. With -O0:

    vector_step_1(std::vector<int, std::allocator<int> >&):
            push    rbp
            mov     rbp, rsp
            sub     rsp, 32
            mov     QWORD PTR [rbp-24], rdi
            mov     DWORD PTR [rbp-4], 0
    .L7:
            cmp     DWORD PTR [rbp-4], 67108863
            jg      .L8
            mov     eax, DWORD PTR [rbp-4]
            movsx   rdx, eax
            mov     rax, QWORD PTR [rbp-24]
            mov     rsi, rdx
            mov     rdi, rax
            call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
            mov     rdx, rax
            mov     ecx, DWORD PTR [rdx]
            mov     eax, ecx
            add     eax, eax
            add     eax, ecx
            mov     DWORD PTR [rdx], eax
            add     DWORD PTR [rbp-4], 1
            jmp     .L7
    .L8:
            nop
            leave
            ret
    

    You can see that it makes a whole function call (std::vector::operator[]) on every iteration; see the raw-pointer sketch at the end of this answer for a way to avoid that call even at -O0.

    With -O1:

    vector_step_1(std::vector<int, std::allocator<int> >&):
            mov     eax, 0
    .L5:
            mov     rdx, rax
            add     rdx, QWORD PTR [rdi]
            mov     ecx, DWORD PTR [rdx]
            lea     ecx, [rcx+rcx*2]
            mov     DWORD PTR [rdx], ecx
            add     rax, 4
            cmp     rax, 268435456
            jne     .L5
            ret
    

    Now there are just 8 instructions in the loop body. At -O2 it drops to 6 instructions per iteration.
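
    As an aside, if you did want to keep measuring at -O0, you could at least avoid the per-element operator[] call by indexing through the raw buffer, as in the sketch below (not part of the original code); enabling optimization is still the simpler fix, though.

    // Hypothetical variant of the question's first loop: indexing the raw
    // buffer directly, so no std::vector::operator[] call is emitted even
    // without inlining.
    int* p = arr.data();
    for (int i = 0; i < SIZE; i++) p[i]++;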