Tags: c++, performance, optimization, static, stack

Why does accessing a global static variable improve performance compared to stack variables?


I was trying to understand the performance of global static variables and came across a very weird scenario. The code below takes about 525 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i + s_Data;
    }

    return 0;
}

and this code below takes about 1050 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i;
    }

    return 0;
}

I am aware that accessing static variables is fast and writing to them is slow, based on my other tests, but I am not sure what piece of information I am missing in the scenario above. Note: compiler optimizations were turned off, and the MSVC compiler was used to perform the tests.
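
(A minimal timing harness along these lines, using std::chrono, is enough to reproduce this kind of comparison; it is a sketch for illustration, not the exact code behind the numbers above.)

#include <chrono>
#include <iostream>

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i + s_Data;   // swap in `x += i;` for the second test
    }

    auto end = std::chrono::high_resolution_clock::now();

    // Printing x also keeps the loop from being optimized away if optimizations are enabled
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms (x = " << x << ")\n";

    return 0;
}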


Solution

  • To address the actual question: with optimizations turned off, we can turn to the generated assembly to get an idea of why one version runs more quickly than the other.

    In the first test, GCC (trunk) https://godbolt.org/z/GdssT9vME produces this assembly:

    s_Data:
            .quad   1
    main:
            push    rbp
            mov     rbp, rsp
            mov     QWORD PTR [rbp-8], 0
            mov     DWORD PTR [rbp-12], 0
            jmp     .L2
    .L3:
            mov     eax, DWORD PTR [rbp-12]         # load i from the stack
            movsx   rdx, eax                        # sign-extend i to 64 bits
            mov     rax, QWORD PTR s_Data[rip]      # load s_Data from memory
            add     rax, rdx                        # s_Data + i
            add     QWORD PTR [rbp-8], rax          # x += s_Data + i (read-modify-write of x on the stack)
            add     DWORD PTR [rbp-12], 1           # i++
    .L2:
            cmp     DWORD PTR [rbp-12], 999999999
            jle     .L3
            mov     eax, 0
            pop     rbp
            ret
    

    For the second test https://godbolt.org/z/5ndnEv5Ts we get:

    main:
            push    rbp
            mov     rbp, rsp
            mov     QWORD PTR [rbp-8], 0
            mov     DWORD PTR [rbp-12], 0
            jmp     .L2
    .L3:
            mov     eax, DWORD PTR [rbp-12]         # load i from the stack
            cdqe                                    # sign-extend i to 64 bits (eax -> rax)
            add     QWORD PTR [rbp-8], rax          # x += i (read-modify-write of x on the stack)
            add     DWORD PTR [rbp-12], 1           # i++
    .L2:
            cmp     DWORD PTR [rbp-12], 999999999
            jle     .L3
            mov     eax, 0
            pop     rbp
            ret
    

    Comparing these two programs, the first is sixteen instructions, while the second is only fourteen. (As you can probably guess, different instructions also have different CPU cycle costs, so fewer instructions does not automatically mean a faster loop.)
    See: How many CPU cycles are needed for each assembly instruction?
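
    As a rough, self-contained illustration of that point (my own example, not from the benchmark above): two loops with the same number of iterations and a similar instruction count can still differ widely in runtime when one of them uses a much more expensive instruction, such as 64-bit integer division:

    #include <chrono>
    #include <iostream>

    int main()
    {
        // volatile forces real loads/stores each iteration and keeps the
        // division from being strength-reduced or hoisted by the optimizer
        volatile unsigned long long a = 0;
        volatile unsigned long long b = 1'000'000'000ULL;
        volatile unsigned long long d = 3;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 100'000'000; i++)
            a = a + i;                          // one cheap integer add per iteration
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < 100'000'000; i++)
            b = b / d + 1'000'000'000ULL;       // one expensive 64-bit divide per iteration
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::cout << "add loop: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms, "
                  << "div loop: " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
        return 0;
    }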

    As noted in my comment, optimizations vastly change the generated assembly: since x is never read, both loops have no observable effect and can be eliminated entirely.
    With -O2, both tests produce this:

    main:
            xor     eax, eax
            ret
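
    If you want the work to survive optimization, the result has to be observable. A small variation (my own sketch, not part of the original question) is to print x:

    #include <iostream>

    static unsigned long long s_Data = 1;

    int main()
    {
        unsigned long long x = 0;

        for (int i = 0; i < 1'000'000'000; i++)
        {
            x += i + s_Data;
        }

        std::cout << x << '\n';   // x is now observable, so the computation cannot simply be discarded
        return 0;
    }

    With the result used, the compiler can no longer throw the loop away outright; depending on the compiler and options it may keep the loop, vectorize it, or fold it into a closed-form constant, but the empty function body shown above is gone.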