Tags: c++, performance, gcc

For loop performance difference, and compiler optimization



I chose David's answer because his was the only one to explain the performance difference between the for-loops when no optimization flags are set. The other answers demonstrate what happens when the optimization flags are turned on.


Jerry Coffin's answer explained what happens when the optimization flags are set for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, even though B performs one extra memory reference and one addition per iteration. Nemo's post shows the assembler output. I confirmed this by compiling with the -S flag on my PC, a 2.9 GHz Sandy Bridge (i5-2310) running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.


I was experimenting with for-loops performance when I stumbled upon the following case.

I have the following code that does the same computation in two different ways.

#include <cstdint>
#include <chrono>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

int main()
{
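    // Expected result: the sum of i for i in [111, 1000000110],
    // i.e. 1,000,000,000 * (111 + 1000000110) / 2.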
    const uint64_t answer = 500000110500000000;

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}
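
As an aside, the elapsed-time expression above can be written more compactly with std::chrono::duration_cast. The helper below is just an equivalent sketch for reference, not part of the measured code:

// Requires <chrono>, which is already included above.
static double seconds_between(
    std::chrono::time_point<std::chrono::high_resolution_clock> t0,
    std::chrono::time_point<std::chrono::high_resolution_clock> t1)
{
    // Convert the tick difference directly to a double-precision duration in seconds.
    return std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
}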

Compiling this code with

g++ main.cpp -o output --std=c++11

leads to the following result:

=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

My first question is: why does the second loop take about 23% less time than the first?

On the other hand, if I compile the code with

g++ main.cpp -o output --std=c++11 -O1

The results improve a lot,

=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

and the difference in time almost disappears.

But I could not believe my eyes when I set the -O2 flag,

g++ main.cpp -o output --std=c++11 -O2

and got this:

=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

So, my second question is: What is the compiler doing when I set the -O1 and -O2 flags that leads to this gigantic performance improvement?

I checked Optimize Options - Using the GNU Compiler Collection (GCC), but that did not clarify things.


By the way, I am compiling this code with g++ (GCC) 4.9.1.


EDIT to confirm Basile Starynkevitch's assumption

I edited the code; main now looks like this:

int main(int argc, char **argv)
{
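    // The loop bounds now come from the command line (atoi requires <cstdlib>).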
    int start = atoi(argv[1]);
    int end   = atoi(argv[2]);
    int delta = end - start + 1;

    std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
    double elapsed;

    std::printf("=====================================================\n");

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationB(start, delta);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationA(start, end);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
    std::printf("=====================================================\n");

    return 0;
}

These modifications substantially increased the computation time, both for -O1 and -O2; both now give me around 620 ms. This confirms that -O2 was really doing some of the computation at compile time.
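
To illustrate what "computation at compile time" can mean here: in the original version both loop bounds were compile-time constants, so the loop's result is just an arithmetic-series sum that the compiler can fold to a constant. Below is only a sketch of that equivalent closed form, not necessarily the exact transformation GCC performs:

#include <cstdint>
#include <cstdio>

int main()
{
    // Same range the original loops cover: i in [111, 1000000110].
    const std::uint64_t first = 111;
    const std::uint64_t last  = 1000000110;
    const std::uint64_t n     = last - first + 1;        // 1,000,000,000 terms
    const std::uint64_t sum   = n * (first + last) / 2;  // arithmetic-series formula
    std::printf("%llu\n", static_cast<unsigned long long>(sum));  // prints 500000110500000000
    return 0;
}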

I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320 ms.

Also notice that I have changed the order in which functions A and B are called, to test Jerry Coffin's assumption. Compiling this code with no optimization flags still gives me around 2.2 s for B and 2.8 s for A, so I figure it is not a cache effect. To reinforce: I am not talking about optimization in the first case (the one with no flags); I just want to know what makes the second loop run faster than the first.
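
For completeness, the generated assembly at each optimization level can be inspected with the -S flag (the approach Matteo Italia suggested; the output file names are just examples):

g++ main.cpp --std=c++11 -S -O0 -o main_O0.s
g++ main.cpp --std=c++11 -S -O2 -o main_O2.s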


Solution

  • EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.


    It appears that the performance difference in the -O0 case is due to processor pipelining.

    First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:

    superCalculationA(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)    # init
        movl    %esi, -24(%rbp)    # end
        movq    $0, -8(%rbp)       # total = 0
        movl    -20(%rbp), %eax    # copy init to register rax
        movl    %eax, -12(%rbp)    # i = [rax]
        jmp .L7
    .L8:
        movl    -12(%rbp), %eax    # copy i to register rax
        cltq
        addq    %rax, -8(%rbp)     # total += [rax]
        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -12(%rbp), %eax    # copy i to register rax
        cmpl    -24(%rbp), %eax    # [rax] < end
        jl  .L8
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
    
    superCalculationB(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)    # init
        movl    %esi, -24(%rbp)    # todo
        movq    $0, -8(%rbp)       # total = 0
        movl    -20(%rbp), %eax    # copy init to register rax
        movl    %eax, -12(%rbp)    # i = [rax]
        jmp .L11
    .L12:
        movl    -12(%rbp), %eax    # copy i to register rax
        cltq
        addq    %rax, -8(%rbp)     # total += [rax]
        addl    $1, -12(%rbp)      # i++
    .L11:
        movl    -20(%rbp), %edx    # copy init to register rdx
        movl    -24(%rbp), %eax    # copy todo to register rax
        addl    %edx, %eax         # [rax] += [rdx]  (so [rax] = init+todo)
        cmpl    -12(%rbp), %eax    # i < [rax]
        jg  .L12
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
    

    In both functions, the stack layout looks like this:

    Offset (below %rbp)   Content
    
    24   end/todo
    20   init
    16   <empty>
    12   i
    08   total
    04   
    00   <base pointer>
    

    (Note that total is a 64-bit int and so occupies two 4-byte slots.)

    These are the key lines of superCalculationA():

        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -12(%rbp), %eax    # copy i to register rax
        cmpl    -24(%rbp), %eax    # [rax] < end
    

    The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().

    You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.

    To demonstrate that this is just an artifact, let's replace

    for (int i = init; i < end; i++)
    

    with

    for (int i = init; end > i; i++)
    

    in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:

        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -24(%rbp), %eax    # copy end to register rax
        cmpl    -12(%rbp), %eax    # i < [rax]
    

    Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:

    =====================================================
    Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
    Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
    The first method, i.e. superCalculationA, succeeded.
    The second method, i.e. superCalculationB, succeeded.
    =====================================================
    

    It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.