c++ performance assembly gcc cpu-architecture

Significant speed difference between two seemingly-equivalent methods of calculating prefix sums with GCC for x86-64


I tried two almost identical methods of calculating prefix sums, and found that they compile to significantly different code. The compilation option is -O2.

The difference in the generated code leads to roughly a 4x difference in running time.

The first:

#include <numeric>
#include <algorithm>

int main() {
    unsigned a[5000];
    std::iota(a, a + 5000, 0);
    for (int k = 0; k < 1'000'000; k++)
        for (int i = 1; i < 5000; i++)
            a[i] += a[i - 1];
    return *std::min_element(a, a + 5000);
}

The second:

#include <numeric>
#include <algorithm>

int main() {
    unsigned a[5000];
    std::iota(a, a + 5000, 0);
    for (int k = 0; k < 1'000'000; k++)
        for (int i = 0; i + 1 < 5000; i++)
            a[i + 1] += a[i];
    return *std::min_element(a, a + 5000);
}

In Compiler Explorer

What could be the reason for this discrepancy in the compiler's output?


Solution

  • In version 2, we see that GCC has done this:

    .L4:
            add     edx, DWORD PTR [rax]      # add the current element to the running sum in edx
            add     rax, 4                    # advance the element pointer
            mov     DWORD PTR [rax-4], edx    # store the running sum back to the current element
            cmp     rcx, rax
            jne     .L4                       # loop until the end of the array
    

    So it keeps the running sum in edx, adds each new element into it, and stores the accumulated sum to memory. This is fine; no problems here.

    In version 1, we see that GCC has done this:

    .L4:
            mov     edx, DWORD PTR [rax-4]    # reload the previous element's prefix sum from memory
            add     DWORD PTR [rax], edx      # add it into the current element (read-modify-write)
            add     rax, 4                    # advance the element pointer
            cmp     rax, rcx
            jne     .L4                       # loop until the end of the array
    

    Here the previously stored partial sum is loaded again, and then it is added into the current element.

    That is not good: there is a loop-carried dependency through memory, putting the latency of a store+reload on the critical path. On various CPUs that may be 5 or 6 cycles per iteration. In version 2, the corresponding dependency goes through edx instead of through memory, which is efficient on every CPU and lets the loop run at closer to 1.5 cycles per iteration. Based on that latency difference alone, a performance gap of around 4x is expected (a source-level rewrite that makes the register-carried form explicit is sketched at the end of this answer).

    On some newer CPUs it may not work out so badly, if the CPU can use memory renaming (zero-latency store-to-load forwarding) in this situation. However, writing about Intel Ice Lake, Agner Fog notes:

    The fast forwarding does not work in the following cases, according to my tests:
    ...

    • Read-modify-write instructions

    We do have a read-modify-write instruction here, so that's probably not good.

    For AMD-style memory renaming, apparently the memory operand has to be an exact match (not just the computed address, but the way in which it is specified), which we don't have here, so it's probably not good either.

    I don't know why GCC would compile the two versions in such different ways. Maybe there is a reason and I just couldn't think of it, but maybe it just happened by bad luck.
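
    A minimal sketch of a possible workaround (an illustration, not part of the original question or of GCC's output): keeping the running sum in an explicit local variable expresses the loop-carried dependency through a register at the source level, rather than through the array.

        #include <numeric>
        #include <algorithm>

        int main() {
            unsigned a[5000];
            std::iota(a, a + 5000, 0);
            for (int k = 0; k < 1'000'000; k++) {
                unsigned sum = a[0];           // running prefix sum kept in a local variable
                for (int i = 1; i < 5000; i++) {
                    sum += a[i];               // register-carried dependency, no reload from memory
                    a[i] = sum;                // store the result; not on the inner loop's critical path
                }
            }
            return *std::min_element(a, a + 5000);
        }

    This computes the same prefix sums as version 1; whether a particular GCC version actually keeps sum in a register is best confirmed by inspecting the generated assembly, as above.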