I tried two almost identical methods of calculating prefix sums, and found that they have significant differences after being compiled. The compilation option is -O2
.
Different compilation results led to a 4 times difference in their running time.
The first:
#include <numeric>
#include <algorithm>
int main() {
unsigned a[5000];
std::iota(a, a + 5000, 0);
for (int k = 0; k < 1'000'000; k++)
for (int i = 1; i < 5000; i++)
a[i] += a[i - 1];
return *std::min_element(a, a + 5000);
}
The second:
#include <numeric>
#include <algorithm>
int main() {
unsigned a[5000];
std::iota(a, a + 5000, 0);
for (int k = 0; k < 1'000'000; k++)
for (int i = 0; i + 1 < 5000; i++)
a[i + 1] += a[i];
return *std::min_element(a, a + 5000);
}
What could be the reason for this exception in the compiler?
In version 2 we see that GCC has done this,
.L4:
add edx, DWORD PTR [rax]
add rax, 4
mov DWORD PTR [rax-4], edx
cmp rcx, rax
jne .L4
So accumulating values in edx
, adding a new element into it and storing the accumulated sum to memory. This is fine, no problems here.
In version 1 we see that GCC has done this,
.L4:
mov edx, DWORD PTR [rax-4]
add DWORD PTR [rax], edx
add rax, 4
cmp rax, rcx
jne .L4
Here the previously stored partial sum is loaded again, and then it is added into the current element.
That is not good, there is a loop-carried dependency through memory, putting the latency of a store+reload on the critical path. On various CPUs that may be 5 or 6 cycles per iteration. In version 2 the similar dependency goes through edx
instead of through memory, which is pretty efficient on every CPU, letting the loop execute at closer to 1.5 cycles per iteration. Based on that effect, a performance difference of around 4x is expected.
On some newer CPUs it may not work out so badly, if the CPU can use memory renaming (zero-latency store to load forwarding) in this situation. However, about Intel Ice Lake, Agner Fog writes:
The fast forwarding does not work in the following cases, according to my tests:
...
- Read-modify-write instructions
We do have a read-modify-write instruction here so that's probably not good.
For AMD-style memory renaming, apparently the memory operand has to be an exact match (not just the computed address, but the way in which it is specified), which we don't have here, so it's probably not good either.
I don't know why GCC would compile the two versions in such different ways. Maybe there is a reason and I just couldn't think of it, but maybe it just happened by bad luck.