[image: Code asm diff]
I don't really understand what optimization this is; all I know is that it's really fast, and I've tried many options in the manual to no avail. Can anyone explain in detail what optimization this is, and what option in GCC generates the same asm as ICC, or better?
I'm not sure there is an optimization option to make GCC do this. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.
Defeating benchmark repeat-loops can make compilers look good on benchmarks, but AFAIK it's often not useful in real-world code, where something else happens between runs of the loop you want optimized.
ICC is defeating your benchmark repeat-loop by turning it into this:
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        for (unsigned i = 0; i < 100000; ++i)
            sum += data[c];
}
The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.
Turning for() if() into if() for(){} else for(){} is called loop unswitching. In this case there is no "else" work to do; the only thing in the loop body was if() sum += ..., so it becomes just an if controlling a repeated-addition loop.
ICC unrolls + vectorizes that sum +=, strangely not just doing it with a multiply. Instead it does 100000 64-bit add operations. ymm0 holds _mm256_set1_epi64x(data[c]) from vpbroadcastq.
This inner loop only runs conditionally; it's worth branching if it's going to save 6250 iterations of this loop. (Only one pass over the array, one branch per element total, not 100k.)
..B1.9: # Preds ..B1.9 ..B1.8
add edx, 16 #20.2
vpaddq ymm4, ymm4, ymm0 #26.5
vpaddq ymm3, ymm3, ymm0 #26.5
vpaddq ymm2, ymm2, ymm0 #26.5
vpaddq ymm1, ymm1, ymm0 #26.5
cmp edx, 100000 #20.2
jb ..B1.9 # Prob 99% #20.2
Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 and then horizontally summed after the loop. Unrolling with multiple accumulators hides vpaddq latency and lets Skylake and later run 3 vpaddq per clock cycle.
By contrast, GCC does multiple passes over the array, vectorizing the inner loop to branchlessly do 8 compares per iteration:
.L85:
vmovdqa ymm4, YMMWORD PTR [rax] # load 8 ints
add rax, 32
vpcmpgtd ymm0, ymm4, ymm3 # signed-compare them against 128
vpand ymm0, ymm0, ymm4 # mask data[c] to 0 or data[c]
vpmovsxdq ymm2, xmm0 # sign-extend to 64-bit
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm1, ymm2, ymm1 # and add
vpmovsxdq ymm0, xmm0 # sign-extend the high half and add it
vpaddq ymm1, ymm0, ymm1 # to the same accumulator
cmp rbx, rax
jne .L85
This is inside a repeat loop that makes multiple passes over the array, and it might bottleneck on 1 shuffle uop per clock: about 8 elements per 3 clock cycles on Skylake.
So it just vectorized the inner if() sum += data[c] like you'd expect, without defeating the repeat loop at all. Clang does similarly; see Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?