[image: Code asm diff]
I don't really understand what optimization this is; all I know is that it's really fast, and I've tried many options in the manual to no avail. Can anyone explain in detail what optimization this is, and what option in GCC generates the same asm as ICC, or better?
I'm not sure there is an optimization option to make GCC do this. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.
Defeating benchmark repeat-loops can make compilers look good on benchmarks, but AFAIK it's often not useful in real-world code, where something else happens between runs of the loop you want optimized.
ICC is defeating your benchmark repeat-loop by turning it into this:
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        for (unsigned i = 0; i < 100000; ++i)
            sum += data[c];
}
The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.
Turning for() if() into if() for(){} else for(){} is called loop unswitching. In this case there is no "else" work to do; the only thing in the loop body was if() sum += ..., so it becomes just an if controlling a repeated-addition loop.
ICC unrolls + vectorizes that sum +=, strangely not just doing it with a multiply. Instead it does 100000 64-bit add operations. ymm0 holds _mm256_set1_epi64x(data[c]) from vpbroadcastq.
This inner loop only runs conditionally; it's worth branching if it's going to save 6250 iterations of this loop. (Only one pass over the array, one branch per element total, not 100k.)
..B1.9: # Preds ..B1.9 ..B1.8
add edx, 16 #20.2
vpaddq ymm4, ymm4, ymm0 #26.5
vpaddq ymm3, ymm3, ymm0 #26.5
vpaddq ymm2, ymm2, ymm0 #26.5
vpaddq ymm1, ymm1, ymm0 #26.5
cmp edx, 100000 #20.2
jb ..B1.9 # Prob 99% #20.2
Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 and then horizontally summed after the loop. Unrolling with multiple accumulators hides vpaddq latency and lets Skylake and later run 3 vpaddq per clock cycle.
By contrast, GCC does multiple passes over the array, vectorizing the inner loop to branchlessly do 8 compares per iteration:
.L85:
vmovdqa ymm4, YMMWORD PTR [rax] # load 8 ints
add rax, 32
vpcmpgtd ymm0, ymm4, ymm3 # signed-compare them against 128
vpand ymm0, ymm0, ymm4 # mask data[c] to 0 or data[c]
vpmovsxdq ymm2, xmm0 # sign-extend to 64-bit
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm1, ymm2, ymm1 # and add
vpmovsxdq ymm0, xmm0 # sign-extend the high half and add it
vpaddq ymm1, ymm0, ymm1 # to the same accumulator
cmp rbx, rax
jne .L85
This is inside a repeat loop that makes multiple passes over the array, and it might bottleneck on 1 shuffle uop per clock: about 8 elements per 3 clock cycles on Skylake.
So it just vectorized the inner if() sum += data[c] like you'd expect, without defeating the repeat loop at all. Clang does similarly; see Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?