Tags: gcc, x86, x86-64, compiler-optimization, auto-vectorization

Why is gcc -O3 auto-vectorizing factorial? All those extra instructions look worse


Here's a very simple factorial function.

int factorial(int num) {
    if (num == 0)
        return 1;
    return num*factorial(num-1);
}

GCC's assembly for this function on -O2 is reasonable.

factorial(int):
        mov     eax, 1
        test    edi, edi
        je      .L1
.L2:
        imul    eax, edi
        sub     edi, 1
        jne     .L2
.L1:
        ret

However, on -O3 or -Ofast, it decides to make things way more complicated (almost 100 lines!):

factorial(int):
        test    edi, edi
        je      .L28
        lea     edx, [rdi-1]
        mov     ecx, edi
        cmp     edx, 6
        jbe     .L8
        mov     DWORD PTR [rsp-12], edi
        movd    xmm5, DWORD PTR [rsp-12]
        mov     edx, edi
        xor     eax, eax
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        movdqa  xmm4, XMMWORD PTR .LC2[rip]
        shr     edx, 2
        pshufd  xmm2, xmm5, 0
        paddd   xmm2, XMMWORD PTR .LC1[rip]
.L5:
        movdqa  xmm3, xmm2
        movdqa  xmm1, xmm2
        paddd   xmm2, xmm4
        add     eax, 1
        pmuludq xmm3, xmm0
        psrlq   xmm1, 32
        psrlq   xmm0, 32
        pmuludq xmm1, xmm0
        pshufd  xmm0, xmm3, 8
        pshufd  xmm1, xmm1, 8
        punpckldq       xmm0, xmm1
        cmp     eax, edx
        jne     .L5
        movdqa  xmm2, xmm0
        movdqa  xmm1, xmm0
        mov     edx, edi
        psrldq  xmm2, 8
        psrlq   xmm0, 32
        and     edx, -4
        pmuludq xmm1, xmm2
        psrlq   xmm2, 32
        sub     edi, edx
        pmuludq xmm0, xmm2
        pshufd  xmm1, xmm1, 8
        pshufd  xmm0, xmm0, 8
        punpckldq       xmm1, xmm0
        movdqa  xmm0, xmm1
        psrldq  xmm1, 4
        pmuludq xmm0, xmm1
        movd    eax, xmm0
        cmp     ecx, edx
        je      .L1
        lea     edx, [rdi-1]
.L3:
        imul    eax, edi
        test    edx, edx
        je      .L1
        imul    eax, edx
        mov     edx, edi
        sub     edx, 2
        je      .L1
        imul    eax, edx
        mov     edx, edi
        sub     edx, 3
        je      .L1
        imul    eax, edx
        mov     edx, edi
        sub     edx, 4
        je      .L1
        imul    eax, edx
        mov     edx, edi
        sub     edx, 5
        je      .L1
        imul    eax, edx
        sub     edi, 6
        je      .L1
        imul    eax, edi
.L1:
        ret
.L28:
        mov     eax, 1
        ret
.L8:
        mov     eax, 1
        jmp     .L3
.LC0:
        .long   1
        .long   1
        .long   1
        .long   1
.LC1:
        .long   0
        .long   -1
        .long   -2
        .long   -3
.LC2:
        .long   -4
        .long   -4
        .long   -4
        .long   -4

I got these results using Compiler Explorer, so they should match what a real-world build produces.

What's up with that? Are there any cases where this would be faster? Clang seems to do something like this too, but on -O2.


Solution

  • imul r32,r32 has 3 cycle latency on typical modern x86 CPUs (http://agner.org/optimize/). So the scalar implementation can do one multiply per 3 clock cycles, because they're dependent. It's fully pipelined, though, so your scalar loop leaves 2/3rds of the potential throughput unused.

    In 3 cycles, the pipeline in Core2 or later can feed 12 uops into the out-of-order part of the core. For small inputs, it might be best to keep the code small and let out-of-order execution overlap the dependency chain with later code, especially if that later code doesn't all depend on the factorial result. But compilers aren't good at knowing when to optimize for latency vs. throughput, and without profile-guided optimization they have no data on how large n usually is.

    I suspect that gcc's auto-vectorizer isn't looking at how quickly this will overflow for large n.


    A useful scalar optimization would have been unrolling with multiple accumulators, e.g. take advantage of the fact that multiplication is associative and do these in parallel in the loop: prod(n*3/4 .. n) * prod(n/2 .. n*3/4) * prod(n/4 .. n/2) * prod(1..n/4) (with non-overlapping ranges, of course). Multiplication is associative even when it wraps; the product bits only depend on bits at that position and lower, not on (discarded) high bits.

    Or more simply, do f0 *= i; f1 *= i+1; f2 *= i+2; f3 *= i+3; i+=4;. And then outside the loop, return (f0*f1) * (f2*f3);. This would be a win in scalar code, too. Of course you also have to account for n % 4 != 0 when unrolling.
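
    For example, a minimal scalar sketch of that idea might look like the following (the function name and cleanup loop are mine, not gcc output; unsigned is used so the wrapping multiplies are well-defined):

    unsigned factorial_unrolled(unsigned n) {
        // Four independent products: four imul dependency chains can be
        // in flight at once instead of one multiply every 3 cycles.
        unsigned f0 = 1, f1 = 1, f2 = 1, f3 = 1;
        unsigned i = 1;
        for (; i + 3 <= n; i += 4) {
            f0 *= i;
            f1 *= i + 1;
            f2 *= i + 2;
            f3 *= i + 3;
        }
        for (; i <= n; i++)             // cleanup for n % 4 != 0
            f0 *= i;
        return (f0 * f1) * (f2 * f3);   // combine outside the loop
    }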


    What gcc has chosen to do is basically the latter, using pmuludq to do 2 packed multiplies with one instruction (5c latency / 1c or 0.5c throughput on Intel CPUs; it's similar on AMD CPUs, see Agner Fog's instruction tables). Each vector loop iteration does 4 iterations of the factorial loop in your C source, and there's significant instruction-level parallelism within one iteration.

    The inner loop is only 12 uops long (cmp/jcc macro-fuses into 1), so it can issue at 1 iteration per 3 cycles, same throughput as the latency bottleneck in your scalar version, but doing 4x as much work per iteration.

    .L5:
        movdqa  xmm3, xmm2         ; copy the old i vector
        movdqa  xmm1, xmm2
        paddd   xmm2, xmm4         ; [ i0,  i1 |  i2,  i3 ]  += 4
        add     eax, 1
        pmuludq xmm3, xmm0         ; [ f0      |  f2  ] *= [ i0   |  i2  ]
    
        psrlq   xmm1, 32           ; bring odd 32 bit elements down to even: [ i1  | i3 ]
        psrlq   xmm0, 32
        pmuludq xmm1, xmm0         ; [ f1  | f3 ] *= [ i1  | i3 ]
    
        pshufd  xmm0, xmm3, 8
        pshufd  xmm1, xmm1, 8
        punpckldq       xmm0, xmm1   ; merge back into [ f0  f1  f2  f3 ]
        cmp     eax, edx
        jne     .L5
    

    So gcc wastes a whole lot of effort emulating a packed 32-bit multiply instead of simply keeping two separate vector accumulators when using pmuludq. I also looked at clang 6.0, and I think it falls into the same trap. (Source + asm on the Godbolt compiler explorer.)

    You didn't use -march=native or anything, so only SSE2 (baseline for x86-64) is available; for 32-bit input elements that means the only SIMD multiply on offer is the widening 32x32 => 64-bit pmuludq. SSE4.1 pmulld is 2 uops on Haswell and later (single-uop on Sandybridge), but would avoid all of gcc's stupid shuffling.
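
    To make the missed optimization concrete, here's a rough intrinsics sketch (mine, not what gcc generates) of keeping the even-element and odd-element products in two separate pmuludq accumulators, so none of the psrlq / pshufd / punpckldq re-packing sits on the loop-carried dependency chain:

    #include <emmintrin.h>   // SSE2 intrinsics

    unsigned factorial_sse2(unsigned n) {
        __m128i i_even = _mm_set_epi32(0, 3, 0, 1);   // counters i0, i2 in 64-bit lanes
        __m128i i_odd  = _mm_set_epi32(0, 4, 0, 2);   // counters i1, i3
        __m128i step   = _mm_set1_epi64x(4);
        __m128i f_even = _mm_set1_epi64x(1);          // products f0, f2
        __m128i f_odd  = _mm_set1_epi64x(1);          // products f1, f3
        unsigned i = 1;
        for (; i + 3 <= n; i += 4) {
            f_even = _mm_mul_epu32(f_even, i_even);   // pmuludq: 32x32 => 64-bit
            f_odd  = _mm_mul_epu32(f_odd,  i_odd);
            i_even = _mm_add_epi64(i_even, step);
            i_odd  = _mm_add_epi64(i_odd,  step);
        }
        // Combine the four low-32-bit lanes once, outside the loop.
        unsigned prod = (unsigned)_mm_cvtsi128_si32(f_even)
                      * (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(f_even, 8))
                      * (unsigned)_mm_cvtsi128_si32(f_odd)
                      * (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(f_odd, 8));
        for (; i <= n; i++)                           // scalar cleanup for n % 4
            prod *= i;
        return prod;
    }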

    Of course there's a latency bottleneck here, too, especially because of gcc's missed optimizations increasing the length of the loop-carried dep chain involving the accumulators.

    Unrolling with more vector accumulators could hide a lot of the pmuludq latency.

    With good vectorization, the SIMD integer multipliers can manage 2x or 4x the throughput of the scalar integer multiply unit. (Or, with AVX2, 8x the throughput using vectors of 8x 32-bit integers.)

    But the wider the vectors and the more unrolling, the more cleanup code you need.


    gcc -march=haswell

    We get an inner loop like this:

    .L5:
        inc     eax
        vpmulld ymm1, ymm1, ymm0
        vpaddd  ymm0, ymm0, ymm2
        cmp     eax, edx
        jne     .L5
    

    Super simple, but a 10c latency loop-carried dependency chain :/ (pmulld is 2 dependent uops on Haswell and later). Unrolling with multiple accumulators can give up to a 10x throughput boost for large inputs, given the 5c latency / 0.5c throughput of SIMD integer multiply uops on Skylake.

    But 4 multiplies per 5 cycles is still much better than 1 per 3 for scalar.
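
    As a rough sketch of what unrolling with multiple accumulators means here (my own code, not clang's or gcc's output), two independent vpmulld chains with AVX2 intrinsics might look like this:

    #include <immintrin.h>   // AVX2; compile with -march=haswell or -mavx2

    unsigned factorial_avx2(unsigned n) {
        // Two independent vpmulld chains: the 10c multiply latency is
        // split across two dependency chains, 16 counters per iteration.
        __m256i ia   = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
        __m256i ib   = _mm256_setr_epi32(9, 10, 11, 12, 13, 14, 15, 16);
        __m256i step = _mm256_set1_epi32(16);
        __m256i fa   = _mm256_set1_epi32(1);
        __m256i fb   = _mm256_set1_epi32(1);
        unsigned i = 1;
        for (; i + 15 <= n; i += 16) {
            fa = _mm256_mullo_epi32(fa, ia);   // vpmulld, chain A
            fb = _mm256_mullo_epi32(fb, ib);   // vpmulld, chain B (independent)
            ia = _mm256_add_epi32(ia, step);
            ib = _mm256_add_epi32(ib, step);
        }
        unsigned lanes[16], prod = 1;
        _mm256_storeu_si256((__m256i *)lanes, fa);
        _mm256_storeu_si256((__m256i *)(lanes + 8), fb);
        for (int k = 0; k < 16; k++)           // horizontal product
            prod *= lanes[k];
        for (; i <= n; i++)                    // scalar cleanup for n % 16
            prod *= i;
        return prod;
    }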

    Clang unrolls with multiple accumulators by default, so it should be good. But it's a lot of code, so I didn't analyze it by hand. Plug it into IACA or benchmark it for large inputs. (What is IACA and how do I use it?)


    Efficient strategies for handling unroll epilogue:

    A lookup table for factorial [0..7] is probably the best bet. Arrange things so your vector / unrolled loop covers n%8+1 .. n instead of 1 .. n/8*8; then the left-over part is always the prefix 1 .. n%8, which a single table lookup handles for every n.

    After a horizontal vector product, do one more scalar multiply with the table lookup result. A SIMD loop already needs some vector constants so you'll probably touch memory anyway, and the table lookup can happen in parallel with the main loop.

    8! is 40320, which fits in 16 bits, so a 1..8 lookup table only needs 8 * 2 bytes of storage. Or use 32-bit entries so you can use a memory source operand for imul instead of a separate movzx.
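
    Putting that together, a minimal scalar sketch of the idea (the names are mine, and the plain loop stands in for the unrolled / vectorized main loop):

    // fact_table[k] = k! for k = 0..7; 7! = 5040, so even 16-bit entries
    // would do, but 32-bit entries let imul use a memory source operand.
    static const unsigned fact_table[8] = {1, 1, 2, 6, 24, 120, 720, 5040};

    unsigned factorial_table_epilogue(unsigned n) {
        unsigned rem  = n % 8;
        unsigned prod = 1;
        // Main loop covers rem+1 .. n: always a multiple of 8 iterations,
        // so an unrolled / vectorized version needs no variable cleanup.
        for (unsigned i = rem + 1; i <= n; i++)
            prod *= i;
        // One table lookup replaces the 1 .. rem prefix.
        return prod * fact_table[rem];
    }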