Search code examples
assemblyoptimizationclangx86-64division

Why does Clang do this optimization trick only from Sandy Bridge onward?


I noticed that Clang does an interesting division optimization trick for the following snippet

int64_t s2(int64_t a, int64_t b)
{
    return a/b;
}

Below is the assembly output if specifying march as Sandy Bridge or above

        mov     rax, rdi
        mov     rcx, rdi
        or      rcx, rsi
        shr     rcx, 32
        je      .LBB1_1
        cqo
        idiv    rsi
        ret
.LBB1_1:
        xor     edx, edx
        div     esi
        ret

Here are the Godbolt links for the signed version and the unsigned version

From what I understand it checks whether the high bits of the two operands are zero, and does a 32-bit division if that's true

I checked this table and see that the latencies for 32/64-bit division on Core2 and Nehalem are 40/116 and 26/89 respectively. Hence if the operands are indeed often not wide then the savings by doing a 32-bit division instead of a 64-bit one may be just as worth as on SnB

So why is it enabled only for SnB and later microarchitectures? Why don't other compilers like GCC or ICC do it?


Solution

  • I'm guessing that clang devs tested which uarches it was good on, and found it was only SnB-family.

    That sounds right, because of a funky stall on P6-family, and AMD's different dividers.


    Using the flag result from a shift imm8 (not a shift-by-implicit-1) on P6-family causes the front-end to stall before issuing the flag-reading instruction until the shift is retired. (Because the P6 decoders don't check for the imm8=0 case for leaving flags unmodified, while SnB does). INC instruction vs ADD 1: Does it matter?. That might be why clang doesn't use it for P6-family.

    Probably a different way of checking the relevant condition that didn't cause this stall (like a test rcx,rcx before the je, would make it worth it on Core2/Nehalem). But if clang devs didn't realize the reason it was slow on P6-family, they wouldn't have thought to fix it, and just left it not done for pre-SnB targets. (Nobody added me to a patch review or bug CC list about this one, unfortunately; this is the first I've seen of clang doing this optimization. Although I think I might have mentioned shift flag stalls in comments on some other LLVM review or bug. Anyway, might be fun to try adding a test and see if that makes it worthwhile on Nehalem.)


    AMD's dividers have the same best-case div performance regardless of operand-size, presumably depending only on the actual magnitude of the inputs, according to Agner Fog. Only the worst-case grows with operand-size. So I think it's harmless to run idiv r64 with small inputs sign-extended to 128 / 64-bit on AMD. (div/idiv on AMD is 2 uops for all operand sizes (except 8-bit where it's one because it only has to write one output register: AH and AL = AX. Unlike Intel's microcoded integer division.)

    Intel before Ice Lake is very different: idiv r32 is 9 uops, vs. idiv r64 being 59 uops, with a best-case throughput that's 3x worse, on Haswell. Other members of SnB-family up to Skylake and its derivatives (like Cascade Lake) are similar.

    Ice Lake added a new integer-division unit that handles idiv r64 in the same number of uops as idiv r32, so probably the same performance for the same input numbers on that and newer Intel, at least for the P-cores in hybrid CPUs like Alder Lake. (For idiv r64 with inputs that could have fit in i64 dividend and i32 divisor).

    Related:

    Why don't other compilers like GCC or ICC do it?

    Probably because Clang developers thought of it, and GCC/ICC haven't copied them yet. If you've watched Chandler Carruth's talks about perf, one example he used was playing around with a branch to skip a div. I'm guessing this optimization was his idea. Looks nifty. :)