performance, x86-64, instruction-set

CISC short instructions vs long instructions


I'm currently programming a compiler and am about to implement code generation. The target instruction set for now is x64.
Now x64 is CISC, so there are many complex instructions. But I know these are internally converted to RISC-like micro-operations by the CPU, and there's also out-of-order execution after that.
My question is therefore: is there a performance difference between using more short (RISC-like) instructions and fewer complex instructions? The test programs for my language aren't that big, so I think fitting instructions into the cache shouldn't currently be a problem.


Solution

  • No, using mostly simple x86 instructions (e.g. avoiding push and instead using sub rsp, whatever and storing args with mov) was a useful optimization for the P5 Pentium, because it didn't know how to split compact but complex instructions internally. Its 2-wide superscalar pipeline could only pair simple instructions.
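
    For example, passing two stack args to a hypothetical callee foo (a sketch assuming a 32-bit stack-args calling convention, just to illustrate the trade-off):

        # compact CISC style: each push stores an arg and adjusts ESP in one instruction
        push    2
        push    1
        call    foo

        # P5-era "simple instructions" style: same effect with more but simpler instructions
        sub     esp, 8
        mov     dword ptr [esp + 4], 2
        mov     dword ptr [esp], 1
        call    foo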

    Modern x86 CPUs (since Intel P6 (Pentium Pro / PIII), and including all x86-64 CPUs) do decode complex instructions to multiple uops that can be scheduled independently. (And for common complex instructions like push / pop, they have tricks to handle them as a single uop: a stack engine renames the stack pointer outside the out-of-order part of the core, so no uop is needed for the rsp -= 8 part of push.)

    Memory-source instructions like add eax, [rdi] can even decode to a single uop on Intel CPUs by micro-fusing the load with the ALU uop, only separating them in the out-of-order scheduler for dispatching to execution units. In the rest of the pipeline, the pair only uses 1 entry (in the front-end and the ROB). (But see Micro fusion and addressing modes for limitations on Sandybridge with indexed addressing modes, relaxed somewhat on Haswell and later.) AMD CPUs just naturally keep memory operands fused with ALU instructions; they never decoded them to extra m-ops / uops in the first place, so the behavior doesn't have a fancy name.
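
    For example (the split version is a sketch; ECX is just an arbitrary scratch register):

        add     eax, dword ptr [rdi]      # 1 instruction: load + add micro-fuse into 1 fused-domain uop on Intel

        mov     ecx, dword ptr [rdi]      # "RISC-like" split: 2 instructions, 2 fused-domain uops,
        add     eax, ecx                  # and it ties up ECX as a temporary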

    Instruction length is not perfectly correlated with simplicity. For example, idiv rcx is only 3 bytes, but decodes to 57 uops on Skylake. (Avoid 64-bit division; it's slower than 32-bit.)
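
    If the values are known to fit in 32 bits, doing the division at 32-bit operand-size is much cheaper (a sketch assuming the dividend is in RAX/EAX and the divisor in RCX/ECX):

        cqo                      # sign-extend RAX into RDX:RAX
        idiv    rcx              # tiny encoding, but dozens of uops and high latency

        cdq                      # sign-extend EAX into EDX:EAX
        idiv    ecx              # same idea at 32-bit operand-size: far fewer uops, lower latency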


    Smaller code is better, all else equal. Prefer 32-bit operand-size when it's sufficient, to avoid REX prefixes, and choose registers that don't need REX prefixes (like ecx instead of r8d). But normally don't spend extra instructions to make that happen (e.g. do use r8d rather than saving/restoring rbx just so you can use ebx as another scratch register).
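
    A few encoding sizes to illustrate (byte counts for 64-bit mode):

        add     ecx, edx         # 2 bytes: no REX prefix needed
        add     rcx, rdx         # 3 bytes: REX.W for 64-bit operand-size
        add     r8d, edx         # 3 bytes: REX needed just to reach R8D
        mov     ecx, 1           # 5 bytes; writing ECX zero-extends into RCX anyway
        mov     rcx, 1           # 7 bytes: REX.W plus a sign-extended imm32 encoding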

    But when all else is not equal, size is usually the last priority for high performance, behind minimizing uops and keeping latency dependency chains short (especially loop-carried dependency chains).


    Most programs spend most of their time in loops small enough to fit in L1d cache, and lots of that time in a few even smaller loops within that.

    Unless you can correctly identify "cold" code (executed rarely), optimizing for size over speed with something like 3-byte push 1 / pop rax instead of 5-byte mov eax, 1 is definitely not a good default. clang/LLVM will push/pop for constants with -Oz (optimize only for size), but not -Os (optimize for a balance of size and speed).
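
    The sizes in question (64-bit mode):

        push    1                # 2 bytes (6A 01): pushes a sign-extended imm8
        pop     rax              # 1 byte  (58); 3 bytes total, but it's a store + reload

        mov     eax, 1           # 5 bytes (B8 01 00 00 00): 1 uop, no memory access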

    Using inc instead of add reg,1 saves a byte (only 1 in x86-64, vs. 2 in 32-bit code). With a register destination, it's just as fast in most cases on most CPUs. See INC instruction vs ADD 1: Does it matter?
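
    The encodings, using ECX as an example register:

        inc     ecx              # 2 bytes in 64-bit mode (FF C1); 1 byte (41) in 32-bit code
        add     ecx, 1           # 3 bytes (83 C1 01)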


    Modern mainstream x86 CPUs have decoded-uop caches (AMD since Ryzen, Intel since Sandybridge) that mostly avoid the front-end bottlenecks older CPUs hit when average instruction length is > 4 bytes.

    Before that (Core 2 / Nehalem), tuning to avoid front-end bottlenecks was much more complicated than just using short instructions on average. See Agner Fog's microarch guide for details about the uop patterns the decoders can handle in those older Intel CPUs, the effects of code alignment relative to 16-byte boundaries for fetch after a jump, and lots more.

    AMD Bulldozer-family marks instruction boundaries in L1i cache, and can decode up to 2x 16 bytes per cycle when both cores of a cluster are active; otherwise Agner Fog's microarch PDF (https://agner.org/optimize/) reports ~21 bytes per cycle (vs. Intel's up to 16 bytes per cycle for the decoders when not running from the uop cache). Bulldozer's lower back-end throughput probably means that front-end bottlenecks happen less often. But I don't really know; I haven't tuned anything for Bulldozer-family, and haven't had hardware to test on.


    An example: this function compiled with clang with -O3, -Os, and -Oz

    int sum(int *arr) {
        int sum = 0;
        for (int i = 0; i < 10240; i++) {
            sum += arr[i];
        }
        return sum;
    }
    

    Source + asm output on the Godbolt compiler explorer, where you can play with this code and compiler options.

    I also used -fno-vectorize because I assume you won't be trying to auto-vectorize with SSE2, even though that's baseline for x86-64. (Although that would speed up this loop by a factor of 4.)

    # clang -O3 -fno-vectorize
    sum:                                    # @sum
            xor     eax, eax
            mov     ecx, 7
    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
            add     eax, dword ptr [rdi + 4*rcx - 28]
            add     eax, dword ptr [rdi + 4*rcx - 24]
            add     eax, dword ptr [rdi + 4*rcx - 20]
            add     eax, dword ptr [rdi + 4*rcx - 16]
            add     eax, dword ptr [rdi + 4*rcx - 12]
            add     eax, dword ptr [rdi + 4*rcx - 8]
            add     eax, dword ptr [rdi + 4*rcx - 4]
            add     eax, dword ptr [rdi + 4*rcx]
            add     rcx, 8
            cmp     rcx, 10247
            jne     .LBB2_1
            ret
    

    This is pretty silly; it unrolled by 8 but still with only 1 accumulator. So it bottlenecks on 1 cycle latency add instead of on 2 loads per clock throughput on Intel since SnB and AMD since K8. (And only reading 4 bytes per clock cycle, it probably doesn't bottleneck on memory bandwidth very much.)
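
    For comparison, a hand-written sketch (not compiler output) of the same scalar loop with 2 accumulators, so the two add dependency chains can overlap:

        xor     eax, eax
        xor     edx, edx
        xor     ecx, ecx
    .loop:
        add     eax, dword ptr [rdi + 4*rcx]       # accumulator 1
        add     edx, dword ptr [rdi + 4*rcx + 4]   # accumulator 2: independent dep chain
        add     rcx, 2
        cmp     rcx, 10240
        jne     .loop
        add     eax, edx                           # combine the accumulators
        ret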

    It does better with normal -O3, not disabling vectorization, using 2 vector accumulators:

    sum:                                    # @sum
        pxor    xmm0, xmm0           # zero first vector register
        mov     eax, 36
        pxor    xmm1, xmm1           # 2nd vector
    
    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
        movdqu  xmm2, xmmword ptr [rdi + 4*rax - 144]
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rax - 128]
        paddd   xmm0, xmm1
        movdqu  xmm1, xmmword ptr [rdi + 4*rax - 112]
        movdqu  xmm3, xmmword ptr [rdi + 4*rax - 96]
        movdqu  xmm4, xmmword ptr [rdi + 4*rax - 80]
        paddd   xmm4, xmm1
        paddd   xmm4, xmm2
        movdqu  xmm2, xmmword ptr [rdi + 4*rax - 64]
        paddd   xmm2, xmm3
        paddd   xmm2, xmm0
        movdqu  xmm1, xmmword ptr [rdi + 4*rax - 48]
        movdqu  xmm3, xmmword ptr [rdi + 4*rax - 32]
        movdqu  xmm0, xmmword ptr [rdi + 4*rax - 16]
        paddd   xmm0, xmm1
        paddd   xmm0, xmm4
        movdqu  xmm1, xmmword ptr [rdi + 4*rax]
        paddd   xmm1, xmm3
        paddd   xmm1, xmm2
        add     rax, 40
        cmp     rax, 10276
        jne     .LBB2_1
    
        paddd   xmm1, xmm0        # add the two accumulators
    
         # and horizontal sum the result
        pshufd  xmm0, xmm1, 78          # xmm0 = xmm1[2,3,0,1]
        paddd   xmm0, xmm1
        pshufd  xmm1, xmm0, 229         # xmm1 = xmm0[1,1,2,3]
        paddd   xmm1, xmm0
    
        movd    eax, xmm1         # extract the result into a scalar integer reg
        ret
    

    This version unrolls probably more than it needs to; the loop overhead is tiny and movdqu + paddd is only 2 uops, so we're far from bottlenecking on the front-end. With 2-per-clock movdqu loads, this loop can process 32 bytes of input per clock cycle assuming the data is hot in L1d cache or maybe L2, otherwise it will run slower. This more-than-minimum unroll will let out-of-order execution run ahead and see the loop exit condition before the paddd work has caught up, and maybe mostly hide the branch mispredict on the last iteration.

    Using more than 2 accumulators to hide latency is very important in FP code, where most instructions don't have single-cycle latency. (It would also be useful for this function on AMD Bulldozer-family, where paddd has 2 cycle latency.)
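
    A sketch of what that looks like for packed single-precision FP (assuming XMM0..XMM3 are zeroed before the loop, RSI is an end pointer, the element count is a multiple of 16 floats, and the data is 16-byte aligned):

    .loop:
        addps   xmm0, xmmword ptr [rdi]            # 4 independent accumulators so the
        addps   xmm1, xmmword ptr [rdi + 16]       # multi-cycle addps latency chains
        addps   xmm2, xmmword ptr [rdi + 32]       # can overlap instead of serializing
        addps   xmm3, xmmword ptr [rdi + 48]
        add     rdi, 64
        cmp     rdi, rsi
        jne     .loop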

    With big unrolls and large displacements, compilers sometimes generate a lot of instructions that need a disp32 displacement instead of disp8 in the addressing mode. Choosing the point where you increment the loop counter or pointer so that as many addressing modes as possible can use a displacement in the -128 .. +127 range would probably be a good thing.
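
    In the vectorized loop above, for example:

        movdqu  xmm2, xmmword ptr [rdi + 4*rax - 112]   # -112 fits in a disp8: 1 displacement byte
        movdqu  xmm2, xmmword ptr [rdi + 4*rax - 144]   # -144 is outside -128..+127, so it needs a disp32: 4 bytes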

    Unless you're tuning for Nehalem / Core2 or other CPUs without a uop cache, you probably don't want to add extra loop overhead (of add rdi, 256 twice instead of add rdi, 512 or something) just to shrink the code size.


    By comparison, clang -Os still auto-vectorizes (unless you disable it), with an inner loop that's exactly 4 uops long on Intel CPUs.

    # clang -Os
    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
        movdqu  xmm1, xmmword ptr [rdi + 4*rax]
        paddd   xmm0, xmm1
        add     rax, 4
        cmp     rax, 10240
        jne     .LBB2_1
    

    But with clang -Os -fno-vectorize, we get the simple and obvious minimal scalar implementation:

    # clang -Os -fno-vectorize
    sum:                                    # @sum
        xor     ecx, ecx
        xor     eax, eax
    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
        add     eax, dword ptr [rdi + 4*rcx]
        inc     rcx
        cmp     rcx, 10240
        jne     .LBB2_1
        ret
    

    Missed optimization: using ecx would avoid a REX prefix on inc and cmp; the range is known to fit in 32 bits. It's probably using RCX because it promoted int to 64-bit to avoid a movsxd rcx, ecx sign-extension before use in an addressing mode. (It can do that because signed overflow is UB in C.) But after doing that, it could have narrowed it back down again after noticing the value range.
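
    What that could look like (hypothetical, not clang output):

    .LBB2_1:
        add     eax, dword ptr [rdi + 4*rcx]   # writes to ECX zero-extend into RCX, so the addressing mode still works
        inc     ecx                            # no REX prefix: 1 byte shorter than inc rcx
        cmp     ecx, 10240                     # likewise 1 byte shorter than cmp rcx, 10240
        jne     .LBB2_1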

    The loop is 3 uops (assuming macro-fused cmp/jne on Intel since Nehalem and AMD since Bulldozer), or 4 uops on Sandybridge (un-lamination of add with an indexed addressing mode). A pointer-increment loop could be slightly more efficient on some CPUs, only requiring 3 uops inside the loop even on SnB/IvB.
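
    A sketch of such a pointer-increment version (hypothetical, using RDI as the moving pointer):

        lea     rcx, [rdi + 40960]         # end pointer: 10240 ints * 4 bytes
        xor     eax, eax
    .loop:
        add     eax, dword ptr [rdi]       # non-indexed addressing mode: stays micro-fused even on SnB/IvB
        add     rdi, 4
        cmp     rdi, rcx                   # cmp/jne macro-fuse: 3 fused-domain uops in the loop
        jne     .loop
        ret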


    Clang's -Oz output is actually larger, showing a hallmark of its code-gen strategy for loops. Many loops can't be proven to run at least once, and thus need a conditional branch to skip the loop instead of falling into it in the run-zero-times case, or a jump to an entry point near the bottom. (Why are loops always compiled into "do...while" style (tail jump)?)

    Looks like LLVM's -Oz code-gen unconditionally uses the jump-to-bottom strategy without checking if the condition is provably always true on the first iteration.

     sum:                                    # @sum
        xor     ecx, ecx
        xor     eax, eax
        jmp     .LBB2_1
    .LBB2_3:                                #   in Loop: Header=BB2_1 Depth=1
        add     eax, dword ptr [rdi + 4*rcx]
        inc     rcx
    .LBB2_1:                                # =>This Inner Loop Header: Depth=1
        cmp     rcx, 10240
        jne     .LBB2_3
        ret
    

    Everything's the same except the extra jmp to enter the loop.

    In a function that did more, you'd see more differences in code-gen, like maybe using a slow div even for compile-time-constant divisors, instead of a multiplicative inverse (Why does GCC use multiplication by a strange number in implementing integer division?).
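
    For example, unsigned x / 10 both ways (a sketch, assuming x arrives in EDI; the magic constant is the standard one for dividing a 32-bit unsigned value by 10):

        # slow way: an actual division instruction
        mov     eax, edi
        mov     ecx, 10
        xor     edx, edx
        div     ecx                        # quotient in EAX, remainder in EDX; many uops, high latency

        # multiplicative inverse: multiply by ceil(2^35 / 10) and shift
        mov     eax, edi
        mov     edx, 3435973837            # 0xCCCCCCCD
        mul     edx                        # EDX:EAX = x * 0xCCCCCCCD
        shr     edx, 3                     # EDX = (x * 0xCCCCCCCD) >> 35 = x / 10
        mov     eax, edx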