I'm currently programming a compiler and am about to implement code generation. The target instruction set for now is x64.
Now x64 is CISC, so there are many complex instructions. But I know these are internally converted to RISC-like micro-operations (uops) by the CPU, and there's also out-of-order execution after that.
My question is therefore: Does using more short instructions (RISC-like) have a performance impact over using fewer complex instructions? The test programs for my language aren't that big, so I think fitting instructions into the cache should currently not be a problem.
No, using mostly simple x86 instructions (e.g. avoiding `push` and instead using `sub rsp, whatever` and storing args with `mov`) was a useful optimization for P5 Pentium, because it didn't know how to split compact but complex instructions internally. Its 2-wide superscalar pipeline could only pair simple instructions.
Modern x86 CPUs (since Intel P6 (Pentium Pro / PIII), and including all x86-64 CPUs) do decode complex instructions to multiple uops that can be scheduled independently. (And for common complex instructions like `push` / `pop`, they have tricks to handle them as a single uop: a stack engine renames the stack pointer outside of the out-of-order part of the core, so no uop is needed for the `rsp -= 8` part of `push`.)
Memory-source instructions like `add eax, [rdi]` can even decode to a single uop on Intel CPUs by micro-fusing the load with the ALU uop, only separating them in the out-of-order scheduler for dispatching to execution units. In the rest of the pipeline, it only uses 1 entry (in the front-end and the ROB). (But see Micro fusion and addressing modes for limitations on Sandybridge with indexed addressing modes, relaxed somewhat on Haswell and later.) AMD CPUs just naturally keep memory operands fused with ALU instructions, and never decoded them to extra m-ops / uops in the first place, so it doesn't have a fancy name there.
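As a rough sketch of what that buys (register choices here are arbitrary, and uop counts are the typical ones for modern Intel/AMD, so treat them as approximate): both forms below compute `eax += *rdi`, but the memory-source form is one 2-byte instruction and usually a single fused-domain uop, while the split form is two instructions (4 bytes) and two uops.

# memory-source ALU: the load micro-fuses with the add
add eax, dword ptr [rdi]      # 1 instruction, typically 1 fused-domain uop

# equivalent split form: separate load, then reg-reg add
mov ecx, dword ptr [rdi]      # burns ecx just as a scratch register
add eax, ecx                  # 2 instructions, 2 uops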
Instruction length is not perfectly correlated with simplicity. e.g. `idiv rcx` is only 3 bytes, but decodes to 57 uops on Skylake. (Avoid 64-bit division; it's slower than 32-bit.)
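If the values are known to fit in 32 bits, narrowing the division is an easy win. A sketch (unsigned division shown; operand choices are just for illustration):

# 32-bit: divides EDX:EAX by ecx; much cheaper on e.g. Skylake
xor edx, edx                  # zero the high half of the dividend
div ecx                       # quotient in eax, remainder in edx

# 64-bit: divides RDX:RAX by rcx; many more uops and much higher latency
xor edx, edx                  # also zeroes rdx
div rcx                       # quotient in rax, remainder in rdx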
Smaller code is better, all else equal. Prefer 32-bit operand-size when it's sufficient, to avoid REX prefixes, and choose registers that don't need REX prefixes (like `ecx` instead of `r8d`). But normally don't spend extra instructions to make that happen (e.g. use `r8d` instead of saving/restoring `rbx` just so you can use `ebx` as another scratch register).
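For a concrete feel of the size difference (byte counts are from the standard x86-64 encodings and easy to verify with an assembler), the REX prefix costs one byte per instruction:

xor eax, eax                  # 2 bytes; writing eax zero-extends into rax anyway
xor rax, rax                  # 3 bytes; the REX.W prefix buys nothing here
add ecx, edx                  # 2 bytes; legacy registers need no REX
add r8d, edx                  # 3 bytes; REX.B is needed just to name r8d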
But when all else is not equal, size is usually the last priority for high performance, behind minimizing uops and keeping latency dependency chains short (especially loop-carried dependency chains).
Most programs spend most of their time in loops small enough to fit in L1d cache, and lots of that time in a few even smaller loops within that.
Unless you can correctly identify "cold" code (executed rarely), optimizing for size over speed with something like 3-byte `push 1` / `pop rax` instead of 5-byte `mov eax, 1` is definitely not a good default. clang/LLVM will push/pop for constants with `-Oz` (optimize only for size), but not with `-Os` (optimize for a balance of size and speed).
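To make that trade-off concrete (byte counts from the standard encodings; uop counts are the usual ones for modern CPUs, so treat them as approximate):

push 1                        # 2 bytes (6A 01): store the constant to the stack
pop rax                       # 1 byte  (58):    reload it into rax
                              # total: 3 bytes, 2 uops, plus a store/reload through memory
mov eax, 1                    # 5 bytes (B8 01 00 00 00): 1 uop, no memory traffic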
Using `inc` instead of `add reg, 1` saves a byte (only 1 byte in x86-64, vs. 2 in 32-bit code). With a register destination, it's just as fast in most cases on most CPUs. See INC instruction vs ADD 1: Does it matter?
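The encodings behind that byte count, for reference (64-bit mode; in 32-bit mode `inc ecx` has a 1-byte form, which is where the 2-byte saving comes from):

inc ecx                       # 2 bytes (FF C1) in 64-bit mode
add ecx, 1                    # 3 bytes (83 C1 01), sign-extended imm8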
Modern mainstream x86 CPUs have decoded-uop caches (AMD since Ryzen, Intel since Sandybridge) that mostly avoid the front-end bottlenecks older CPUs had when average instruction length was greater than 4 bytes.
Before that (Core2 / Nehalem), tuning to avoid front-end bottlenecks was much more complicated than just using short instructions on average. See Agner Fog's microarch guide for details about the uop patterns the decoders can handle in those older Intel CPUs, the effects of code alignment relative to 16-byte boundaries for fetch after a jump, and lots more.
AMD Bulldozer-family marks instruction boundaries in L1i cache, and can decode up to 2x 16 bytes per cycle if both cores of a cluster are active; otherwise Agner Fog's microarch PDF (https://agner.org/optimize/) reports ~21 bytes per cycle (vs. Intel's up to 16 bytes per cycle for the decoders when not running from the uop cache). Bulldozer's lower back-end throughput probably means that front-end bottlenecks happen less often. But I don't really know; I haven't tuned anything for Bulldozer-family, and haven't had hardware to test anything on.
An example: this function compiled with clang at `-O3`, `-Os`, and `-Oz`:
int sum(int *arr) {
    int sum = 0;
    for (int i = 0; i < 10240; i++) {
        sum += arr[i];
    }
    return sum;
}
Source + asm output on the Godbolt compiler explorer, where you can play with this code and compiler options.
I also used `-fno-vectorize` because I assume you won't be trying to auto-vectorize with SSE2, even though that's baseline for x86-64. (Although that would speed up this loop by a factor of 4.)
# clang -O3 -fno-vectorize
sum: # @sum
xor eax, eax
mov ecx, 7
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx - 28]
add eax, dword ptr [rdi + 4*rcx - 24]
add eax, dword ptr [rdi + 4*rcx - 20]
add eax, dword ptr [rdi + 4*rcx - 16]
add eax, dword ptr [rdi + 4*rcx - 12]
add eax, dword ptr [rdi + 4*rcx - 8]
add eax, dword ptr [rdi + 4*rcx - 4]
add eax, dword ptr [rdi + 4*rcx]
add rcx, 8
cmp rcx, 10247
jne .LBB2_1
ret
This is pretty silly; it unrolled by 8 but still with only 1 accumulator. So it bottlenecks on the 1-cycle latency of `add` instead of on 2-per-clock load throughput (on Intel since SnB and AMD since K8). (And only reading 4 bytes per clock cycle, it probably doesn't bottleneck on memory bandwidth very much.)
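A sketch of how the same scalar loop could use two accumulators (my illustration, not clang output): the two `add` dependency chains are independent, so the loop can sustain 2 loads per clock instead of serializing on one 1-cycle `add` chain.

# hypothetical 2-accumulator scalar version
xor eax, eax
xor edx, edx
xor ecx, ecx
.loop:
add eax, dword ptr [rdi + 4*rcx]        # chain 1
add edx, dword ptr [rdi + 4*rcx + 4]    # chain 2, independent of chain 1
add rcx, 2
cmp rcx, 10240
jne .loop
add eax, edx                            # combine the accumulators at the end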
It does better with normal `-O3`, not disabling vectorization, using 2 vector accumulators:
sum: # @sum
pxor xmm0, xmm0 # zero first vector register
mov eax, 36
pxor xmm1, xmm1 # 2nd vector
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm2, xmmword ptr [rdi + 4*rax - 144]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [rdi + 4*rax - 128]
paddd xmm0, xmm1
movdqu xmm1, xmmword ptr [rdi + 4*rax - 112]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 96]
movdqu xmm4, xmmword ptr [rdi + 4*rax - 80]
paddd xmm4, xmm1
paddd xmm4, xmm2
movdqu xmm2, xmmword ptr [rdi + 4*rax - 64]
paddd xmm2, xmm3
paddd xmm2, xmm0
movdqu xmm1, xmmword ptr [rdi + 4*rax - 48]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 32]
movdqu xmm0, xmmword ptr [rdi + 4*rax - 16]
paddd xmm0, xmm1
paddd xmm0, xmm4
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm1, xmm3
paddd xmm1, xmm2
add rax, 40
cmp rax, 10276
jne .LBB2_1
paddd xmm1, xmm0 # add the two accumulators
# and horizontal sum the result
pshufd xmm0, xmm1, 78 # xmm0 = xmm1[2,3,0,1]
paddd xmm0, xmm1
pshufd xmm1, xmm0, 229 # xmm1 = xmm0[1,1,2,3]
paddd xmm1, xmm0
movd eax, xmm1 # extract the result into a scalar integer reg
ret
This version unrolls probably more than it needs to; the loop overhead is tiny and `movdqu` + `paddd` is only 2 uops, so we're far from bottlenecking on the front-end. With 2-per-clock `movdqu` loads, this loop can process 32 bytes of input per clock cycle, assuming the data is hot in L1d cache or maybe L2; otherwise it will run slower. This more-than-minimum unroll will let out-of-order execution run ahead and see the loop exit condition before the `paddd` work has caught up, and maybe mostly hide the branch mispredict on the last iteration.
Using more than 2 accumulators to hide latency is very important in FP code, where most instructions don't have single-cycle latency. (It would also be useful for this function on AMD Bulldozer-family, where `paddd` has 2-cycle latency.)
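A sketch of the idea (my illustration; the pointer-based loop, the register choices, and the assumption that `rsi` holds the end pointer and the length is a multiple of 64 bytes are all mine): with an add that has 2-cycle latency and 2-per-clock throughput, you want about latency × throughput = 4 accumulators in flight to keep the execution units busy.

# hypothetical 4-accumulator inner loop
pxor xmm0, xmm0                         # zero the four accumulators
pxor xmm1, xmm1
pxor xmm2, xmm2
pxor xmm3, xmm3
.loop:
movdqu xmm4, xmmword ptr [rdi]
movdqu xmm5, xmmword ptr [rdi + 16]
movdqu xmm6, xmmword ptr [rdi + 32]
movdqu xmm7, xmmword ptr [rdi + 48]
paddd xmm0, xmm4                        # four independent dependency chains
paddd xmm1, xmm5
paddd xmm2, xmm6
paddd xmm3, xmm7
add rdi, 64
cmp rdi, rsi
jne .loop
paddd xmm0, xmm1                        # reduce the accumulators afterwards
paddd xmm2, xmm3
paddd xmm0, xmm2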
With big unrolls and large displacements, compilers sometimes generate a lot of instructions that need a `disp32` displacement instead of a `disp8` in the addressing mode. Choosing the point where you increment the loop counter or pointer so that as many addressing modes as possible can use a displacement of -128 .. +127 would probably be a good thing.
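The size difference is significant; for example (byte counts from the standard ModRM encoding):

add eax, dword ptr [rdi + 124]          # disp8 fits in -128..+127: 3 bytes total
add eax, dword ptr [rdi + 128]          # needs a disp32:           6 bytes total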
Unless you're tuning for Nehalem / Core 2 or other CPUs without a uop cache, you probably don't want to add extra loop overhead (`add rdi, 256` twice instead of `add rdi, 512` or something) just to shrink the code size.
By comparison, clang `-Os` still auto-vectorizes (unless you disable it), with an inner loop that's exactly 4 uops long on Intel CPUs:
# clang -Os
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm0, xmm1
add rax, 4
cmp rax, 10240
jne .LBB2_1
But with clang `-Os -fno-vectorize`, we get the simple and obvious minimal scalar implementation:
# clang -Os -fno-vectorize
sum: # @sum
xor ecx, ecx
xor eax, eax
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
cmp rcx, 10240
jne .LBB2_1
ret
Missed optimization: using `ecx` would avoid a REX prefix on the `inc` and `cmp`; the range is known to fit in 32 bits. It's probably using RCX because it promoted `int` to 64-bit to avoid `movsxd rcx, ecx` sign-extension to 64-bit before use in an addressing mode (possible because signed overflow is UB in C). But after doing that, it could have optimized it back down again after noticing the range.
The loop is 3 uops (assuming a macro-fused `cmp`/`jne` on Intel since Nehalem and AMD since Bulldozer), or 4 uops on Sandybridge (un-lamination of `add` with an indexed addressing mode). A pointer-increment loop could be slightly more efficient on some CPUs, only requiring 3 uops inside the loop even on SnB/IvB.
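A sketch of that pointer-increment alternative (my illustration, reusing the same 10240-element trip count): the simple `[rdi]` addressing mode keeps the memory-source `add` micro-fused even on SnB/IvB, so with `cmp`/`jne` macro-fusion the loop body is 3 fused-domain uops.

xor eax, eax
lea rcx, [rdi + 40960]                  # end pointer: 10240 elements * 4 bytes
.loop:
add eax, dword ptr [rdi]                # simple addressing mode stays micro-fused
add rdi, 4
cmp rdi, rcx                            # cmp/jne macro-fuse into 1 uop
jne .loop
ret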
Clang's `-Oz` output is actually larger, showing signs of its code-gen strategy. Many loops can't be proven to run at least 1 time, and thus need a conditional branch to skip the loop instead of falling into it in the run-zero-times case, or a jump to an entry point near the bottom (Why are loops always compiled into "do...while" style (tail jump)?). It looks like LLVM's `-Oz` code-gen unconditionally uses the jump-to-bottom strategy, without checking whether the condition is provably always true on the first iteration.
sum: # @sum
xor ecx, ecx
xor eax, eax
jmp .LBB2_1
.LBB2_3: # in Loop: Header=BB2_1 Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
.LBB2_1: # =>This Inner Loop Header: Depth=1
cmp rcx, 10240
jne .LBB2_3
ret
Everything's the same except the extra `jmp` to enter the loop.
In a function that did more, you'd see more differences in code-gen, like maybe using a slow `div` even for compile-time constants, instead of a multiplicative inverse (Why does GCC use multiplication by a strange number in implementing integer division?).
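For example, unsigned division by a constant 10 can be done with a multiply and a shift; this is roughly the shape compilers emit, though the exact magic constant and shift count depend on the divisor (a sketch, not copied from any particular compiler's output):

# unsigned edi / 10 without a div instruction
mov eax, edi                  # zero-extends edi into rax
mov edx, 0xCCCCCCCD           # ceil(2^35 / 10), the "magic" multiplier
imul rax, rdx                 # full product fits in 64 bits, so the low half is exact
shr rax, 35                   # rax = edi / 10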