Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
        xor     eax, eax                                #6.10
        mov     rcx, QWORD PTR [rdi]                    #7.11
        test    rcx, rcx                                #7.11
        je      ..B1.6                  # Prob 1%       #7.11
        mov     rdx, rax                                #7.3
..B1.3:                                 # Preds ..B1.4 ..B1.2
        inc     rdx                                     #7.3
        mov     rcx, QWORD PTR [rcx]                    #7.11
        lea     rsi, QWORD PTR [rdx+rdx]                #9.7
        lea     rax, QWORD PTR [-1+rdx*2]               #9.7
        test    rcx, rcx                                #7.11
        je      ..B1.6                  # Prob 18%      #7.11
        mov     rcx, QWORD PTR [rcx]                    #7.11
        mov     rax, rsi                                #9.7
        test    rcx, rcx                                #7.11
        jne     ..B1.3                  # Prob 82%      #7.11
..B1.6:                                 # Preds ..B1.3 ..B1.4 ..B1.1
        ret                                             #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
- it unrolls the loop by two, so each trip through the loop does two loads and has two exit branches;
- instead of simply incrementing the counter in rax, it keeps an iteration count in rdx and materializes the return value with a pair of lea instructions (one for each possible exit point) plus a mov.
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
        mov     rdx, QWORD PTR [rdi]
        xor     eax, eax
        test    rdx, rdx
        je      .L4
.L3:
        mov     rdx, QWORD PTR [rdx]
        add     rax, 1
        test    rdx, rdx
        jne     .L3
        rep ret
.L4:
        rep ret
This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.
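To make that bottleneck concrete, here's a minimal sketch (mine, not from the question) that builds a chain of N nodes and calls count on it. Since each load's address is the result of the previous load, the CPU can't overlap iterations no matter how lean the loop body is:

#include <cstddef>
#include <cstdio>
#include <vector>

// The function from the question, unchanged.
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}

int main()
{
    const std::size_t N = 1u << 20;
    std::vector<void*> nodes(N + 1, nullptr);   // nodes[N] stays null: the sentinel
    for (std::size_t i = 0; i < N; ++i)
        nodes[i] = &nodes[i + 1];               // each slot points at the next slot
    // Each load's address comes from the previous load, so traversal costs
    // roughly one L1 load-use latency per node however the loop is compiled.
    std::printf("%ld\n", count(&nodes[0]));     // prints 1048576
}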
Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is cheaper than a taken branch, since it has no negative impact on the front-end (when it predicts correctly), and unrolling by two replaces half of the taken loop-back branches with fall-through exit checks.
Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it used more uops than two inc rax instructions would. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops.)
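Rewriting ICC's asm back into C makes this easier to see. The following is my hand-decompilation of the assembly above (the name count_icc_shape is mine), so treat it as an approximation rather than anything the compiler actually saw:

// Approximate C rendering of ICC's unrolled loop, decompiled by hand from the asm above.
long count_icc_shape(void** x)
{
    void* p = *x;               // mov rcx, QWORD PTR [rdi]
    if (!p) return 0;           // test rcx, rcx / je ..B1.6 (rax is already 0)
    long n = 0;                 // mov rdx, rax
    for (;;) {
        ++n;                    // inc rdx
        p = *(void**)p;         // mov rcx, QWORD PTR [rcx]  (first load of the pair)
        long even = 2 * n;      // lea rsi, QWORD PTR [rdx+rdx]
        long odd  = 2 * n - 1;  // lea rax, QWORD PTR [-1+rdx*2]
        if (!p) return odd;     // je ..B1.6 with rax = 2n-1
        p = *(void**)p;         // mov rcx, QWORD PTR [rcx]  (second load of the pair)
        if (!p) return even;    // mov rax, rsi; test; fall through to ret
    }
}

Per pair of nodes that's an inc, two leas and a mov spent on bookkeeping, where a single add rax, 2 (plus a one-off fix-up on the odd exit path, as in the hand-written version below) would do.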
Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).
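(As a back-of-the-envelope check: at 4 cycles per node and 4 GHz, traversing the 2^20-node chain from the sketch above costs about 4M cycles, i.e. roughly a millisecond, regardless of how the loop was compiled.)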
IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If it does, it should probably have known it was doing a poor job here, and just not unrolled at all.
For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it's useful to have the loop take as few uops as possible, so the work after it can issue sooner.
Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches: for the same example, one that's never taken, and one that's taken 7 times then not-taken the 8th time.
Here's what a hand-written unrolled-by-two implementation looks like. Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop:
count(void**):
        xor     eax, eax                # counter
        mov     rcx, QWORD PTR [rdi]    # *x
        test    rcx, rcx
        je      ..B1.6
        .p2align 4                      # mostly to make it more likely that the previous test/je doesn't decode in the same block as the following test/jz, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
        mov     rcx, QWORD PTR [rcx]
        test    rcx, rcx
        jz      .plus1
        mov     rcx, QWORD PTR [rcx]
        add     rax, 2
        test    rcx, rcx
        jnz     .loop
..B1.6:
        ret
.plus1:                                 # exit path for odd counts
        inc     rax
        ret
This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two macro-fusions in a single decode group, but earlier CPUs can't.
gcc's implementation is only 3 fused-domain uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easy to test how CPUs issue loops of fewer than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.
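For reference, here's the same unroll-by-two structure with the exit-path fix-up written back in portable C (a sketch of the idea, under my own name count_unrolled; a compiler won't necessarily turn it into exactly the asm above):

long count_unrolled(void** x)
{
    long i = 0;
    void* p = *x;
    if (!p) return 0;
    for (;;) {
        p = *(void**)p;
        if (!p) return i + 1;   // odd count: fix up once, on the exit path
        p = *(void**)p;
        i += 2;                 // the loop body only counts whole pairs
        if (!p) return i;
    }
}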