Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
        xor     eax, eax                                #6.10
        mov     rcx, QWORD PTR [rdi]                    #7.11
        test    rcx, rcx                                #7.11
        je      ..B1.6                  # Prob 1%       #7.11
        mov     rdx, rax                                #7.3
..B1.3:                                 # Preds ..B1.4 ..B1.2
        inc     rdx                                     #7.3
        mov     rcx, QWORD PTR [rcx]                    #7.11
        lea     rsi, QWORD PTR [rdx+rdx]                #9.7
        lea     rax, QWORD PTR [-1+rdx*2]               #9.7
        test    rcx, rcx                                #7.11
        je      ..B1.6                  # Prob 18%      #7.11
        mov     rcx, QWORD PTR [rcx]                    #7.11
        mov     rax, rsi                                #9.7
        test    rcx, rcx                                #7.11
        jne     ..B1.3                  # Prob 82%      #7.11
..B1.6:                                 # Preds ..B1.3 ..B1.4 ..B1.1
        ret                                             #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
- it unrolls the loop by two, so each trip through the loop does two loads and has two exit branches;
- instead of simply incrementing the counter in rax, it keeps an iteration count in rdx and materializes the return value with a pair of lea instructions (one for each possible exit point) plus a mov.
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
        mov     rdx, QWORD PTR [rdi]
        xor     eax, eax
        test    rdx, rdx
        je      .L4
.L3:
        mov     rdx, QWORD PTR [rdx]
        add     rax, 1
        test    rdx, rdx
        jne     .L3
        rep ret
.L4:
        rep ret
This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.
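To make that bottleneck concrete, here's a minimal sketch (mine, not from the question) that builds a chain of N nodes and calls count on it. Since each load's address is the result of the previous load, the CPU can't overlap iterations no matter how lean the loop body is:

#include <cstddef>
#include <cstdio>
#include <vector>

// The function from the question, unchanged.
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}

int main()
{
    const std::size_t N = 1u << 20;
    std::vector<void*> nodes(N + 1, nullptr);   // nodes[N] stays null: the sentinel
    for (std::size_t i = 0; i < N; ++i)
        nodes[i] = &nodes[i + 1];               // each slot points at the next slot
    // Each load's address comes from the previous load, so traversal costs
    // roughly one L1 load-use latency per node however the loop is compiled.
    std::printf("%ld\n", count(&nodes[0]));     // prints 1048576
}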
Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is cheaper than a taken branch, since it has no negative impact on the front-end (when it predicts correctly), and unrolling by two replaces half of the taken loop-back branches with fall-through exit checks.
Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it used more uops than two inc rax instructions would. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops.)
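Rewriting ICC's asm back into C makes this easier to see. The following is my hand-decompilation of the assembly above (the name count_icc_shape is mine), so treat it as an approximation rather than anything the compiler actually saw:

// Approximate C rendering of ICC's unrolled loop, decompiled by hand from the asm above.
long count_icc_shape(void** x)
{
    void* p = *x;               // mov rcx, QWORD PTR [rdi]
    if (!p) return 0;           // test rcx, rcx / je ..B1.6 (rax is already 0)
    long n = 0;                 // mov rdx, rax
    for (;;) {
        ++n;                    // inc rdx
        p = *(void**)p;         // mov rcx, QWORD PTR [rcx]  (first load of the pair)
        long even = 2 * n;      // lea rsi, QWORD PTR [rdx+rdx]
        long odd  = 2 * n - 1;  // lea rax, QWORD PTR [-1+rdx*2]
        if (!p) return odd;     // je ..B1.6 with rax = 2n-1
        p = *(void**)p;         // mov rcx, QWORD PTR [rcx]  (second load of the pair)
        if (!p) return even;    // mov rax, rsi; test; fall through to ret
    }
}

Per pair of nodes that's an inc, two leas and a mov spent on bookkeeping, where a single add rax, 2 (plus a one-off fix-up on the odd exit path, as in the hand-written version below) would do.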
Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).
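(As a back-of-the-envelope check: at 4 cycles per node and 4 GHz, traversing the 2^20-node chain from the sketch above costs about 4M cycles, i.e. roughly a millisecond, regardless of how the loop was compiled.)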
IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If it does, it should probably have known it was doing a poor job here, and just not unrolled at all.
For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it's useful to have the loop take as few uops as possible, so the work after it can issue sooner.
Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches: for the same example, one that's never taken, and one that's taken 7 times then not-taken the 8th time.
Here's what a hand-written unrolled-by-two implementation looks like. Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop:
count(void**):
        xor     eax, eax                # counter
        mov     rcx, QWORD PTR [rdi]    # *x
        test    rcx, rcx
        je      ..B1.6
        .p2align 4                      # mostly to make it more likely that the previous test/je doesn't decode in the same block as the following test/jz, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
        mov     rcx, QWORD PTR [rcx]
        test    rcx, rcx
        jz      .plus1
        mov     rcx, QWORD PTR [rcx]
        add     rax, 2
        test    rcx, rcx
        jnz     .loop
..B1.6:
        ret
.plus1:                                 # exit path for odd counts
        inc     rax
        ret
This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two macro-fusions in a single decode group, but earlier CPUs can't.
gcc's implementation is only 3 fused-domain uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easy to test how CPUs issue loops of fewer than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.
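For reference, here's the same unroll-by-two structure with the exit-path fix-up written back in portable C (a sketch of the idea, under my own name count_unrolled; a compiler won't necessarily turn it into exactly the asm above):

long count_unrolled(void** x)
{
    long i = 0;
    void* p = *x;
    if (!p) return 0;
    for (;;) {
        p = *(void**)p;
        if (!p) return i + 1;   // odd count: fix up once, on the exit path
        p = *(void**)p;
        i += 2;                 // the loop body only counts whole pairs
        if (!p) return i;
    }
}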