Search code examples
cassemblygcccompiler-optimizationunreachable-code

Why is gcc emitting worse code with __builtin_unreachable?


With f0 and f1 as below,

long long b;

void f0(int a) {
    a %= 10;
    if (a == 0) b += 11;
    else if (a == 1) b += 13;
    else if (a == 2) b += 17;
    else if (a == 3) b += 19;
    else if (a == 4) b += 23;
    else if (a == 5) b += 29;
    else if (a == 6) b += 31;
    else if (a == 7) b += 37;
    else if (a == 8) b += 41;
    else if (a == 9) b += 43;
}

void f1(int a) {
    a %= 10;
    if (a == 0) b += 11;
    else if (a == 1) b += 13;
    else if (a == 2) b += 17;
    else if (a == 3) b += 19;
    else if (a == 4) b += 23;
    else if (a == 5) b += 29;
    else if (a == 6) b += 31;
    else if (a == 7) b += 37;
    else if (a == 8) b += 41;
    else if (a == 9) b += 43;
    else __builtin_unreachable();
}

assuming the argument a is always positive in the program, the compiler should produce more optimized code for f1 because in f0, a can fall through the if-else block when it is negative, so the compiler should produce a default "do nothing and return" code. However in f1, the possible range of a is clearly stated with __builtin_unreachable so that the compiler doesn't have to think when a is out of range.

However, f1 actually runs slower, so I had a look at the disassembly. This is the control flow part of f0.

    jne .L2
    addq    $11, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L2:
    cmpl    $9, %eax
    ja  .L1
    movl    %eax, %eax
    jmp *.L5(,%rax,8)
    .section    .rodata
    .align 8
    .align 4
.L5:
    .quad   .L1
    .quad   .L13
    .quad   .L12
    .quad   .L11
    .quad   .L10
    .quad   .L9
    .quad   .L8
    .quad   .L7
    .quad   .L6
    .quad   .L4
    .text
    .p2align 4,,10
    .p2align 3
.L4:
    addq    $43, b(%rip)
.L1:
    ret
    .p2align 4,,10
    .p2align 3
.L6:
    addq    $41, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
...

gcc smartly turns the if-else block into a jump table and places the default case L1 inside L4 to save space.

Now have a look at the whole control flow of f1 disassembled.

    jne .L42
    movq    b(%rip), %rax
    addq    $11, %rax
.L43:
    movq    %rax, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L42:
    movl    %eax, %eax
    jmp *.L46(,%rax,8)
    .section    .rodata
    .align 8
    .align 4
.L46:
    .quad   .L45
    .quad   .L54
    .quad   .L53
    .quad   .L52
    .quad   .L51
    .quad   .L50
    .quad   .L49
    .quad   .L48
    .quad   .L47
    .quad   .L45
    .text
    .p2align 4,,10
    .p2align 3
.L47:
    movq    b(%rip), %rax
    addq    $41, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L48:
    movq    b(%rip), %rax
    addq    $37, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L49:
    movq    b(%rip), %rax
    addq    $31, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L50:
    movq    b(%rip), %rax
    addq    $29, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L51:
    movq    b(%rip), %rax
    addq    $23, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L52:
    movq    b(%rip), %rax
    addq    $19, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L53:
    movq    b(%rip), %rax
    addq    $17, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L54:
    movq    b(%rip), %rax
    addq    $13, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L45:
    movq    b(%rip), %rax
    addq    $43, %rax
    jmp .L43

Yes gcc did catch __builtin_unreachable, but for some reason, there is an unnecessary jump before each return, and the jump table has a duplicate entry of L45. Also instead of simply addq $N, b(%rip), it keeps writing movq b(%rip), %rax, addq $N, %rax, then movq %rax, b(%rip) before return.

What has made gcc produce apparently dumb code?

The binary was compiled with -O3 under Fedora Linux, and the gcc version I'm using is 11.2.1 20211203


Solution

  • Here's the best explanation I can come up with.

    The compiler can evidently do (at least somewhat) an optimization where code that is common to all branches of the if/else tree can be factored out (hoisted or sunk as appropriate). But in the f0 version, this optimization can't be applied because the "default" case has no code at all, and in particular does not either load or store b. So the compiler just optimizes the cases individually as best it can, leaving each one as a single RMW add-memory instruction.

    In the f1 version, your __builtin_unreachable has removed the default branch. So now every branch consists, conceptually, of a load of b, an add of some constant, and a store back to b. The compiler seems to notice that they all have the store in common, and thus sinks it out - the store instruction appears just once, and every case jumps to it. Unfortunately, this actually results in worse code overall, because now the individual cases can't use the RMW add; they have to do the load and add as separate instructions. Moreover, the cases can no longer just ret by themselves; they all have to jump to the factored-out store. And the compiler has somehow not realized that the load could be hoisted out, so it is needlessly duplicated across all the cases.

    I would guess part of the issue is that the hoisting/sinking is done in a target-independent pass that treats load, add, and store as independent operations. If they remain together, then some later target-specific peephole pass may combine them into the single add-memory instruction; but the earlier pass doesn't seem to consider that leaving them together might be advantageous, and thinks that any hoisting must be good. On a RISC-type load/store machine, where RMW always has to be three instructions, sinking just the store would still be somewhat helpful, maybe, but for x86 it definitely isn't.

    So it's maybe two separate missed-optimization problems. The first is not noticing that the load can be hoisted (or maybe noticing but deciding not to do it), and that one seems like a clear bug. The second is not properly evaluating whether sinking the store is worth the cost of the extra jump, and that may be more a matter of applying heuristics that happen to be wrong in this case.