Tags: c, performance, assembly, optimization, x86-64

Performance of duplicating vs deduplicating identical conditional code in loops


Say I have code like this, call it version 1:

while (some_condition) {
    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        // .. rare work here ..
        continue;
    }

    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        // .. rare work here ..
        continue;
    }

    // .. more work
}

Assume that the "rare work" is identical across both cases. We could equivalently write version 2:

while (some_condition) {
    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        goto rare_work;
    }

    // .. work that may trigger rare_condition ...

    if (rare_condition) {
        goto rare_work;
    }

    // .. more work
    continue;

  rare_work:
    // .. rare work here ..
}

The compiler should usually be smart enough to make the if() checks result in a jump directly to rare_work, rather than a jump to a block containing a jump to rare_work. The compiler may even transform version 1 to version 2 for us, because it reduces the overall amount of assembly, increasing the chance of fitting in instruction cache.

My question is: if instruction cache is not an issue, is there any reason to expect one version or the other to perform better? Assume we care about low-level micro-optimization and are willing to write the code in assembly if we have to, in order to force one version or the other. I realize things like port availability could change based on what the work instructions actually are; I'm just wondering whether there is any high-level reason to expect a difference before doing such an analysis.


Solution

  • Lundin's answer shows that compilers can make good asm from fully structured programming, even avoiding a continue, which in turn avoids putting the second half of the loop body inside an else.

    Factoring your code out into a helper function may require passing addresses of locals, which hopefully the compiler will see through after inlining, and not actually lead to worse code-gen. Even though the code is rare, it's probably better for it to be an inline part of this function in some out-of-the-way part, rather than an actual runtime call, especially if it needs read/write access to some local vars.
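    To make that concrete, here is a minimal sketch of the factoring-out approach. The names (rare_work_helper, counter) and the stand-in conditions are illustrative, not from the original code; the point is that the helper takes the address of a local, which the compiler should see through once the call is inlined.

    ```c
    #include <stdio.h>

    /* Hypothetical helper for the shared "rare work"; takes a pointer
       because the rare work needs read/write access to a local.
       After inlining, the address-taking should optimize away. */
    static inline void rare_work_helper(int *counter) {
        *counter += 100;
    }

    int main(void) {
        int counter = 0;
        for (int i = 0; i < 10; i++) {
            counter++;                       /* common-case work */
            if (i == 7) {                    /* stand-in for rare_condition */
                rare_work_helper(&counter);
                continue;
            }
        }
        printf("%d\n", counter);
        return 0;
    }
    ```

    At -O2, GCC and Clang typically inline a static inline helper like this, so the generated asm should look the same as if the rare work were written inline.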


    With the 2x if()goto version, the compiler could do an optimization similar to "tail duplication" to turn it into the first version. (The reverse might be less likely, although identical-code folding is possible if you write the same operations twice.)

    Or into asm shaped like @EricPostpischil's version, with a goto label in one if body, if it chose to lay out the asm so the rare_work block was the fall-through path from one of the branches. But the condition is rare, so we actually don't want that in the asm; we'd rather the common case be fall-through1, not jumping over the rare block.
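    For reference, a sketch of that shape in C (my reading of the version being described; the conditions and the work are stand-ins): the shared rare block lives inside one if body, and the other check jumps backward into it, which is legal C since labels have function scope.

    ```c
    #include <stdio.h>

    int main(void) {
        int total = 0;
        for (int i = 0; i < 10; i++) {
            if (i == 3) {            /* stand-in for the first rare_condition */
            rare_work:
                total += 100;        /* shared rare block inside one if body */
                continue;
            }
            total++;                 /* common-case work */
            if (i == 6)
                goto rare_work;      /* second check jumps into the if body */
        }
        printf("%d\n", total);
        return 0;
    }
    ```

    This is exactly the layout the answer argues against for a rare condition: the common path has to jump over the rare block instead of falling through.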


    The C that's closest to the asm you want is 2x if()goto, which could naturally compile to an extra block past the end of the function which jumps back into the loop. (Or at least somewhere that doesn't require a jmp inside the loop to skip it).

    Whether real compilers see it that way or would transform it significantly is another matter; gotta check the asm. @JérômeRichard's suggestion in comments of if(unlikely(rare_condition)) goto is a good idea if compilers don't put the block of rare work out-of-line.
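    A sketch of that 2x if()goto pattern with the hint applied. The unlikely() macro wrapping GCC/Clang's __builtin_expect is a common convention, not standard C, and the conditions here are stand-ins:

    ```c
    #include <stdio.h>

    /* Common convention: wrap the GCC/Clang builtin so the hint reads cleanly.
       !!(x) normalizes the condition to 0 or 1, matching the expected value. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int main(void) {
        int rare_hits = 0;
        for (int i = 0; i < 1000; i++) {
            /* first chunk of work, then first rare check */
            if (unlikely(i % 500 == 499))
                goto rare_work;
            /* second chunk of work, then second rare check */
            if (unlikely(i % 250 == 137))
                goto rare_work;
            continue;            /* common path: straight through, no rare block */
        rare_work:
            rare_hits++;         /* shared rare work */
        }
        printf("%d\n", rare_hits);
        return 0;
    }
    ```

    With the hints, compilers that respect them should lay out the rare_work block out of the common path, so both conditional branches are normally not-taken.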

    Footnote 1:

    Separate from branch prediction, a taken branch can reduce code-fetch bandwidth. e.g. in Intel CPUs, an unconditional branch ends a uop cache line. (And a predicted-taken conditional branch probably goes to another uop cache line anyway.)

    Or if running from legacy decode (or on a non-x86 that doesn't need a uop cache), a taken branch makes later bytes of machine code useless in a 16 or 32 byte block fetched in one cycle. So for code-fetch bandwidth (and I-cache locality if the condition is truly rare), you want the common case to have mostly not-taken conditional branches.

    As @JérômeRichard mentioned, never-taken branches might not even pollute branch-prediction for other branches.

    The fast path should ideally be a straight line. This is the main benefit of likely/unlikely hints, not hinting the actual runtime branch-predictor which isn't possible on most ISAs. (See Is it possible to tell the branch predictor how likely it is to follow the branch? and a specific example of GCC output in What is the advantage of GCC's __builtin_expect in if else statements?)

    Another aspect of likely/unlikely is hinting the compiler that the branch is predictable at all, so it's less likely to use cmov with a data dependency instead of branching.
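    As a small illustration of that last point (the function name select_max and the constants are mine): a plain conditional select like this is a prime candidate for cmov on x86-64 at -O2, trading a possible branch misprediction for a data dependency on the flags.

    ```c
    #include <stdio.h>

    /* Without likely/unlikely hints, compilers often emit a branchless
       cmov for this ternary: no misprediction risk, but the result
       carries a data dependency on both inputs and the comparison. */
    static int select_max(int a, int b) {
        return (a > b) ? a : b;
    }

    int main(void) {
        printf("%d\n", select_max(3, 7));
        return 0;
    }
    ```

    If the comparison is actually highly predictable, a branch can be faster than cmov, which is where the hint earns its keep.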