Tags: loops, assembly, programming-languages, micro-optimization, branch-prediction

Is there a loop construct that repeats n times without calculating some conditional?


This question arose out of the context of optimizing code to avoid branch-prediction failures; in fact, removing branches altogether.

For my example, a typical for-loop uses the following syntax:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

#define N 1000   /* upper limit; any compile-time or runtime value works */

int main()
{
    bool *S = calloc(N + 1, sizeof(bool));
    int p = 2;
    for (int i = p * p; i <= N; i += p)
        S[i] = 1;

    free(S);
    return 0;
}

As I understand it, the assembly code produced would contain some sort of conditional jump instruction to check whether i <= N.

The only way I can conceive of doing this in assembly would be to repeat the same assembly instructions n times, with i incremented in each repetition. However, for large n, the resulting binary would be huge.

So, I'm wondering, is there some loop construct that repeats n times without calculating some conditional?


Solution

  • Fully unrolling is the only practical way, and you're right that it's not appropriate for huge iteration counts (and usually not for loops where the iteration count isn't a compile-time constant). It's great for small loops with a known iteration count.
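
    A minimal C sketch of that case (my illustration, not from the original answer): with a small compile-time-constant trip count, gcc or clang at -O2/-O3 will typically unroll this fully, emitting a short run of stores with no counter, no compare, and no loop branch at all.

    #include <stdbool.h>

    // Trip count (9 iterations) is known at compile time, so the compiler
    // can fully unroll: nine stores, no conditional branch left in the loop.
    void mark_small(bool *S)
    {
        for (int i = 4; i <= 20; i += 2)
            S[i] = true;
    }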

    For your specific case, of setting memory to 1 over a given distance, x86 has rep stosb, which implements memset in microcode. It doesn't suffer from branch misprediction on current Intel CPUs, because microcode branches can't be predicted / speculatively executed and stall instead (or something), which means it has about 15 cycles of startup overhead before doing any stores. See What setup does REP do? For large aligned buffers, rep stosb/d/q is pretty close to an optimized AVX loop.

    So you don't get a mispredict, but you do get a fixed startup overhead. (I think this stalls issue / execute of instructions after the rep stos, because the microcode sequencer takes over the front end until it's done issuing all the uops. And it can't know when it's done until it's executed some that look at rcx. Issue happens in-order, so later independent instructions can't even get into the out-of-order part of the core until after rep stos figures out which uops to issue. The stores don't have to execute, but the microcode branches do.) Icelake is supposed to have "fast short REP MOV", which may finally solve the startup-overhead issue. Possibly by adding a dedicated hardware state-machine for rep movs and rep stos, like @KrazyGlew wishes he had when designing fast-strings in Intel's P6 uarch (ppro/PII/PIII), the ancestor of current Intel CPUs which still use very similar microcoded implementations. AFAIK, AMD is similar, but I haven't seen numbers or details for their rep stos startup overhead.
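
    For reference, here's a minimal GNU C sketch of using rep stosb directly (x86-64 only; the wrapper and its name are mine, not from the answer). The fill "loop" lives in microcode, so there's no taken/not-taken loop branch in your code to mispredict, just the fixed startup overhead discussed above. In practice you'd normally just call memset and let the compiler / libc choose between rep stosb and vector loops.

    #include <stddef.h>
    #include <stdint.h>

    // Store len copies of val starting at dst with a single rep stosb.
    // rep stosb takes the destination in RDI, the count in RCX and the
    // byte value in AL; the ABI guarantees DF=0 (forward direction).
    static void fill_rep_stosb(void *dst, uint8_t val, size_t len)
    {
        __asm__ volatile ("rep stosb"
                          : "+D"(dst), "+c"(len)   // both read and clobbered
                          : "a"(val)
                          : "memory");
    }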

    Most architectures don't have a single-instruction memset like that, so x86 is definitely a special case in computer architecture. But some old computers (like Atari ST) could have a "blitter" chip to offload copying memory, especially for graphics purposes. They'd use DMA to do the copying separately from the CPU altogether. This would be very similar to building a hardware state machine on-chip to run rep movs.


    Misprediction of normal loop branches

    Consider a normal asm loop, like

    .looptop:                do {
        # loop body includes a pointer increment
        # may be unrolled some to do multiple stores per conditional branch
        cmp   rdi, rdx
        jb    .looptop       } while(dst < end_dst);
    

    The branch ends up strongly predicted-taken after the loop has run a few times.

    For large iteration counts, one branch mispredict is amortized over all the loop iterations and is typically negligible. (The conditional branch at the bottom of the loop is predicted taken, so the loop branch mispredicts the one time it's not taken.)

    Some branch predictors have special support for loop branches, with a pattern predictor that can count a pattern like taken 30 times / not-taken once, up to some fairly large limit on the iteration count they can correctly predict.

    Or a modern TAGE predictor (like in Intel Sandybridge-family) uses branch history to index the entry, so it "naturally" gives you some pattern prediction. For example, Skylake correctly predicts the loop-exit branch for inner loops up to about 22 iterations in my testing. (When there's a simple outer loop that re-runs the inner loop with the same iteration count repeatedly.) I tested in asm, not C, so I had control of how much unrolling was done.
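
    A rough C sketch of that test pattern (the author's actual measurements were in hand-written asm, precisely so the compiler couldn't unroll or otherwise rewrite the inner loop; take this only as an illustration of the branch pattern, not a faithful reproduction of the benchmark):

    volatile unsigned sink;    // keep the inner loop's result alive

    // The inner loop's exit branch is taken (inner - 1) times, then
    // not-taken once, repeated outer times.  Whether that final iteration
    // mispredicts depends on whether the predictor can learn a pattern
    // with period inner.
    void nested_loop_pattern(unsigned outer, unsigned inner)
    {
        for (unsigned o = 0; o < outer; o++) {
            unsigned acc = 0;
            for (unsigned i = 0; i < inner; i++)
                acc += i;
            sink = acc;
        }
    }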

    A medium-length loop that's too long for the exit to be predicted correctly, but short enough that the exit comes around frequently, is the worst case for this: for example, an inner loop that repeatedly runs ~30 iterations on a CPU that can't predict that pattern.

    The cost of one mispredict on the last iteration can be quite low. If out-of-order execution can "see" a few iterations ahead for a simple loop counter, it can have the branch itself executed before spending much time on real work beyond that mispredicted branch. With fast recovery for branch misses (using something like a Branch Order Buffer instead of flushing the whole out-of-order pipeline), you still lose the chance to get started on independent work after the loop for a few cycles. This is most likely when the loop bottlenecks on the latency of a dependency chain but the counter itself is a separate chain. This paper about the real cost of branch misses is also interesting, and mentions this point.
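
    For example (my sketch, not from the answer): in the reduction below, the loop-carried dependency is the floating-point accumulator, while the counter, compare and branch form a separate short chain, so out-of-order execution can resolve the final not-taken branch well before the last additions finish.

    #include <stddef.h>

    // acc += a[i] is a serial FP-add chain and the real bottleneck;
    // i, the compare and the branch don't depend on it, so the branch
    // can execute (and a mispredict be detected) iterations "ahead".
    double sum_serial(const double *a, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += a[i];
        return acc;
    }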

    (Note that I'm assuming that branch prediction is already "primed", so the first iteration correctly predicts the loop branch as taken.)


    Impractical ways:

    @Hadi linked an amusing idea: instead of running code normally, compile it in a weird way where control flow and data are all handled as data, e.g. x86 using only mov instructions with x86 addressing modes + registers (and one unconditional branch at the bottom of a big block); see Is it possible to make decisions in assembly without using `jump` and `goto` at all? This is hilariously inefficient, but doesn't actually have any conditional branches: everything is a data dependency.

    It uses different forms of MOV (between registers, and immediate to register, as well as load/store), so it's not a one-instruction-set computer.
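
    That linked question is the extreme case; the everyday version of turning a control dependency into a data dependency is ordinary branchless code, e.g. a select that compilers commonly lower to cmov (or SIMD blends) instead of a conditional branch. A generic sketch of mine, not tied to the mov-only trick:

    // No if/else: the condition becomes a 0 / all-ones mask feeding plain
    // arithmetic, so there's nothing for the branch predictor to get wrong.
    int select_branchless(int cond, int a, int b)
    {
        int mask = -(cond != 0);           // -1 if cond is true, else 0
        return (a & mask) | (b & ~mask);
    }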

    A less-insane version of this is an interpreter: different instructions / operations in the code being interpreted turn into control dependencies in the interpreter (creating a hard-to-solve efficiency problem for the interpreter; see Darek Mihocka's "The Common CPU Interpreter Loop Revisited" article), but data and control flow in the guest code are both just data in the interpreter. The guest program counter is just another piece of data in the interpreter.
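
    A tiny C sketch of that last point, using an invented three-opcode bytecode (purely illustrative, not from the answer): the guest's loop and its "branch" are just updates to a pc variable, i.e. data, while the host-side cost is the dispatch, which is exactly the hard-to-predict branching Mihocka's article is about.

    #include <stdint.h>
    #include <stddef.h>

    enum { OP_ADD, OP_JNZ, OP_HALT };       // made-up guest instruction set

    int run(const uint8_t *code)
    {
        int reg = 0;
        size_t pc = 0;                      // guest program counter: just data
        for (;;) {
            switch (code[pc]) {             // host-side dispatch branch
            case OP_ADD:  reg += (int8_t)code[pc + 1]; pc += 2; break;
            case OP_JNZ:  pc = reg ? code[pc + 1] : pc + 2;     break;   // guest branch = a data move
            case OP_HALT: return reg;
            default:      return -1;        // bad opcode
            }
        }
    }

    For example, the guest program { OP_ADD, 5, OP_ADD, (uint8_t)-1, OP_JNZ, 2, OP_HALT } loops until reg reaches 0, but the only branches the host CPU ever sees are the interpreter's own dispatch and loop.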