
Time to enumerate all values in n bits


I am looking for the theoretical time it would take to go through all the possible values that can be encoded in n bits, with n being big (say, 80). I understand this would require a few instructions, i.e. a test, an addition, and a jump, but I can't figure out how the CPU would handle these operations on such a wide number, or what the total number of cycles per increment would be.

Any help appreciated, thanks!


Solution

  • Summary: one clock cycle per value, storing the result to memory, is pretty easy on an out-of-order execution CPU. Anything more than storing the results will be more overhead than incrementing.

    Instruction throughput/latency and CPU execution pipeline info from Agner Fog.


    80 bits is larger than a single register on a normal 64bit CPU, so we'll need a pair of registers to hold the values. i.e.

    uint64_t high, low;  // or a struct, or a 2-element array
    

    Whether they're separate variables, struct fields, or array elements, any decent compiler will keep them in registers. high could be a 32bit variable.

    There are two ways to go about incrementing this pair from 0 to 2^80-1:

    • slow way: low+=1, high = add-with-carry(high, 0).

    • fast way: double-nested loop:

    low = 0;
    for (high = 0 ; high < 1UL << (80-64) ; high++) {
        do {
            use_value(high,low);  // this point sees all 2^64 values of low
            low++;
        } while (low != 0);
        // low has wrapped to zero, so we don't even have to re-zero it
    }
    

    If you have to do this with 32bit code, use a triple-nested loop. (using uint64_t in 32bit code probably does it the slow way, with an add-carry every time.)
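    The slow way from above can be sketched in portable C, with the carry out of low propagated by hand on every increment. The bit widths here are tiny made-up values so the demo finishes instantly; the real problem would use 64 low bits and 16 high bits:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Slow way: check for carry out of `low` on every single increment.
     * n_high_bits / n_low_bits are tiny illustrative widths (really 16 and 64). */
    static uint64_t enumerate_slow(unsigned n_high_bits, unsigned n_low_bits)
    {
        uint64_t high = 0, low = 0, count = 0;
        do {
            count++;                                   /* use_value(high, low) */
            low = (low + 1) & ((1ULL << n_low_bits) - 1);
            if (low == 0)                              /* carry into high */
                high++;
        } while (high < (1ULL << n_high_bits));
        return count;
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)enumerate_slow(2, 3));  /* 2^5 = 32 */
        return 0;
    }
    ```

    The branch on low == 0 (or an adc in asm) sits on the critical path of every increment, which is exactly the overhead the double-nested loop avoids.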

    This should compile to an inner loop that takes only one cycle per iteration. (Intel Haswell can sustain one loop iteration per clock, but Intel Sandybridge's taken-branch throughput is only one per 2 cycles.)

    I put this on godbolt, with store(memory_order_relaxed) to atomic globals as my use(high,low), to prevent the loop from optimizing away. A store has less overhead than a function call, and doesn't require mov instructions to get the variables into the correct registers. It does in fact compile to a fairly good loop that should run one-per-clock on most CPUs.
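    A scaled-down sketch of that kind of source (the global names are illustrative): low is shrunk to 8 bits here so the demo actually terminates, but the wrap-to-zero loop condition is the same trick as the 64-bit version.

    ```c
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Relaxed atomic stores stand in for use(high, low): cheap, but they
     * stop the compiler from deleting the loop.  low is uint8_t here so the
     * demo finishes; the real version uses uint64_t and wraps after 2^64. */
    static _Atomic uint64_t global_h;
    static _Atomic uint8_t  global_l;

    static uint64_t enumerate_fast(uint64_t high_limit)
    {
        uint64_t count = 0;
        for (uint64_t high = 0; high < high_limit; high++) {
            atomic_store_explicit(&global_h, high, memory_order_relaxed);
            uint8_t low = 0;
            do {
                atomic_store_explicit(&global_l, low, memory_order_relaxed);
                low++;               /* store before increment: no extra test */
                count++;
            } while (low != 0);      /* wraps to zero after 256 values */
        }
        return count;
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)enumerate_fast(4));  /* 4 * 256 */
        return 0;
    }
    ```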

    When I had low++ before the atomic store, it used an extra test instruction, which is dumb. It should just do:

    loop80:
        xor edx, edx    # high
        xor eax, eax    # low
    .Louter_loop:
        mov QWORD PTR global_h[rip], rdx    # or this could just be a 32bit store
    .Linner_loop:
        mov QWORD PTR global_l[rip], rax    #,, low
        inc rax     # shorter encoding than add rax, 1
        jne .Linner_loop      # flags set by inc
     # inner loop is 2 fused-domain uops.  Or on AMD, 3 macro-ops since it can't fuse ALU ops with jcc
    
        inc edx     # only need to use the 32bit reg.
        cmp edx, 65536   # or count down from 65535 to zero, saving the cmp insn
        jne .Louter_loop
    
        ret
    

    With loop-unrolling (gcc -O3 -funroll-loops), gcc and clang can use lea instructions to do low+1, low+2, low+3 etc. in parallel, so the loop-carried dependency is only low+=8. This greatly lowers the branch throughput needed, but now the bottleneck becomes store throughput (one per clock).
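    The unrolled pattern corresponds to source like this sketch: eight values derived independently from the same base (the kind of adds that compile to LEA), so the only loop-carried dependency is low += 8. The range is shrunk to 2^8 and the values are summed, just so the demo has something checkable to do with them:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Unrolled-by-8 sketch: low+1 .. low+7 are independent of each other,
     * so only low += 8 is carried across iterations.  Summing stands in
     * for using the values; range shrunk to 2^8 for the demo. */
    static uint64_t sum_unrolled(void)
    {
        uint64_t sum = 0;
        for (uint64_t low = 0; low < 256; low += 8) {
            sum += low;      sum += low + 1;  sum += low + 2;  sum += low + 3;
            sum += low + 4;  sum += low + 5;  sum += low + 6;  sum += low + 7;
        }
        return sum;          /* 0 + 1 + ... + 255 */
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)sum_unrolled());  /* 32640 */
        return 0;
    }
    ```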

    If we left out the store, we could generate maybe 3 low values per clock in registers. (Intel SnB and Haswell can run LEA on two execution ports, and fused arithmetic+branch on a 3rd port.) This leaves no execution resources free to do anything with the values, but all you asked for was to enumerate them. LEA is useful here as a non-destructive 3-operand add.

    Beating that throughput would require more LEA throughput, or a different CPU architecture. e.g. most RISC CPUs (ARM, PowerPC, etc.) have 3-operand instructions for most ALU ops, instead of dest being one of the src operands. I haven't looked at instruction throughputs on the POWER architecture (and I find its instruction mnemonics really hard to read...), but it might be able to do 4 adds in parallel. If so, it could beat Intel's x86 implementations on this microbenchmark.

    Notice how we're not incrementing the global variable. Having a store->load dependency in the dep chain would make it at least 6 cycles long. Atomic increment (with lock inc [mem]) takes quite a bit longer.


    So if you need to do anything with each bit pattern, even just store it in memory, that will dominate the running time. If you don't need to store it, an out-of-order execution CPU running good code can generate multiple values in parallel per clock. (Intel and AMD's current implementations can only store once per clock.)


    If you want to start getting silly, it would be possible to use vector instructions to increment 16 (or 32, or 64 with AVX512) 8bit values and do a 16B (or 32B or 64B) store of that vector once per clock. That wouldn't produce a contiguous 80bit value in memory, though, which is why I call it silly.
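    For the curious, the silly version might look like this SSE2 sketch (x86 only; the function and buffer names are made up): sixteen byte-sized counters advance in lockstep, and one 16B store per iteration writes them all.

    ```c
    #include <emmintrin.h>       /* SSE2 intrinsics, baseline on x86-64 */
    #include <stdint.h>
    #include <stdio.h>

    /* Increment sixteen 8bit counters at once; one vector store per step
     * covers 16 values.  `out` receives the last vector that was stored. */
    static void enumerate_bytes(uint8_t out[16])
    {
        __m128i counters = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
        const __m128i step = _mm_set1_epi8(16);   /* each lane skips ahead 16 */
        for (int i = 0; i < 16; i++) {            /* 16 stores cover 0..255 */
            _mm_storeu_si128((__m128i *)out, counters);
            counters = _mm_add_epi8(counters, step);
        }
    }

    int main(void)
    {
        uint8_t buf[16];
        enumerate_bytes(buf);
        printf("%u %u\n", buf[0], buf[15]);       /* last vector: 240 255 */
        return 0;
    }
    ```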