I have a loop I use to add numbers with carry.
I'm wondering whether aligning the .done: label would give me anything. After all, the code branches there only once per call to the function. I know that a C compiler is likely to align all the branch targets involved in a loop, but I'm thinking that leaving .done: unaligned should not cause any penalty (especially since we have rather large instruction caches nowadays).
//
// corresponding C function declaration
// int add(uint64_t * a, uint64_t const * b, uint64_t const * c, uint64_t size);
//
// Compile with: gcc -c add.s -o add.o
//
// WARNING: at this point I've not worked on the input registers & the registers to save;
// do not attempt to use this very code in your C program.
    .text
    .p2align 4,,15
    .globl add
    .type add, @function
add:
    test    %rcx, %rcx
    je      .done
    clc
    xor     %rbp, %rbp
    .p2align 4,,10
    .p2align 3
.loop:
    mov     (%rax, %rbp, 8), %rdx
    adc     (%rbx, %rbp, 8), %rdx
    mov     %rdx, (%rdi, %rbp, 8)
    inc     %rbp
    dec     %rcx
    jrcxz   .done
    jmp     .loop
// -- is alignment here necessary? --
.done:
    setc    %al
    movzx   %al, %rax
    ret
Is there clear documentation about this specific case by Intel or AMD?
I actually decided to simplify by removing the loop, since I only have 3 sizes (128, 256, and 512 bits), so it's easy enough to write an unrolled version. However, I only need an add, so I don't really want to pull in GMP for this.
Here is the final code, which should work from your C program. This one is specifically for 512 bits. Use three of the add_with_carry for the 256-bit version and just one for the 128-bit version (after the initial add in each case).
//
// corresponding C function declaration
// void add512(uint64_t * dst, uint64_t const * src);
//
.macro add_with_carry offset
    mov     \offset(%rsi), %rax
    adc     %rax, \offset(%rdi)     # dst[offset/8] += src[offset/8] + CF
.endm
    .text
    .p2align 4,,15
    .globl add512
    .type add512, @function
add512:
    mov     (%rsi), %rax
    add     %rax, (%rdi)            # lowest limb: plain add, no carry-in
    add_with_carry 8
    add_with_carry 16
    add_with_carry 24
    add_with_carry 32
    add_with_carry 40
    add_with_carry 48
    add_with_carry 56
    ret
Note that I do not need the clc since I use add the first time (any incoming carry is ignored). I also made it add into the destination (i.e. dest[n] += src[n] in C) because I'm not likely to need a copy in my code. The offsets let me avoid incrementing the pointers, and they only cost one extra byte per add.
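For example, a 256-bit version built the same way just uses fewer add_with_carry lines. Here is an untested sketch (add256 is only the name I would give it; it assumes the same macro is defined in the file):
//
// void add256(uint64_t * dst, uint64_t const * src);
//
    .text
    .p2align 4,,15
    .globl add256
    .type add256, @function
add256:
    mov     (%rsi), %rax
    add     %rax, (%rdi)        # lowest limb: plain add, no carry-in
    add_with_carry 8            # the remaining three limbs propagate CF
    add_with_carry 16
    add_with_carry 24
    ret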
Holy clock cycles Batman, you're asking about efficiency when you used jrcxz over a jmp instead of just jnz after dec?
You'd only consider the slow loop or the somewhat-slow jrcxz if you were avoiding FLAGS writes entirely, e.g. by decrementing with lea -1(%rcx), %rcx. dec writes all flags except CF, which used to lead to partial-flag stalls in ADC loops on CPUs before Sandybridge, but now it's fine. A dec/jnz loop is ideal for ADC loops on modern CPUs. You might want to avoid an indexed addressing mode for the adc and/or the store (possibly with loop unrolling) so the adc can micro-fuse its load, and so the store-address uop can run on port 7 on Haswell and later. You can index the mov load relative to one of the other pointers, which you increment with LEA.
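Something like this, for example; the register assignments are only illustrative (rdi = dst, rsi = src1, r8 = src2, rcx = a nonzero count), not what your current prologue sets up:
    sub     %rsi, %r8               # r8 = src2 - src1, computed once before the carry chain starts
    clc                             # nothing is xor-zeroed here, so clear CF explicitly
.add_loop:
    mov     (%rsi,%r8), %rax        # src2[i]: a mov load is a single uop even with an indexed mode
    adc     (%rsi), %rax            # src1[i]: simple addressing keeps the adc micro-fused
    mov     %rax, (%rdi)            # simple addressing lets the store-address uop run on port 7
    lea     8(%rsi), %rsi           # advance the pointers with LEA so CF is untouched
    lea     8(%rdi), %rdi
    dec     %rcx                    # writes ZF but leaves CF alone for the next adc
    jnz     .add_loop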
But anyway no, alignment for never-taken branch targets is irrelevant. So is alignment for the fall-through path of a branch that always falls through, other than usual code-alignment / decoder effects.
Alignment for rarely taken branch targets isn't a big deal either; the penalty is maybe an extra cycle in the front-end, or fewer instructions ready for pre-decode in a clock cycle. So we're talking about something like 1 clock cycle early in the front end in the case where that path actually executes. That cost is paid on every iteration at the top of a loop, which is why aligning the top of a loop used to matter, especially on CPUs without loop buffers (and/or without uop caches and other things that hide front-end bubbles except in rare cases).
Correct branch prediction will typically hide that 1 cycle, but usually leaving a loop results in an incorrectly predicted branch unless the iteration count is small and the same every time. That first cycle might only fetch one useful instruction near the end of a fetch-block of 16 bytes (or even zero if the first instruction is split across the 16-byte boundary), with later instructions only getting loaded in the next cycle. See https://agner.org/optimize/ for Agner Fog's microarch guide and asm optimization guide. IDK how recently he's updated alignment guidelines in the asm optimization manual; I mostly just look at his updates to the microarch guide for new microarchitectures.
In general, uop caches and buffers between pipeline stages make code alignment a lot less of a big deal than it used to be. Aligning the tops of loops by 8 or 16 can still be a good idea, but otherwise it's often not worth putting an extra nop anywhere that will be executed.
You can imagine cases where it might have a bigger effect: if the preceding code never executes, aligning to a cache-line or page boundary could avoid touching an otherwise-cold cache line or page. That can't happen with your code; there are "hot" instructions less than 64 bytes before your jump target. But that's a different kind of effect from the usual goal of code alignment.
More code review:
RBP is a call-preserved register. If you want to call this from C, pick a register you aren't using for anything else, like RAX/RCX/RDX, RSI, RDI, or R8..R11. (On Windows x64 there are even fewer call-clobbered "legacy" regs, i.e. ones that don't need a REX prefix.) It looks like all your loop instructions need a REX prefix for 64-bit operand-size anyway, so using R8..R11 wouldn't cost you any extra code size.
clc is unnecessary: xor %ebp, %ebp to zero RBP already clears CF. Speaking of which, 32-bit operand-size is more efficient for xor-zeroing; it saves code size.
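For example (a sketch; R8 stands in for whichever call-clobbered register you pick as the index):
    # instead of:  clc
    #              xor  %rbp, %rbp
    xor     %r8d, %r8d          # index = 0 and CF = 0 in one instruction, no clc needed
                                # (for a legacy reg, xor %ebp,%ebp is also a byte shorter than xor %rbp,%rbp)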
You could also avoid the dec in your loop by indexing from the end of your arrays, with a negative index that counts up towards zero: e.g. rdi += len; rsi += len; and so on, with RCX = -len. Then inc %rcx / jnz works as your loop condition and as your index increment.
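A sketch of that variant, shown for a two-pointer dst[i] += src[i] loop (register use is again illustrative):
    lea     (%rdi,%rcx,8), %rdi     # point both pointers one-past-the-end
    lea     (%rsi,%rcx,8), %rsi
    neg     %rcx                    # rcx = -len; note that neg sets CF...
    clc                             # ...so clear it before the first adc
.neg_loop:
    mov     (%rsi,%rcx,8), %rax
    adc     %rax, (%rdi,%rcx,8)     # indexed adc: costs micro-fusion, as noted above
    inc     %rcx                    # the index increment is also the loop counter; inc leaves CF alone
    jnz     .neg_loop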
But like I said above, you might instead be better off with lea for a pointer increment, indexing your other arrays relative to that. (p1 -= p2, then use *(p1 + p2) and *p2, and increment both with one p2++ in asm.) So you might still want a separate counter.
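Extending the p1 -= p2 idea to all three arrays would look something like this sketch (rsi = src1 is the one pointer that advances, rcx is the separate counter; the registers are illustrative):
    sub     %rsi, %rdi              # rdi = dst  - src1, so dst[i]  is (%rdi,%rsi)
    sub     %rsi, %rdx              # rdx = src2 - src1, so src2[i] is (%rdx,%rsi)
    clc                             # the subs clobbered CF
.oneptr_loop:
    mov     (%rdx,%rsi), %rax       # src2[i]
    adc     (%rsi), %rax            # src1[i]: simple addressing keeps the adc load micro-fused
    mov     %rax, (%rdi,%rsi)       # dst[i]: indexed store, so no port 7 here (the trade-off)
    lea     8(%rsi), %rsi           # the single p2++-style increment; CF preserved
    dec     %rcx                    # separate counter
    jnz     .oneptr_loop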
You can call GMP library functions instead of writing your own extended-precision loops. They have hand-tuned asm for many different x86 microarchitectures with loop unrolling and so on.