I have a loop I use to add numbers with carry.
I'm wondering whether aligning the .done: label would give me anything. After all, the code branches there only once per call to the function. I know that a C compiler is likely to align all the branch targets involved in a loop, but I'm thinking that leaving .done: unaligned should not cause any penalty (especially since we have rather large instruction caches nowadays).
//
// corresponding C function declaration
// int add(uint64_t * a, uint64_t const * b, uint64_t const * c, uint64_t size);
//
// Compile with: gcc -c add.s -o add.o
//
// WARNING: at this point I've not worked on the input registers & the registers to save;
// do not attempt to use this very code in your C program.
    .text
    .p2align 4,,15
    .globl add
    .type add, @function
add:
    test    %rcx, %rcx
    je      .done
    clc
    xor     %rbp, %rbp
    .p2align 4,,10
    .p2align 3
.loop:
    mov     (%rax, %rbp, 8), %rdx
    adc     (%rbx, %rbp, 8), %rdx
    mov     %rdx, (%rdi, %rbp, 8)
    inc     %rbp
    dec     %rcx
    jrcxz   .done
    jmp     .loop
// -- is alignment here necessary? --
.done:
    setc    %al
    movzx   %al, %rax
    ret
Is there clear documentation about this specific case by Intel or AMD?
I actually decided to simplify by removing the loop, since I only have 3 sizes (128, 256, and 512 bits), so it's easy enough to write an unrolled version. However, I only need an add, so I don't really want to pull in GMP for this.
Here is the final code, which should work from your C program. This one is specifically for 512 bits. Use three of the add_with_carry for the 256-bit version and just one for the 128-bit version (after the initial add in each case).
//
// corresponding C function declaration
// void add512(uint64_t * dst, uint64_t const * src);
//
.macro add_with_carry offset
    mov     \offset(%rsi), %rax
    adc     %rax, \offset(%rdi)     # dst[offset/8] += src[offset/8] + CF
.endm
    .text
    .p2align 4,,15
    .globl add512
    .type add512, @function
add512:
    mov     (%rsi), %rax
    add     %rax, (%rdi)            # lowest limb: plain add, no carry-in
    add_with_carry 8
    add_with_carry 16
    add_with_carry 24
    add_with_carry 32
    add_with_carry 40
    add_with_carry 48
    add_with_carry 56
    ret
Note that I do not need the clc since I use add the first time (any incoming carry is ignored). I also made it add into the destination (i.e. dest[n] += src[n] in C) because I'm not likely to need a copy in my code. The offsets let me avoid incrementing the pointers, and they only cost one extra byte per add.
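For example, a 256-bit version built the same way just uses fewer add_with_carry lines. Here is an untested sketch (add256 is only the name I would give it; it assumes the same macro is defined in the file):
//
// void add256(uint64_t * dst, uint64_t const * src);
//
    .text
    .p2align 4,,15
    .globl add256
    .type add256, @function
add256:
    mov     (%rsi), %rax
    add     %rax, (%rdi)        # lowest limb: plain add, no carry-in
    add_with_carry 8            # the remaining three limbs propagate CF
    add_with_carry 16
    add_with_carry 24
    ret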
Holy clock cycles Batman, you're asking about efficiency when you used jrcxz over a jmp instead of just jnz after dec?
You'd only consider the slow loop or the somewhat-slow jrcxz if you were avoiding FLAGS writes entirely, e.g. by decrementing with lea -1(%rcx), %rcx. dec writes all flags except CF, which used to lead to partial-flag stalls in ADC loops on CPUs before Sandybridge, but now it's fine. A dec/jnz loop is ideal for ADC loops on modern CPUs. You might want to avoid an indexed addressing mode for the adc and/or the store (possibly with loop unrolling) so the adc can micro-fuse its load, and so the store-address uop can run on port 7 on Haswell and later. You can index the mov load relative to one of the other pointers, which you increment with LEA.
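Something like this, for example; the register assignments are only illustrative (rdi = dst, rsi = src1, r8 = src2, rcx = a nonzero count), not what your current prologue sets up:
    sub     %rsi, %r8               # r8 = src2 - src1, computed once before the carry chain starts
    clc                             # nothing is xor-zeroed here, so clear CF explicitly
.add_loop:
    mov     (%rsi,%r8), %rax        # src2[i]: a mov load is a single uop even with an indexed mode
    adc     (%rsi), %rax            # src1[i]: simple addressing keeps the adc micro-fused
    mov     %rax, (%rdi)            # simple addressing lets the store-address uop run on port 7
    lea     8(%rsi), %rsi           # advance the pointers with LEA so CF is untouched
    lea     8(%rdi), %rdi
    dec     %rcx                    # writes ZF but leaves CF alone for the next adc
    jnz     .add_loop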
But anyway no, alignment for never-taken branch targets is irrelevant. So is alignment for the fall-through path of a branch that always falls through, other than usual code-alignment / decoder effects.
Alignment for rarely taken branch targets isn't a big deal either; the penalty is maybe an extra cycle in the front-end, or fewer instructions ready for pre-decode in a clock cycle. So we're talking about something like 1 clock cycle early in the front end in the case where that path actually executes. That cost is paid on every iteration at the top of a loop, which is why aligning the top of a loop used to matter, especially on CPUs without loop buffers (and/or without uop caches and other things that hide front-end bubbles except in rare cases).
Correct branch prediction will typically hide that 1 cycle, but usually leaving a loop results in an incorrectly predicted branch unless the iteration count is small and the same every time. That first cycle might only fetch one useful instruction near the end of a fetch-block of 16 bytes (or even zero if the first instruction is split across the 16-byte boundary), with later instructions only getting loaded in the next cycle. See https://agner.org/optimize/ for Agner Fog's microarch guide and asm optimization guide. IDK how recently he's updated alignment guidelines in the asm optimization manual; I mostly just look at his updates to the microarch guide for new microarchitectures.
In general, uop caches and buffers between pipeline stages make code alignment a lot less of a big deal than it used to be. Aligning the tops of loops by 8 or 16 can still be a good idea, but otherwise it's often not worth putting an extra nop anywhere that will be executed.
You can imagine cases where it might have a bigger effect: if the preceding code never executes, aligning to a cache-line or page boundary could avoid touching an otherwise-cold cache line or page. That can't happen with your code; there are "hot" instructions less than 64 bytes before your jump target. But that's a different kind of effect from the usual goal of code alignment.
More code review:
RBP is a call-preserved register. If you want to call this from C, pick a register you aren't using for anything else, like RAX/RCX/RDX, RSI, RDI, or R8..R11. (On Windows x64 there are even fewer call-clobbered "legacy" regs, i.e. ones that don't need a REX prefix.) It looks like all your loop instructions need a REX prefix for 64-bit operand-size anyway, so using R8..R11 wouldn't cost you any extra code size.
clc is unnecessary: xor %ebp, %ebp to zero RBP already clears CF. Speaking of which, 32-bit operand-size is more efficient for xor-zeroing; it saves code size.
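For example (a sketch; R8 stands in for whichever call-clobbered register you pick as the index):
    # instead of:  clc
    #              xor  %rbp, %rbp
    xor     %r8d, %r8d          # index = 0 and CF = 0 in one instruction, no clc needed
                                # (for a legacy reg, xor %ebp,%ebp is also a byte shorter than xor %rbp,%rbp)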
You could also avoid the dec in your loop by indexing from the end of your arrays, with a negative index that counts up towards zero: e.g. rdi += len; rsi += len; and so on, with RCX = -len. Then inc %rcx / jnz works as your loop condition and as your index increment.
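A sketch of that variant, shown for a two-pointer dst[i] += src[i] loop (register use is again illustrative):
    lea     (%rdi,%rcx,8), %rdi     # point both pointers one-past-the-end
    lea     (%rsi,%rcx,8), %rsi
    neg     %rcx                    # rcx = -len; note that neg sets CF...
    clc                             # ...so clear it before the first adc
.neg_loop:
    mov     (%rsi,%rcx,8), %rax
    adc     %rax, (%rdi,%rcx,8)     # indexed adc: costs micro-fusion, as noted above
    inc     %rcx                    # the index increment is also the loop counter; inc leaves CF alone
    jnz     .neg_loop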
But like I said above, you might instead be better off with lea for a pointer increment, indexing your other arrays relative to that. (p1 -= p2, then use *(p1 + p2) and *p2, and increment both with one p2++ in asm.) So you might still want a separate counter.
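Extending the p1 -= p2 idea to all three arrays would look something like this sketch (rsi = src1 is the one pointer that advances, rcx is the separate counter; the registers are illustrative):
    sub     %rsi, %rdi              # rdi = dst  - src1, so dst[i]  is (%rdi,%rsi)
    sub     %rsi, %rdx              # rdx = src2 - src1, so src2[i] is (%rdx,%rsi)
    clc                             # the subs clobbered CF
.oneptr_loop:
    mov     (%rdx,%rsi), %rax       # src2[i]
    adc     (%rsi), %rax            # src1[i]: simple addressing keeps the adc load micro-fused
    mov     %rax, (%rdi,%rsi)       # dst[i]: indexed store, so no port 7 here (the trade-off)
    lea     8(%rsi), %rsi           # the single p2++-style increment; CF preserved
    dec     %rcx                    # separate counter
    jnz     .oneptr_loop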
You can call GMP library functions instead of writing your own extended-precision loops. They have hand-tuned asm for many different x86 microarchitectures with loop unrolling and so on.