I'm writing a loop to sum two arrays. My goal is to do it while avoiding carry checks like c = a + b; carry = (c < a). The problem is that I lose the CF when I do the loop test with the cmp instruction. Currently I am using JE and STC to test and restore the previously saved state of CF, but the jump takes roughly 7 cycles, which is a lot for what I want.
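For reference, the plain-C carry propagation I'm trying to avoid looks roughly like this (illustrative only, not my real code):
#include <stdint.h>
#include <stddef.h>
// The carry-check pattern I want to get rid of: one compare per limb.
static void add_arrays_c(uint64_t *a, const uint64_t *b, size_t n)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum  = a[i] + b[i];
        uint8_t  c1   = (sum < a[i]);    // carry out of a[i] + b[i]
        uint64_t sum2 = sum + carry;
        uint8_t  c2   = (sum2 < sum);    // carry out of adding the carry-in
        a[i]  = sum2;
        carry = c1 | c2;                 // at most one of these can be set
    }
}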
// This one works:
asm(
    "cmp $0,%0;"        // CF is cleared by cmp; check the saved carry
    "je 0f;"            // skip stc if the saved carry was 0
    "stc;"              // otherwise set CF = 1
    "0:"
    "adcq %2, %1;"      // anum += bnum + CF
    "setc %0"           // save CF for the next iteration
    : "+r" (carry), "+r" (anum)
    : "r" (bnum)
);
I already tried using SAHF (2 + 2 (mov) cycles), but that did not work.
// Does not work:
asm(
    "mov %0, %%ah;"     // try to copy the saved carry into AH
    "sahf;"             // store AH into the low byte of FLAGS (CF is bit 0)
    "adcq %2, %1;"
    "setc %0"
    : "+r" (carry), "+r" (anum)
    : "r" (bnum)
);
Does anyone know a way to set the CF more quickly? A direct move or something similar?
Looping without clobbering CF
will be faster. See that link for some better asm loops.
Don't try to write just the adc
with inline asm inside a C loop. It's impossible for that to be optimal, because you can't ask gcc not to clobber flags. Trying to learn asm with GNU C inline asm is harder than writing a stand-alone function, esp. in this case where you are trying to preserve the carry flag.
You could use setnc %[carry]
to save and subb $1, %[carry]
to restore. (Or cmpb $1, %[carry]
I guess.) Or as Stephen points out, negb %[carry]
.
0 - 1
produces a carry, but 1 - 1
doesn't.
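A rough sketch of that setnc / sub pairing (untested; note that carry holds the inverse of CF between iterations, so it has to start at 1):
uint8_t carry = 1;                   // 1 means "no carry" with this scheme
int64_t anum, bnum;
for (...) {
    asm ( "subb $1, %[carry]\n\t"    // 0 - 1 sets CF, 1 - 1 clears it
          "adc %[bnum], %[anum]\n\t"
          "setnc %[carry]\n\t"       // save the inverted carry-out
          : [carry] "+&r" (carry), [anum] "+r" (anum)
          : [bnum] "rme" (bnum)
          : // no clobbers
    );
}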
Use a uint8_t variable to hold the carry, since you will never add it directly to %[anum]. This avoids any chance of partial-register slowdowns. e.g.
uint8_t carry = 0;
int64_t anum, bnum;
for (...) {
    asm ( "negb %[carry]\n\t"        // neg sets CF if the source is non-zero: 0 -> CF=0, 1 -> CF=1
          "adc %[bnum], %[anum]\n\t"
          "setc %[carry]\n\t"        // save CF for the next iteration
          : [carry] "+&r" (carry), [anum] "+r" (anum)
          : [bnum] "rme" (bnum)
          : // no clobbers
    );
}
You could also provide an alternate constraint pattern for register source, reg/mem dest. I used an x86 "e"
constraint instead of "i"
, because 64-bit mode still only allows 32-bit sign-extended immediates. gcc will have to get larger compile-time constants into a register on its own. Carry is early-clobbered, so even if it and bnum
were both 1
to start with, gcc couldn't use the same register for both inputs.
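Spelled out, that multi-alternative version might look like this (untested sketch; constraint alternatives are comma-separated, and every operand needs the same number of them). Alternative 1 is reg source / reg-or-mem dest; alternative 2 is reg dest with a reg, mem, or 32-bit sign-extended immediate source:
asm ( "negb %[carry]\n\t"
      "adc %[bnum], %[anum]\n\t"
      "setc %[carry]\n\t"
      : [carry] "+&r,&r" (carry), [anum] "+rm,r" (anum)
      : [bnum] "r,rme" (bnum)
      : // no clobbers
);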
This is still terrible, and increases the length of the loop-carried dependency chain from 2c to 4c (Intel pre-Broadwell), or from 1c to 3c (Intel BDW/Skylake, and AMD).
So your loop runs at 1/3rd speed because you're using a kludge instead of writing the whole loop in asm.
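For comparison, here is a minimal sketch of the whole loop in one asm statement, so CF stays live across iterations (made-up function name, assumes n >= 1, untested):
#include <stdint.h>
#include <stddef.h>
// Add b[] into a[] (n 64-bit limbs), carrying between limbs.
// inc/dec handle the loop bookkeeping because they leave CF untouched.
static uint8_t add_limbs(uint64_t *a, const uint64_t *b, size_t n)
{
    uint8_t carry;
    uint64_t tmp, idx;
    asm ( "xor %k[idx], %k[idx]\n\t"          // idx = 0, and clears CF
          "1:\n\t"
          "mov (%[b],%[idx],8), %[tmp]\n\t"
          "adc %[tmp], (%[a],%[idx],8)\n\t"   // a[i] += b[i] + CF
          "inc %[idx]\n\t"                    // doesn't touch CF
          "dec %[n]\n\t"                      // doesn't touch CF, sets ZF
          "jnz 1b\n\t"
          "setc %[carry]"                     // final carry-out
          : [carry] "=r" (carry), [tmp] "=&r" (tmp),
            [idx] "=&r" (idx), [n] "+r" (n)
          : [a] "r" (a), [b] "r" (b)
          : "cc", "memory"
    );
    return carry;
}
Here the loop-carried dependency is just the adc itself, which is the whole point of keeping CF alive instead of saving and restoring it every iteration.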
A previous version of this answer suggested adding the carry directly, instead of restoring it into CF
. This approach has a fatal flaw: it mixed up the incoming carry into this iteration with the outgoing carry going to the next iteration.
Also, sahf is Store AH into Flags; lahf is Load AH from Flags (it copies the whole low 8 bits of FLAGS). Pair those instructions; don't use sahf on a 0 or 1 that you got from setc.
Read the insn set reference manual for any insns that don't seem to be doing what you expect. See https://stackoverflow.com/tags/x86/info