Search code examples
c++assemblyinline-assemblycarryflag

Fastest way to set a Carry Flag


I'm doing a cycle to sum two arrays. My objective is do it by avoiding carry checks c = a + b; carry = (c<a). I lost the CF when I do the loop test, with the cmp instruction.

Currently, i am using and the JEand STC to test and set the previously saved state of CF. But the jump takes more less 7 cycles, what it is a lot for what I want.

   //This one is working
   asm(
        "cmp $0,%0;"
        "je 0f;"
        "stc;"
    "0:"   
        "adcq %2, %1;"
        "setc %0"

    : "+r" (carry), "+r" (anum)
    : "r" (bnum)
   );

I already tried use the SAHF (2 + 2(mov) cycles), but that do not worked.

   //Do not works
   asm(
        "mov %0, %%ah;"
        "sahf;"
        "adcq %2, %1;"
        "setc %0"

        : "+r" (carry), "+r" (anum)
        : "r" (bnum)
   );

Anyone knows a way to set the CF more quickly? Like a direct move or something similar..


Solution

  • Looping without clobbering CF will be faster. See that link for some better asm loops.

    Don't try to write just the adc with inline asm inside a C loop. It's impossible for that to be optimal, because you can't ask gcc not to clobber flags. Trying to learn asm with GNU C inline asm is harder than writing a stand-alone function, esp. in this case where you are trying to preserve the carry flag.

    You could use setnc %[carry] to save and subb $1, %[carry] to restore. (Or cmpb $1, %[carry] I guess.) Or as Stephen points out, negb %[carry].

    0 - 1 produces a carry, but 1 - 1 doesn't.

    Use a uint8_t to variable to hold the carry, since you will never add it directly to %[anum]. This avoids any chance of partial-register slowdowns. e.g.

    uint8_t carry = 0;
    int64_t numa, numb;
    
    for (...) {
        asm ( "negb   %[carry]\n\t"
              "adc    %[bnum], %[anum]\n\t"
              "setc   %[carry]\n\t"
              : [carry] "+&r" (carry), [anum] "+r" (anum)
              : [bnum] "rme" (bnum)
              : // no clobbers
            );
    }
    

    You could also provide an alternate constraint pattern for register source, reg/mem dest. I used an x86 "e" constraint instead of "i", because 64bit mode still only allows 32bit sign-extended immediates. gcc will have to get larger compile-time constants into a register on its own. Carry is early-clobbered, so even if it and bnum were both 1 to start with, gcc couldn't use the same register for both inputs.

    This is still terrible, and increases the length of the loop-carried dependency chain from 2c to 4c (Intel pre-Broadwell), or from 1c to 3c (Intel BDW/Skylake, and AMD).

    So your loop runs at 1/3rd speed because you're using a kludge instead of writing the whole loop in asm.


    A previous version of this answer suggested adding the carry directly, instead of restoring it into CF. This approach has a fatal flaw: it mixed up the incoming carry into this iteration with the outgoing carry going to the next iteration.


    Also, sahf is Set AH from Flags. lahf is Load AH into Flags (and it operates on the whole low 8 bits of flags. Pair those instructions; don't use lahf on a 0 or 1 that you got from setc.

    Read the insn set reference manual for any insns that don't seem to be doing what you expect. See https://stackoverflow.com/tags/x86/info