Tags: c++, assembly, x86, bigint, carry-flag

Using Intel inline assembler to code bigint add with carry


I would like to write fast code for adding 64-bit numbers in big ints:

uint64_t ans[n];
uint64_t a[n], b[n]; // assume initialized values....
for (int i = 0; i < n; i++)
  ans[i] = a[i] + b[i];

but the above does not work with carry.

I saw another question here that suggested checking for the carry with a comparison, which is elegant:

ans[0] = a[0] + b[0];
int c = ans[0] < a[0];
for (int i = 1; i < n; i++) {
  ans[i] = a[i] + b[i] + c;
  c = ans[i] < a[i];  // note: misses the carry when c == 1 and b[i] == UINT64_MAX
}

However, I would like to learn how to embed inline (Intel) assembly and do it faster. I am sure there are 64-bit opcodes, the equivalent of:

add eax, ebx
adc ...

but I don't know how to pass parameters to the assembler from the rest of the C++ code.


Solution

  • but the above does not work with carry.

    If you mean that GCC does not generate code that uses the ADC instruction, that's because its optimizer has determined that there is a more efficient way to implement the addition.

    Here is my test version of your code. I have extracted the arrays out as parameters passed to the function so that the code cannot be elided and we can limit our study to the relevant portions.

    void Test(uint64_t* a, uint64_t* b, uint64_t* ans, int n)
    {
        for (int i = 0; i < n; ++i)
        {
            ans[i] = a[i] + b[i];
        }
    }
    

    Now, indeed, when you compile this with a modern version of GCC and look at the disassembly, you'll see a bunch of crazy-looking code.

    The Godbolt compiler explorer is helpful enough that it color-codes lines of C source and their corresponding assembly code (or at least, it does so to the best of its ability; this isn't perfect in optimized code, but it works well enough here). In that view, the code implementing the 64-bit addition in the inner body of the loop shows that GCC is emitting SSE2 instructions to do the addition. Specifically, you can pick out MOVDQU (which does an unaligned move of a double quadword from memory into an XMM register), PADDQ (which does an addition on packed integer quadwords), and MOVQ (which moves a quadword from an XMM register into memory). Roughly speaking, for a non-assembly expert, MOVDQU is how it loads the 64-bit integer values, PADDQ does the addition, and MOVQ stores the result.
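
    To make those instruction names concrete, here is roughly what the vectorized loop body amounts to when written out by hand with SSE2 intrinsics. This is only an illustration of the technique, not GCC's literal output, and the function name TestSse2 is my own:

    #include <cstdint>
    #include <cstddef>
    #include <emmintrin.h>  // SSE2 intrinsics

    void TestSse2(const uint64_t* a, const uint64_t* b, uint64_t* ans, size_t n)
    {
        size_t i = 0;
        for (; i + 2 <= n; i += 2)
        {
            __m128i va  = _mm_loadu_si128((const __m128i*)(a + i));  // MOVDQU load
            __m128i vb  = _mm_loadu_si128((const __m128i*)(b + i));  // MOVDQU load
            __m128i sum = _mm_add_epi64(va, vb);                     // PADDQ
            _mm_storeu_si128((__m128i*)(ans + i), sum);              // store both results
        }
        for (; i < n; ++i)  // scalar remainder, much as the compiler also emits
            ans[i] = a[i] + b[i];
    }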

    Part of what makes GCC's output especially noisy and confusing is that it is vectorizing the loop: the main body handles two elements per iteration, surrounded by extra setup and remainder code. If you disable auto-vectorization (-fno-tree-vectorize), you get output that is easier to read, although it's still doing the addition with SSE2 instructions. (Well, mostly. Now it's using MOVQ everywhere, for both loads and stores, instead of loading with MOVDQU.)

    On the other hand, if you specifically forbid the compiler from using SSE2 instructions (-mno-sse2), you see output that is significantly different. Now, because it can't use SSE2 instructions, it's emitting basic x86 instructions to do the 64-bit addition—and the only way to do it is ADD + ADC.
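
    Expressed in C, that lowering splits each 64-bit addition into two 32-bit halves and folds the low half's carry into the high half, which is exactly what the ADD/ADC pair does in two instructions. (A sketch of the idea, with a function name of my own choosing; this is not compiler output:)

    #include <cstdint>

    uint64_t Add64On32(uint64_t x, uint64_t y)
    {
        uint32_t lo    = (uint32_t)x + (uint32_t)y;   // ADD: low halves, conceptually sets CF
        uint32_t carry = lo < (uint32_t)x;            // recover the carry in C
        uint32_t hi    = (uint32_t)(x >> 32)
                       + (uint32_t)(y >> 32) + carry; // ADC: high halves plus carry
        return ((uint64_t)hi << 32) | lo;
    }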

    I suspect that the ADD + ADC output is the code you were expecting to see. Clearly, GCC believes that vectorizing the operation results in faster code, so this is what it does when you compile with the -O2 or -O3 flags. At -O1, it always uses ADD + ADC. This is one of those cases where fewer instructions do not imply faster code. (Or at least, GCC doesn't think so. Benchmarks on actual code might tell a different story: the vectorization overhead may be significant in contrived scenarios yet irrelevant in real-world ones.)

    For what it's worth, Clang behaves very similarly to GCC here.


    If you meant that this code doesn't carry the result of the previous addition over to the next addition, then you're right. The second snippet of code that you've shown implements that algorithm, and GCC does compile this using the ADC instruction.

    At least, it does when targeting x86-32. When targeting x86-64, where native 64-bit integer registers are available, no instruction-level "carrying" is needed for any single 64-bit addition; a plain ADD suffices (the carry between array elements is still propagated through c), so significantly less code is required. In fact, an individual 64-bit addition is only "bigint" arithmetic on 32-bit architectures, which is why I have assumed x86-32 in all of the foregoing analysis and compiler output.

    In a comment, Ped7g wonders why compilers don't seem to recognize the ADD+ADC chain idiom. I'm not entirely sure what he's referring to here, since he didn't share any examples of the input code that he tried, but as I've shown, compilers do use the ADC instruction here. What they won't do is chain the carry across loop iterations. That is too difficult to do safely in practice, because so many instructions modify the flags. Someone hand-writing the assembly code might be able to do it, but compilers won't.

    (Note that c should probably be an unsigned integer to encourage certain optimizations. In this case, it just ensures that GCC uses an XOR instruction to zero the carry's upper half before the 64-bit addition, instead of a CDQ sign extension. XOR is slightly faster, though not a huge improvement, and mileage may vary with real code.)

    (Also, it's disappointing that GCC is unable to emit branchless code for setting c inside of the loop. With sufficiently random input values, branch prediction will fail, and you'll end up with relatively inefficient code. There are almost certainly ways of writing the C source to persuade GCC to emit branchless code, but that's an entirely different answer.)
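
    (For the curious, here is one such way, a minimal sketch of mine rather than anything from the discussion above: on a 64-bit target with GCC or Clang, you can accumulate through the unsigned __int128 extension so that the carry falls out of the high half and there is no comparison for the compiler to turn into a branch.)

    #include <cstdint>

    void AddBranchless(const uint64_t* a, const uint64_t* b, uint64_t* ans, int n)
    {
        unsigned __int128 t = 0;     // holds carry + 64-bit sum; GCC/Clang extension
        for (int i = 0; i < n; ++i) {
            t += a[i];
            t += b[i];
            ans[i] = (uint64_t)t;    // low 64 bits are this limb's result
            t >>= 64;                // whatever survives in the high half is the carry
        }
    }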


    I would like to learn how to embed inline (intel) assembly and do it faster.

    Well, we've already seen that it might not necessarily be faster if you naïvely caused a bunch of ADC instructions to be emitted. Don't hand-optimize unless you are confident that your assumptions about performance are correct!

    Also, not only is inline assembly difficult to write, debug, and maintain, but it may even make your code slower because it inhibits certain optimizations that could otherwise have been done by the compiler. You need to be able to prove that your hand-coded assembly is a significant enough performance win over what the compiler would generate that these considerations become less relevant. You also should confirm that there is no way to get the compiler to generate code that is close to your ideal output, either by altering flags or cleverly writing the C source.

    But if you really wanted to, you could read any of a variety of online tutorials that teach you how to use GCC's inline assembler, and of course there is the GCC manual itself. All will explain how "extended asm" allows you to specify input operands and output operands, which will answer your question of "how to pass parameters to the assembler from the rest of the C++ code".
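
    As a taste of what extended asm looks like, here is a minimal sketch of a single 64-bit add-with-carry step for GCC on x86-64 (note that GCC's default syntax is AT&T, with source before destination, despite the "Intel" in your title). The constraint strings are the part that answers your question: "+r" binds a C++ variable to a register for both reading and writing, "r" binds one for reading, and the "cc" clobber tells the compiler the flags are modified. The function name and the exact instruction sequence are my own choices:

    #include <cstdint>

    uint64_t AddWithCarry(uint64_t a, uint64_t b, unsigned char& carry)
    {
        uint64_t sum = a;
        asm("addb $-1, %[c]\n\t"   // reload CF from the incoming 0/1 carry byte
            "adcq %[b], %[s]\n\t"  // sum += b + CF
            "setc %[c]"            // write CF back out as a 0/1 byte
            : [s] "+r" (sum), [c] "+q" (carry)
            : [b] "r" (b)
            : "cc");
        return sum;
    }

    A caller would thread the carry byte through a loop over the limbs. Note that a chain of these blocks is precisely the kind of thing the compiler cannot optimize across, which is part of why inline assembly should be a last resort.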

    As paddy and Christopher Oicles suggested, you should prefer intrinsics to inline assembly. There are intrinsics for add-with-carry: _addcarry_u32 and _addcarry_u64 typically cause plain ADC instructions to be emitted. Where the intrinsics don't fit, inline assembly is your recourse—that, or what I already suggested of writing the C source so that the compiler will do the Right Thing™ on its own.

    There are also _addcarryx_u32 and _addcarryx_u64 intrinsics. These cause ADCX or ADOX instructions to be emitted. These are "extended" versions of ADC that can produce more efficient code. They are part of the Intel ADX instruction set, introduced with the Broadwell microarchitecture. In my opinion, Broadwell does not have sufficiently high market penetration that you can simply emit ADCX or ADOX instructions and call it a day. Lots of users still have older machines, and it's in your interest to support them to the extent possible. These intrinsics are great if you're preparing builds tuned for specific architectures, but I would not recommend them for general use.
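
    For illustration, a carry-propagating loop built on _addcarry_u64 might look like the following sketch. (Header availability varies: MSVC declares it in <intrin.h>, while reasonably recent GCC and Clang provide it via <immintrin.h>; the cast is there because uint64_t is not always the same type as unsigned long long.)

    #include <cstdint>
    #include <cstddef>
    #include <immintrin.h>

    void AddBigint(const uint64_t* a, const uint64_t* b, uint64_t* ans, size_t n)
    {
        unsigned char c = 0;  // carry in/out: always 0 or 1
        for (size_t i = 0; i < n; ++i)
            c = _addcarry_u64(c, a[i], b[i],
                              (unsigned long long*)&ans[i]);
    }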


    I am sure there are 64-bit opcodes, the equivalent of: add+adc

    There are 64-bit versions of ADD and ADC (and ADCX and ADOX) when you're targeting a 64-bit architecture. This would then allow you to implement 128-bit "bigint" arithmetic using the same pattern.
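
    In plain C++, that pattern is the sketch below; GCC and Clang typically recognize the sum < x carry idiom and compile it down to exactly an ADD/ADC pair on x86-64. (The struct and function names are mine:)

    #include <cstdint>

    struct U128 { uint64_t lo, hi; };

    U128 Add128(U128 x, U128 y)
    {
        U128 r;
        r.lo = x.lo + y.lo;                  // ADD: low 64 bits, sets the carry
        r.hi = x.hi + y.hi + (r.lo < x.lo);  // ADC: high 64 bits plus carry
        return r;
    }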

    On x86-32, there are no 64-bit versions of these instructions in the base instruction set. You must turn to SSE2, as we saw GCC and Clang do.