Search code examples
cassemblyx86twos-complementbigint

Signed Multiplication of 1024-bit 2's complement numbers using 32-bit chunks


So I have the following struct definition for my 1024-bit number (I want to use 2's complement representation here and I am on a 32-bit system):

typedef struct int1024
{
    int32_t num[32]; //should I use uint32_t?
} int1024;

Basically an array that holds the segments of the number.

For add since, signed and unsigned addition are the same. I can simply use the instructions add and adc to carry out the extended operation for my bigger array.

Now for multiplication. I was able to get an unsigned multiplication to work using a combination of imul (using the upper and lower results) and adc using the classic O(n^2) multiplication algorithm. However I need signed multiplication for my results. I know people say just to use the signed magnitude version. i.e. take the 2's complement when negative and at the end apply 2's complement if needed. But all the extra noting and adding 1 will be really expensive, since I need to do a lot of multiplications. Is there a technique for doing large signed multiplication for number representations like this. (I don't really want any branching or recursive calls here). I guess one of my main problems is how to deal with carry and the high part of the multiplication of negative numbers. (just an idea, maybe I could use a combination of signed and unsigned multiplication, but I don't know where to begin). If you don't have time for 1024 bits, a 128 bit answer will suffice (I will just generalize it).


Solution

  • Note that in a 1024-bit integer only the very top bit, bit 1023, is the sign bit and has a place-value of -(2**1023). All the other bits have their normal +(2**pos) place value. i.e. the lower "limbs" all need to be treated as unsigned when you do widening multiplies on them.

    You have one sign bit and 1023 lower bits. Not one sign bit per limb.

    Also, the difference between signed and unsigned multiply is only in the high half of a widening multiply (N x N => 2N bits). That's why x86 has separate instructions for imul r/m64 and mul r/m64 to do a full-multiply into RDX:RAX (https://www.felixcloutier.com/x86/imul vs. https://www.felixcloutier.com/x86/mul). But for non-widening there's only imul r, r/m and imul r, r/m, imm which compilers use for both unsigned and signed (and so should humans).

    Since you want a 1024x1024 => 1024-bit product which discards the upper 1024 bits, you can and should just make your whole thing unsigned.

    When you're writing in asm, you can use whatever chunk size you can, e.g. 64-bit in 64-bit mode, not limiting yourself to the C chunk size.

    See https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for how to use BMI2 mulx (leaves FLAGS untouched so doesn't disturb adc chains, and only has RDX as an implicit input, with other source and both outputs being explicit, saving on MOV instructions). And optionally also Broadwell ADOX / ADCX to run two dep chains in parallel to get more ILP for medium sized BigInteger stuff like 512x512-bit or 512x64-bit (which they use as an example).