Can I parallelize huge integer additions?

I will need to add two unsigned 256 megabit integers over two billion times. Since carrying is obviously very important in addition and cannot be determined without waiting for lower order bits to be added, are there any performance gains to be had from multicore CPU features, such as splitting the number into multiple parts and dealing with carries later?

Solution

You can definitely split this up into many pieces. For example, take these two numbers:

  12345
+ 67890

Now we'll split them after the third digit, between the hundreds and tens columns. This gives us:

  123      45
+ 678    + 90

Add each pair separately:

  123      45
+ 678    + 90
-------------
  801     135

For the left-hand pair you need to know how many digits you chopped off, in this case two, so append two zeros to 801, giving you 80100. Then add 135 to it, and you have 80235.
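Here is a minimal sketch of that split-and-recombine step, using Python integers to stand in for the decimal pieces. The function name add_by_split and its low_digits parameter are illustrative, not part of any library.

```python
def add_by_split(a: int, b: int, low_digits: int) -> int:
    """Add a and b by splitting both after the low `low_digits` decimal digits."""
    base = 10 ** low_digits

    # Split each number into a left-hand (high) and right-hand (low) piece.
    a_high, a_low = divmod(a, base)
    b_high, b_low = divmod(b, base)

    # The two piece-wise additions do not depend on each other.
    high_sum = a_high + b_high   # 123 + 678 = 801
    low_sum = a_low + b_low      # 45 + 90 = 135

    # Shift the left-hand sum back into place and add the right-hand sum.
    return high_sum * base + low_sum


print(add_by_split(12345, 67890, 2))  # 80235
```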

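You could do this with much larger numbers, and with as many splits as you like. Each piece can be added independently of the others; the only interaction between pieces is the carry out of each piece into the piece on its left, and that is dealt with when you recombine.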

Of course, when you recombine large numbers this way you're still left with large additions. To avoid that, you can work out how much carried out of the right-hand piece and add just that small amount to your left-hand number.

For instance, in the example above, the right-hand result grew from two columns to three, giving 135. The extra column is the carry, which can be added to your 801. This way you only add a small number to the left-hand result, and then just concatenate the two results like you would strings, keeping any leading zeros in the right-hand piece.

45 and 90 each took up two columns, and their sum is 135. We take any extra columns generated, in this case just the 1, and add it to our left-hand number, 801.

801 + 1 = 802   
802 concatenated with 35 = 80235
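
The carry-then-concatenate version can be sketched as follows. The name add_pieces and its parameters are made up for illustration; the right-hand piece is zero-padded so that a result such as 5 in a two-column piece is written as 05.

```python
def add_pieces(a_high: int, a_low: int, b_high: int, b_low: int, digits: int) -> str:
    """Add two numbers given as (high, low) pieces; the low piece has `digits` columns."""
    base = 10 ** digits

    # Anything beyond `digits` columns in the right-hand sum is the carry.
    carry, low = divmod(a_low + b_low, base)   # 45 + 90 -> carry 1, low 35

    # Only the small carry is added to the left-hand sum.
    high = a_high + b_high + carry             # 801 + 1 = 802

    # Concatenate like strings, padding the low piece to its full width.
    return str(high) + str(low).zfill(digits)


print(add_pieces(123, 45, 678, 90, 2))  # 80235
```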

If you want something extremely efficient, look up how 32-bit processors add 64-bit or larger numbers. They do something similar: add the two least significant 32-bit halves first, then add the most significant halves together with the carry out of the low half, typically using an add-with-carry instruction.
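
As a rough software model of that word-by-word addition, the sketch below assumes each number is stored as a list of 32-bit words, least significant word first. The names MASK32 and add_words are illustrative; real hardware does this with a carry flag rather than a Python loop.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits of a sum


def add_words(a, b, carry_in=0):
    """Add two equal-length lists of 32-bit words, least significant word first.

    Returns (sum_words, carry_out), where carry_out is 0 or 1.
    """
    out = []
    carry = carry_in
    for x, y in zip(a, b):
        s = x + y + carry        # at most 33 bits wide
        out.append(s & MASK32)   # low 32 bits stay in this word
        carry = s >> 32          # bit 32 is the carry into the next word
    return out, carry


# 0x00000001_FFFFFFFF + 0x00000000_00000001 = 0x00000002_00000000
print(add_words([0xFFFFFFFF, 0x1], [0x1, 0x0]))  # ([0, 2], 0)
```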

As for parallelization, split each number into 32-bit words so that you have a list of word pairs to add. Then determine how many threads the CPU can run at once, divide the list of pairs into that many contiguous blocks, and give one block to each thread. As each thread finishes, store its partial sum, together with the carry out of its block, in the results.
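
The division of work described above might look like the sketch below. It assumes the numbers are lists of 32-bit words, least significant first, and uses Python's concurrent.futures only to show the structure; in standard CPython, threads will not actually speed up pure-Python arithmetic, so treat this as an outline to port to a language with real threads. The names add_block and parallel_partial_sums are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MASK32 = 0xFFFFFFFF


def add_block(pair):
    """Add one block of word pairs, assuming no carry comes in from the right."""
    a_block, b_block = pair
    out, carry = [], 0
    for x, y in zip(a_block, b_block):
        s = x + y + carry
        out.append(s & MASK32)
        carry = s >> 32
    return out, carry  # the carry out still has to be applied to the next block


def parallel_partial_sums(a, b, workers=None):
    """Split a and b into contiguous blocks and add each block in its own task."""
    workers = workers or os.cpu_count() or 1
    size = max(1, -(-len(a) // workers))  # ceiling division, at least 1
    pairs = [(a[i:i + size], b[i:i + size]) for i in range(0, len(a), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in block order, lowest block first.
        return list(pool.map(add_block, pairs))


a = [0xFFFFFFFF, 0xFFFFFFFF, 1, 0]
b = [1, 0, 0, 0]
print(parallel_partial_sums(a, b, workers=2))  # [([0, 0], 1), ([1, 0], 0)]
```

Each block's result is only provisional: the carry out of one block has not yet been added to the block on its left.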

The tricky part is carrying from the least significant pieces to the most significant once you get all the results back, because adding even a single 1 to a number can cause it to roll over and carry into the next number as well.
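
One way to handle that final pass is sketched below. It assumes each block's partial result arrives as a (words, carry_out) pair, lowest block first, with 32-bit words stored least significant first; propagate_carries is an illustrative name. The pass is sequential, but it only touches a block's words when a carry actually arrives, and usually stops after the first word.

```python
MASK32 = 0xFFFFFFFF


def propagate_carries(partials):
    """Apply each block's carry out to the block above it and join the blocks.

    `partials` is a list of (words, carry_out) pairs, lowest block first.
    Returns (all_words, final_carry).
    """
    result = []
    carry = 0
    for words, carry_out in partials:
        words = list(words)
        i = 0
        # Add the incoming 1 and keep going only while words roll over to 0.
        while carry and i < len(words):
            words[i] = (words[i] + 1) & MASK32
            carry = 1 if words[i] == 0 else 0
            i += 1
        # Here carry is 1 only if the increment rippled through the whole block.
        # A block that already carried out cannot also roll over, so OR is safe.
        carry |= carry_out
        result.extend(words)
    return result, carry


partials = [([0, 0], 1), ([0xFFFFFFFF, 0xFFFFFFFF], 0), ([5], 0)]
print(propagate_carries(partials))  # ([0, 0, 0, 0, 6], 0)
```

The example shows the roll-over case: the carry out of the first block turns the all-ones second block into zeros and continues into the third block.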