Tags: performance, integers, sse, bignum, arbitrary-precision

Can long integer routines benefit from SSE?


I'm still working on routines for arbitrary-precision integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.

Everything works fine, but I wondered if I can speed it up a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use, and here is why:

  • SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g., is there an integer less-than compare?)

  • The SSE idea is SIMD (single instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128-bit integer add (128-bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)

  • The integer additions and subtractions handle neither input nor output carries, so it's very cumbersome (and thus slow) to emulate them by hand (see the sketch below).
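
For illustration, here is roughly what merely detecting the lane carry-outs looks like with SSE intrinsics (my own sketch; since SSE has no unsigned 64-bit compare, it flips the sign bits and uses the signed 64-bit compare from SSE4.2). Actually propagating those carries across lanes would take further shuffles, possibly in a loop:

    #include <immintrin.h>
    #include <cstdint>

    // Add two lanes of 64-bit integers and compute each lane's carry-out.
    // carry_out is all-ones in lanes where (x + y) wrapped, zero elsewhere.
    static inline __m128i add64_with_carryout(__m128i x, __m128i y,
                                              __m128i* carry_out) {
        const __m128i sign = _mm_set1_epi64x(INT64_MIN);
        __m128i sum = _mm_add_epi64(x, y);
        // Unsigned (sum < x), done as a signed compare with flipped sign bits.
        *carry_out = _mm_cmpgt_epi64(_mm_xor_si128(x, sign),    // SSE4.2
                                     _mm_xor_si128(sum, sign));
        return sum;
    }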

My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?


Solution

  • In the past, the answer to this question was a solid "no". But as of 2017, the situation is changing.

    But before I continue, time for some background terminology:

    1. Full Word Arithmetic
    2. Partial Word Arithmetic


    Full-Word Arithmetic:

    This is the standard representation where the number is stored in base 2^32 or 2^64 using an array of 32-bit or 64-bit integers. Many bignum libraries and applications (including GMP) use this representation.

    In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But operations like addition are more difficult because of the need for carry-propagation.

    It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
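
    To make that concrete, here is a minimal sketch of a full-word add using the _addcarry_u64 intrinsic (the adc instruction). Every iteration consumes the carry produced by the previous one, so there is no lane-wise independence for SIMD to exploit:

        #include <immintrin.h>  // _addcarry_u64 on GCC/Clang/MSVC
        #include <cstddef>

        // Full-word bignum add: little-endian limbs, equal lengths.
        // Each step consumes the previous step's carry -- one long
        // serial dependency chain.
        unsigned char add_fullword(unsigned long long* sum,
                                   const unsigned long long* x,
                                   const unsigned long long* y,
                                   size_t n) {
            unsigned char carry = 0;
            for (size_t i = 0; i < n; ++i)
                carry = _addcarry_u64(carry, x[i], y[i], &sum[i]);
            return carry;  // carry-out of the entire number
        }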


    Partial-Word Arithmetic

    This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.

    The authors of GMP call this "nails", where the "nail" is the unused portion of the word.

    In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
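
    As a concrete sketch (the 60-bit split is just an illustrative choice), such a representation might look like this, with the top 4 "nail" bits of every word acting as headroom for deferred carries:

        #include <cstdint>
        #include <vector>

        // Partial-word layout: base 2^60 limbs stored in 64-bit words.
        // The top 4 "nail" bits are normally zero; between normalization
        // passes they hold carries that have not been propagated yet.
        constexpr int      LIMB_BITS = 60;
        constexpr uint64_t LIMB_MASK = (uint64_t(1) << LIMB_BITS) - 1;

        struct PartialWordNum {
            // little-endian: value = sum of limbs[i] * 2^(60*i)
            std::vector<uint64_t> limbs;
        };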


    Problems with Full-Word Arithmetic:

    Vectorizing full-word arithmetic has historically been a lost cause:

    1. SSE/AVX2 has no support for carry-propagation.
    2. SSE/AVX2 has no 128-bit add/sub.
    3. SSE/AVX2 has no 64 x 64-bit integer multiply.*

    *AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.

    Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums (see the sketch after this list):

    • Add-with-Carry: adc, adcx, adox.
    • Double-word Multiply: Single-operand mul and mulx.
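
    For example, the double-word multiply is a one-liner in scalar code (a sketch using the BMI2 _mulx_u64 intrinsic; because mulx leaves the flags untouched, it interleaves freely with adc/adcx/adox carry chains):

        #include <immintrin.h>  // _mulx_u64 (requires BMI2)

        // 64 x 64 -> 128-bit multiply: returns the low half and writes
        // the high half, without clobbering the carry flags.
        static inline unsigned long long mul_64x64(unsigned long long a,
                                                   unsigned long long b,
                                                   unsigned long long* hi) {
            return _mulx_u64(a, b, hi);
        }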

    In light of this, it is difficult for SIMD to beat scalar x64 at either bignum-add or bignum-multiply. Definitely not with SSE or AVX.

    With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same length in each of the 4 SIMD lanes (see the transpose sketch below).

    AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.

    But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
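
    To make the data-layout point concrete, here is a sketch of that transposition (names are illustrative). After it, limb i of all four numbers sits in one AVX2 register, so a single vector instruction advances four independent multiplies by one step:

        #include <immintrin.h>
        #include <cstdint>
        #include <cstddef>

        // Transpose four equal-length bignums (little-endian 64-bit limbs)
        // into "vertical" form: out[i] holds limb i of all four numbers,
        // one number per AVX2 lane.
        void to_vertical(__m256i* out,
                         const uint64_t* a, const uint64_t* b,
                         const uint64_t* c, const uint64_t* d,
                         size_t n_limbs) {
            for (size_t i = 0; i < n_limbs; ++i)
                out[i] = _mm256_set_epi64x((long long)d[i], (long long)c[i],
                                           (long long)b[i], (long long)a[i]);
        }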


    Vectorization of Partial-Word Arithmetic

    With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.

    So as long as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.

    Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically- and horizontally-vectorized bignums.
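
    A sketch of such an add, reusing the base-2^60 layout from above (it assumes both inputs are normalized, so every lane's sum fits in 61 bits, well below the 64-bit lane width):

        #include <immintrin.h>
        #include <cstdint>
        #include <cstddef>

        // Partial-word add: no carry logic in the hot loop at all --
        // the nail bits absorb the per-limb carry-outs until a later
        // normalization pass.
        void add_partial(uint64_t* sum, const uint64_t* x,
                         const uint64_t* y, size_t n) {
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __m256i vx = _mm256_loadu_si256((const __m256i*)(x + i));
                __m256i vy = _mm256_loadu_si256((const __m256i*)(y + i));
                _mm256_storeu_si256((__m256i*)(sum + i),
                                    _mm256_add_epi64(vx, vy));
            }
            for (; i < n; ++i)  // scalar tail
                sum[i] = x[i] + y[i];
        }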

    Carryout on horizontally-vectorized bignums is still cheap, as you merely shift the nails over into the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to compare two numbers that are almost the same.
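
    A scalar sketch of that carryout pass (a SIMD version shifts the nails across lanes with permutes, but the idea is identical):

        #include <cstdint>
        #include <cstddef>

        // Push each limb's nail bits (everything above bit 59) into the
        // next limb. One linear pass; assumes the nails have not saturated
        // and the top limb still has headroom.
        void carry_out(uint64_t* limbs, size_t n) {
            constexpr int      LIMB_BITS = 60;
            constexpr uint64_t LIMB_MASK = (uint64_t(1) << LIMB_BITS) - 1;
            uint64_t carry = 0;
            for (size_t i = 0; i < n; ++i) {
                uint64_t t = limbs[i] + carry;
                limbs[i]   = t & LIMB_MASK;
                carry      = t >> LIMB_BITS;
            }
        }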

    Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.

    AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
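
    Sketched with the corresponding intrinsics (assuming AVX512-IFMA support; vpmadd52luq/vpmadd52huq accumulate the low and high 52 bits of each lane's 104-bit product):

        #include <immintrin.h>

        // One multiply-accumulate step on 52-bit limbs (AVX512-IFMA).
        // Each 64-bit lane of b and c holds a value < 2^52; the 104-bit
        // product is split and added into two accumulators.
        static inline void madd52_step(__m512i& acc_lo, __m512i& acc_hi,
                                       __m512i b, __m512i c) {
            acc_lo = _mm512_madd52lo_epu64(acc_lo, b, c);  // += lo52(b*c)
            acc_hi = _mm512_madd52hi_epu64(acc_hi, b, c);  // += hi52(b*c)
        }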


    Large Multiplication using FFTs

    For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).

    FFTs are completely vectorizable since they work on independent doubles. This is possible because, fundamentally, the representation that FFTs use is a partial-word representation.
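
    A minimal scalar sketch of the idea, with illustrative names (split into small limbs, convolve via a complex FFT, round, then carry; a real implementation vectorizes the FFT and manages round-off error far more carefully):

        #include <complex>
        #include <vector>
        #include <cmath>
        #include <cstdint>

        using cd = std::complex<double>;

        // Iterative in-place Cooley-Tukey FFT; invert=true gives the inverse.
        void fft(std::vector<cd>& a, bool invert) {
            const size_t n = a.size();
            for (size_t i = 1, j = 0; i < n; ++i) {  // bit-reversal permutation
                size_t bit = n >> 1;
                for (; j & bit; bit >>= 1) j ^= bit;
                j ^= bit;
                if (i < j) std::swap(a[i], a[j]);
            }
            const double PI = std::acos(-1.0);
            for (size_t len = 2; len <= n; len <<= 1) {
                double ang = 2 * PI / len * (invert ? -1 : 1);
                cd wlen(std::cos(ang), std::sin(ang));
                for (size_t i = 0; i < n; i += len) {
                    cd w(1);
                    for (size_t j = 0; j < len / 2; ++j) {
                        cd u = a[i + j], v = a[i + j + len / 2] * w;
                        a[i + j] = u + v;
                        a[i + j + len / 2] = u - v;
                        w *= wlen;
                    }
                }
            }
            if (invert)
                for (cd& z : a) z /= (double)n;
        }

        // Multiply two bignums stored little-endian in base 2^16.
        std::vector<uint32_t> fft_multiply(const std::vector<uint32_t>& x,
                                           const std::vector<uint32_t>& y) {
            size_t n = 1;
            while (n < x.size() + y.size()) n <<= 1;
            std::vector<cd> fx(x.begin(), x.end()), fy(y.begin(), y.end());
            fx.resize(n); fy.resize(n);
            fft(fx, false); fft(fy, false);
            for (size_t i = 0; i < n; ++i) fx[i] *= fy[i];  // pointwise product
            fft(fx, true);                                  // = convolution
            std::vector<uint32_t> result(n);
            uint64_t carry = 0;  // round each coefficient, then carry
            for (size_t i = 0; i < n; ++i) {
                uint64_t t = (uint64_t)std::llround(fx[i].real()) + carry;
                result[i] = (uint32_t)(t & 0xFFFF);
                carry = t >> 16;
            }
            return result;  // caller trims leading zero limbs
        }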


    To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.

    If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.

    But nevertheless, bignum arithmetic is possible to vectorize.


    Disclosure:

    I'm the author of y-cruncher which does plenty of large number arithmetic.