Can AVX-512 be used to calculate the multiplication of two 256-bit Integers on a 64 bit computer in C++? Whether using Assembly Language or not

Regarding arbitrary precision arithmetic: Can AVX-512 be used to calculate the multiplication of two 256-bit Integers on a 64-bit computer in C++? Is there an intrinsic Integer data type of 512 bits for the result, in a C++ compiler, that makes it possible?

If its not possible, can it be done in Assembly language? If so, can this Assembly Language routine be called from a C++ program? Could a 512-bit result from an Assembly Language routine be utilised by 64-bit Integer data types without using a String?

Solution

Yes and no.

No, AVX-512 doesn't have an instruction to multiply two 256-bit operands to get a 512-bit result. The largest operands you can multiply are 64 bits apiece, to get a 128-bit result.

So, to multiply 256-bit operands, you do a bit like grade-school multiplication. Except that you want to treat a 256-bit number as basically a 4-digit number in base 2⁶⁴. So, you have something like:

          D  C  B  A
          H  G  F  E
--------------------
         ED EC EB EA
      FD FC FB FA 0
   GD GC GB GA 0  0
HD HC HB HA 0  0  0
--------------------

[where HD means H*D, GB means G*B, and so on.]

So just like we did in grade-school, we multiply each digit of the one by each digit of the other, carry the upper digit of each to the next column, and add all the partial products together to get a final result.

The difference is that instead of each being a digit from 0 to 9, in this case each digit goes from 0 to 2⁶⁴-1.

But yes, since you have to carry out identical operations on each of multiple operands, it's perfectly reasonable to carry them out in parallel using vector instructions. But that's just doing a bunch of 64-bit multiplications in parallel, not working with any individual operand that's any bigger than 64 bits¹.

If you're working with really big numbers (e.g., millions of digits or more) there are better multiplication methods, such as Karatsuba multiplication and Schönhage-Strassen multiplication -- but 512 bits is almost certainly too small to benefit noticeably from things like that.

I should add, however, that although the idea of vectored execution is valid, it may be a lot of work for little or no gain, so unless you're doing a lot of these in a tight loop, it's open to a lot of question whether it'll be worthwhile.