java bit-manipulation 64-bit biginteger multiplication

How can I handle 128 bit little endian multiplication in Java without resorting to BigInteger

I need to multiply two 8 byte (64 bit) arrays in the fastest way possible. The byte arrays are little endian. The arrays can be wrapped in a ByteBuffer and treated as little endian to easily resolve a java "long" value that correctly represents the bytes (but not the real nominal value since java longs are 2s compliment).

Java's standard way to handle large math is BigInteger. But that implementation is slow and unnecessary since im very strictly working with 64 bits x 64 bits. In addition, you can't throw the "long" value into one because the nominal value is incorrect, and I can't use the byte array directly because it's little endian. I need to be able to do this without having to use up more memory / CPU to reverse the array. This type of multiplication should be able to execute 1m+ times per second. BigInteger doesn't really come close to meeting that requirement anyway, so I'm trying to do it via splitting the high order bits from the low order bits, but I can't get it working consistently.

The high-order-bits-only code is only working for a subset of longs because even the intermediate addition can overflow. I got my current code from this answer....

high bits of long multiplication in Java?

Is there a more generic pattern for getting hi/lo order bits from 128 bit multiplication? That works for the largest long values?

Edit:

FWIW I'm prepared for the answer to be.. "cant do that in java, do it in c++ and call via JNI". Though I'm hoping someone can give a java solution before it comes to that.

Solution

As of Java 9 (which was a bit too new at the time this question was asked), there is now a trivial way to get the upper half of the 128-bit product of two signed 64-bit integers: Math.multiplyHigh

There is a relatively simple conversion from "upper half of signed product" to "upper half unsigned product" (see Hacker's Delight chapter 8), which can be used to implement an unsigned multiply high like this:

static long multiplyHighUnsigned(long x, long y) {
    long signedUpperHalf = Math.multiplyHigh(x, y);
    return signedUpperHalf + ((x >> 63) & y) + ((y >> 63) & x);
}

This has the potential to be more efficient (on platforms on which multiplyHigh is treated as an intrinsic function by the JIT) than the more manual approach used by the old answer, which I will leave below the line.

It can be done manually without BigInteger by splitting the longs up into two halves, creating the partial products, and then summing them up. Naturally the low half of the sum can be left out.

The partial products overlap, like this:

  LL
 LH
 HL
HH

So the high halves of LH and HL must be added to the high result, and furthermore the low halves of LH and HL together with the high half of LL may carry into the bits of the high half of the result. The low half of LL is not used.

So something like this (only slightly tested):

static long hmul(long x, long y) {
    long m32 = 0xffffffffL;
    // split
    long xl = x & m32;
    long xh = x >>> 32;
    long yl = y & m32;
    long yh = y >>> 32;
    // partial products
    long t00 = xl * yl;
    long t01 = xh * yl;
    long t10 = xl * yh;
    long t11 = xh * yh;
    // resolve sum and carries
    // high halves of t10 and t01 overlap with the low half of t11
    t11 += (t10 >>> 32) + (t01 >>> 32);
    // the sum of the low halves of t10 + t01 plus
    // the high half of t00 may carry into the high half of the result
    long tc = (t10 & m32) + (t01 & m32) + (t00 >>> 32);
    t11 += tc >>> 32;
    return t11;
}

This of course treats the input as unsigned, which does not mean they have to be positive in the sense that Java would treat them as positive, you can absolutely input -1501598000831384712L and -735932670715772870L and the right answer comes out, as confirmed by wolfram alpha.

If you are prepared to interface with native code, in C++ with MSVC you could use __umulh, and with GCC/Clang you can make the product as an __uint128_t and just shift it right, the codegen for that is actually fine, it doesn't cause a full 128x128 multiply.