OpenCL 128bit multiplication result

I need to multiply two unsigned 64-bit integers (unsigned long) inside an OpenCL kernel; the result is a 128-bit integer (unsigned long long).

Newer versions of OpenCL seem to support this type.

unsigned long m1, m2;
.
.
unsigned long long result = m1 * (unsigned long long)m2;

This code works, but it is quite slow. It is essentially a 64-bit by 128-bit multiplication, and I only need 64-bit by 64-bit.

Is there a way to set the result type of a multiplication, without converting one multiplicand to 128 bit?

Solution

A decent compiler should notice your 64->128bit upcast and not produce any machine code for the zeroed high source bits.

However, GPUs tend to be quite slow at large integer multiplication. For example, according to the latest information I'm aware of, AMD's GCN GPUs multiply floats 5 times faster than they multiply 32x32-bit integers. I suspect that figure covers only the low 32 bits of the result: getting the high 32 bits is a separate instruction, so the full 64-bit result is presumably even slower.
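To make the "low half and high half are separate operations" point concrete, here is a minimal sketch in plain C (host-side, so it can be compiled and checked without a device) that builds the 128-bit product from four 32x32 -> 64-bit partial products. The type `u128_parts` and the function `mul_64x64_128` are hypothetical names of mine, not part of OpenCL. On a device, the OpenCL C built-in `mul_hi(m1, m2)` together with the ordinary `m1 * m2` should give the same two halves without any 128-bit type, though I haven't measured how fast that is on any particular GPU.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value stored as two 64-bit halves. */
typedef struct {
    uint64_t lo;  /* bits 0..63   */
    uint64_t hi;  /* bits 64..127 */
} u128_parts;

/* Full 64x64 -> 128-bit product from four 32x32 -> 64-bit partial products. */
static u128_parts mul_64x64_128(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p00 = a_lo * b_lo;  /* weight 2^0  */
    uint64_t p01 = a_lo * b_hi;  /* weight 2^32 */
    uint64_t p10 = a_hi * b_lo;  /* weight 2^32 */
    uint64_t p11 = a_hi * b_hi;  /* weight 2^64 */

    /* Sum of everything that lands in bits 32..63; at most 3 * (2^32 - 1),
       so it cannot overflow 64 bits. Its upper part is the carry into hi. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    u128_parts r;
    r.lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

Note that neither operand is ever widened to 128 bits here; only the result is split across two 64-bit words.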

Most GPUs these days are much faster at working with 24-bit integers (5 times as fast in the case of the aforementioned AMD GPUs). I wonder if you might be able to decompose your 64-bit integers into three 24-bit words (or even two, if your values fit in 48 bits) and implement the long multiplication by hand, possibly via Karatsuba's or a similar algorithm. I'm not sure which will work best, as mul, add, and mad tend to be equally fast on GPUs.

Getting at the high 16 bits of each 24x24-bit multiplication will be the hard part, though: OpenCL doesn't appear to give you access to them via a dedicated function, unlike the low 32 bits, which you get from mul24. If you're targeting one or more specific OpenCL implementations, it might be possible to hand-write assembly language for those GPUs.
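The 24-bit decomposition can be sketched as follows, again in plain C on the host so the arithmetic can be checked. This is schoolbook long multiplication, not Karatsuba, and the names `wide_product`, `split24`, and `mul_64x64_limbs24` are hypothetical. One important assumption: each limb product is computed here with a 64-bit multiply so that all 48 bits are available. On a GPU, mul24 would only supply the low 32 bits of that product, so the upper 16 bits would have to come from somewhere else, which is exactly the hard part mentioned above.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value stored as two 64-bit halves. */
typedef struct {
    uint64_t lo;  /* bits 0..63   */
    uint64_t hi;  /* bits 64..127 */
} wide_product;

/* Split a 64-bit value into limbs of 24, 24 and 16 bits (least significant first). */
static void split24(uint64_t x, uint32_t limb[3])
{
    limb[0] = (uint32_t)(x & 0xFFFFFFu);
    limb[1] = (uint32_t)((x >> 24) & 0xFFFFFFu);
    limb[2] = (uint32_t)(x >> 48);
}

/* Schoolbook 64x64 -> 128-bit multiplication on 24-bit limbs. */
static wide_product mul_64x64_limbs24(uint64_t a, uint64_t b)
{
    uint32_t al[3], bl[3];
    uint64_t col[6] = {0, 0, 0, 0, 0, 0};  /* column k has weight 2^(24*k) */
    uint64_t carry = 0;
    wide_product r;
    int i, j, k;

    split24(a, al);
    split24(b, bl);

    /* Accumulate the nine partial products. Each is below 2^48 and at most
       three land in one column, so a 64-bit accumulator cannot overflow.
       ASSUMPTION: the 64-bit multiply stands in for a full 48-bit limb product. */
    for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
            col[i + j] += (uint64_t)al[i] * bl[j];

    /* Propagate carries so that every column holds exactly 24 bits. */
    for (k = 0; k < 6; k++) {
        col[k] += carry;
        carry = col[k] >> 24;
        col[k] &= 0xFFFFFFu;
    }

    /* Reassemble: column 2 straddles the 64-bit boundary (16 bits low, 8 bits high). */
    r.lo = col[0] | (col[1] << 24) | (col[2] << 48);
    r.hi = (col[2] >> 16) | (col[3] << 8) | (col[4] << 32) | (col[5] << 56);
    return r;
}
```

If your values fit in 48 bits, the top limb is always zero and the nine partial products shrink to four, which is where this approach looks most attractive.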