I'm working with CUDA (GPGPU programming) for some research, and the native double-precision performance suffers compared to the single-precision performance (by a factor of 24!) due to the hardware architecture. I've decided to try using two uints to represent one double. This way I could run DP calculations without a significant performance hit.
For example, let's say we want to represent the double 10.12 using this method.
uint real = 10, decimal = 12;
Therefore, 'real.decimal' visually represents our double.
Let's call this type a 'single'.
Given: single a = 10.12, b = 20.24;
What would be an efficient algorithm to multiply, divide, add, and subtract two singles?
single c = a * b;
Please remember, for this to work, it's not possible to do ANY DP calculations or use a data type larger than 32 bits.
If you want to replace double precision, you could look into using a "double single" representation, where the operands are pairs of single-precision numbers referred to as the "head" and the "tail", which satisfy the normalization requirement that |tail| <= 0.5 * ulp(|head|). CUDA's float2 is the appropriate type to hold the pairs of floats needed. For example, the less significant x-component would represent the "tail", and the more significant y-component the "head".
Note that this does not provide exactly the same accuracy as double precision. The precision of this representation is limited to 49 bits (24 bits from each of the single-precision mantissas, plus 1 bit from the sign of the tail) vs 53 bits for double precision. The operations won't provide IEEE rounding, and the range may also be reduced somewhat due to overflow of temporary quantities inside the multiplication code. Subnormal quantities may also not be accurately representable.
I don't think the performance will be significantly better than using the native double-precision operations of your GPU, as register pressure will be higher and the number of single-precision operations required for each "double single" operation is fairly substantial. You could of course just try and measure the performance for both variants.
Below is an example of an addition in "double single" format. The source of the algorithm is noted in the comment. I believe that the cited work is in turn based on the work of D. Priest who did research in this subject matter in the late 1980s or early 1990s. For a worked example of a "double single" multiplication, you could take a look at the function __internal_dsmul() in the file math_functions.h that ships with CUDA.
A quick remark on 64-bit integer operations in CUDA (as one commenter noted this as a potential alternative). Current GPU hardware has very limited support for native 64-bit integer operations, basically providing conversion from and to floating-point types, and indexing using 64-bit addresses in loads and stores. Otherwise 64-bit integer operations are implemented via native 32-bit operations, as one can easily see by looking at disassembled code (cuobjdump --dump-sass).
/* Based on: Andrew Thall, Extended-Precision Floating-Point Numbers for GPU
Computation. Retrieved from http://andrewthall.org/papers/df64_qf128.pdf
on 7/12/2011.
*/
__device__ float2 dsadd (float2 a, float2 b)
{
    float2 z;
    float t1, t2, t3, t4, t5, e;
    /* two-sum of the heads: t1 = sum, t3 = rounding error */
    t1 = __fadd_rn (a.y, b.y);
    t2 = __fadd_rn (t1, -a.y);
    t3 = __fadd_rn (__fadd_rn (a.y, t2 - t1), __fadd_rn (b.y, -t2));
    /* two-sum of the tails: t4 = sum, t5 = rounding error */
    t4 = __fadd_rn (a.x, b.x);
    t2 = __fadd_rn (t4, -a.x);
    t5 = __fadd_rn (__fadd_rn (a.x, t2 - t4), __fadd_rn (b.x, -t2));
    /* combine the partial sums and errors, then renormalize */
    t3 = __fadd_rn (t3, t4);
    t4 = __fadd_rn (t1, t3);
    t3 = __fadd_rn (t1 - t4, t3);
    t3 = __fadd_rn (t3, t5);
    z.y = e = __fadd_rn (t4, t3);
    z.x = __fadd_rn (t4 - e, t3);
    return z;
}