I think I have discovered a problem with signed 64 x 64 -> 128-bit multiplication in CUDA PTX. Here is my sample code:
long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1; \n\t" // 0 * -1 = 0
" mul.hi.s64 %1, 0, -1; \n\t"
: "=l"(result_lo), "=l"(result_hi));
This should produce result_lo = 0x0, result_hi = 0x0. However, it produces result_lo = 0x0, result_hi = 0xFFFFFFFFFFFFFFFF, which, read as a 128-bit two's-complement value, is -2^64 if I'm not mistaken, and clearly not zero.
First off, I want to make sure my understanding is correct; more importantly, is there a way around this?
Update: Changing from Debug mode to Release mode fixes this issue. I am still wondering whether this is a bug in CUDA.
Update 2: Reported this bug to NVIDIA.
Used CUDA Toolkit 7.5 with Visual Studio 2013, x64 Debug, sm_52, compute_52.
TL;DR: This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
For the Maxwell architecture (that is, sm_5x), a high-throughput but lower-width integer multiply-add instruction, XMAD, was introduced, although a low-throughput legacy 32-bit integer multiply, IMUL, was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dumpsass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
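To illustrate what "emulating" a 64-bit multiply on 32-bit hardware involves, here is a minimal host-side C++ sketch that computes the upper 64 bits of an unsigned 64 x 64-bit product from four 32 x 32 -> 64-bit partial products. The function name umul64hi_sketch is hypothetical, and this is not the instruction sequence that ptxas actually generates with IMAD, IMUL, or XMAD; it only shows the general shape of such a decomposition.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only (not NVIDIA's emulation code):
// upper 64 bits of the 128-bit product a * b, built from 32x32->64 multiplies.
inline uint64_t umul64hi_sketch(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFull;
    const uint64_t a_lo = a & mask, a_hi = a >> 32;
    const uint64_t b_lo = b & mask, b_hi = b >> 32;

    // Four partial products; each operand is below 2^32, so none overflows.
    const uint64_t p00 = a_lo * b_lo;  // weight 2^0
    const uint64_t p01 = a_lo * b_hi;  // weight 2^32
    const uint64_t p10 = a_hi * b_lo;  // weight 2^32
    const uint64_t p11 = a_hi * b_hi;  // weight 2^64

    // Everything that lands in bits 32..63 of the product; the part of this
    // sum above bit 31 is the carry into the upper 64 bits.
    const uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);

    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

For example, umul64hi_sketch(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF) yields 0xFFFFFFFFFFFFFFFE, since (2^64 - 1)^2 = 2^128 - 2^65 + 1.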
As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.
From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
asm(" mul.hi.s64 %0, %1, %2; \n\t"
: "=l"(result)
: "l"(m1), "l"(m2));
#else
result = __umul64hi (m1, m2);
if (m1 < 0LL) result -= m2;
if (m2 < 0LL) result -= m1;
#endif
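The correction works because reinterpreting a negative 64-bit operand as unsigned adds 2^64 to it, so the unsigned high word exceeds the signed one by m2 when m1 is negative and by m1 when m2 is negative (modulo 2^64). The host-side C++ sketch below checks that reasoning without a GPU. The names umul64hi_host and mul64hi_from_unsigned are hypothetical stand-ins; umul64hi_host is assumed to mimic __umul64hi() and is not the CUDA intrinsic itself.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for __umul64hi(): upper 64 bits of a * b.
inline uint64_t umul64hi_host(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFull;
    const uint64_t p00 = (a & mask) * (b & mask);
    const uint64_t p01 = (a & mask) * (b >> 32);
    const uint64_t p10 = (a >> 32) * (b & mask);
    const uint64_t p11 = (a >> 32) * (b >> 32);
    const uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// Signed high multiply derived from the unsigned one, using the same two
// corrections as the workaround above. The arithmetic is done in uint64_t so
// that wraparound is well defined; the final cast relies on two's-complement
// conversion (guaranteed since C++20, universal in practice before that).
inline int64_t mul64hi_from_unsigned(int64_t m1, int64_t m2) {
    uint64_t hi = umul64hi_host(static_cast<uint64_t>(m1),
                                static_cast<uint64_t>(m2));
    if (m1 < 0) hi -= static_cast<uint64_t>(m2);
    if (m2 < 0) hi -= static_cast<uint64_t>(m1);
    return static_cast<int64_t>(hi);
}
```

With the operands from the question, mul64hi_from_unsigned(0, -1) returns 0, which is the result the broken debug-build emulation fails to produce.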