Cuda signed 128-bit multiplication error

I think I discovered a problem when doing 128-bit signed multiplication in cuda PTX using signed integers. Here is my sample code:

long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1;     \n\t" // 0 * -1 = 0
    " mul.hi.s64 %1, 0, -1;     \n\t"
    : "=l"(result_lo), "=l"(result_hi));

This should produce the result result_lo = 0x0, result_hi = 0x0. However this produces the result: result_lo = 0x0, result_hi = 0xFFFFFFFFFFFFFFFF which is actualy the value 2^127 - (2^126 - 1) if I'm not mistaken and clearly not zero.

First off, I want to make sure my understanding is correct, but moreso, is there a way around this?

Update Changing from Debug mod to Release mode fixes this issue, still wondering if this is a bug in cuda?

Update 2 Reported this bug to NVIDIA

Used Cuda toolkit 7.5 with Visual Studio 2013. x64 Debug, sm_52, compute_52.

Solution

TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.

Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.

For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multipy IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dumpsass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.

As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.

From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.

    long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
    asm(" mul.hi.s64 %0, %1, %2;     \n\t"
        : "=l"(result)
        : "l"(m1), "l"(m2));
#else
    result = __umul64hi (m1, m2);
    if (m1 < 0LL) result -= m2;
    if (m2 < 0LL) result -= m1;
#endif