Search code examples
cudaptxcarryflag

CUDA - PTX carry propagation


I want to add two 32-bit unsigned integers in CUDA PTX and I also want to take care of the carry propagation. I am using the code below to do that, but the result is not as expected.
Acording to the documentation, the add.cc.u32 d, a, b performs integer addition and writes the carry-out value into the condition code register, that is CC.CF.
On the other hand, addc.cc.u32 d, a, b performs integer addition with carry-in and writes the carry-out value into the condition code register. The semantics of this instruction would be
d = a + b + CC.CF. I also tryed addc.u32 d, a, b with no difference.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>
#include "device_launch_parameters.h"
#include <cuda.h>

typedef unsigned int u32;
#define TRY_CUDA_CALL(x) \
do \
  { \
    cudaError_t err; \
    err = x; \
    if(err != cudaSuccess) \
  { \
    printf("Error %08X: %s at %s in line %d\n", err, cudaGetErrorString(err), __FILE__, __LINE__); \
    exit(err); \
  } \
} while(0)


__device__ u32
__uaddo(u32 a, u32 b) {
    u32 res;
    asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__device__ u32
__uaddc(u32 a, u32 b) {
    u32 res;
    asm("addc.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__global__ void testing(u32* s)
{
    u32 a, b;

    a = 0xffffffff;
    b = 0x2;
    
    s[0] = __uaddo(a,b);
    s[0] = __uaddc(0,0);

}

int main()
{
    u32 *s_dev;
    u32 *s;
    s = (u32*)malloc(sizeof(u32));
    TRY_CUDA_CALL(cudaMalloc((void**)&s_dev, sizeof(u32)));
    testing<<<1,1>>>(s_dev);
    TRY_CUDA_CALL( cudaMemcpy(s, s_dev, sizeof(u32), cudaMemcpyDeviceToHost) );
    
    printf("s = %d;\n",s[0]);
    
    
    return 1;
}

As far as I know, you get a carry if the result doesn't fit in the variable, which happens here and an overflow if the sign bit is corrupted, but I'm working with unsigned values.
The code above tries to add 0xFFFFFFFF to 0x2 and of course the result won't fit on 32-bit, so why I don't get a 1 after __uaddc(0,0) call?

EDIT

Nvidia Geforce GT 520mx
Windows 7 Ultimate, 64-bit
Visual Studio 2012
CUDA 7.0


Solution

  • The only data dependencies affecting an asm() statement are those that are explicitly expressed by the variable bindings. Note that you can bind register operands, but not the condition codes. Since in this code the result of __uaddo(a, b) is immediately being overwritten, the compiler determines that it does not contribute to the observable results, is therefore "dead code" and can be eliminated. This is easily checked by examining the generated machine code (SASS) for a release build with cuobjdump --dump-sass.

    If we had slightly different code that does not allow the compiler to eliminate the code for __uaddo() outright, there would still be the issue that the compiler can schedule any instructions it likes between the code generated for __uaddo() and __uaddc(), and such instructions could destroy any setting of the carry flag due to __uaddo().

    As a consequence, if one plans to use the carry flag for multi-word arithmetic, both carry-generating and carry-consuming instructions must occur in the same asm() statement. A worked example can be found in this answer that shows how to add 128-bit operands. Alternatively, if two separate asm() statements must be used, one could export the carry flag setting from the earlier one into a C variable, then import it into the subsequent asm() statement from there. I can't think of many situations where this would be practical, as the performance advantage of using the carry flag is likely lost.