Tags: c++, cuda, atomic, half-precision-float

atomicAdd half-precision floating-point (FP16) on CUDA Compute Capability 5.2


I am trying to atomically add a float value to a __half on a GPU with CUDA compute capability 5.2. This architecture does support the __half data type and its conversion functions, but it does not include any arithmetic or atomic operations for halves, such as atomicAdd().

I created the following atomicAdd() function wrapper with a special case for when half-precision arithmetic is unsupported (full example code):

__device__ void atomic_add(__half* a, float b) {
    #if __CUDA_ARCH__ >= 700 // compute capability 7.0+ supports fp16 atomicAdd
        atomicAdd(a, __float2half(b));
    #else
        atomicAdd(&__half2float(a), b); // Error: expression must be an lvalue
    #endif
}

atomicAdd(&__half2float(a), b); does not work, because __half2float(a) is not an lvalue. I could get an lvalue by creating a copy:

float a_tmp = __half2float(a);
atomicAdd(&a_tmp , b);
*a = __float2half(a_tmp);

But now the atomic function doesn't serve any purpose because I'm working on a copy of the value I actually want to modify atomically.

Is there another way that I haven't thought of in which I could perform this operation?


Solution

  • As it happens, compute capability 5.2 devices basically don't support 16-bit atomics of any type. There is some evidence of this in the programming guide, and furthermore, if you try to use a 16-bit (unsigned short) atomicCAS() on an architecture less than cc7.0, you will get a compile error - it's not supported, although that's not obvious from the programming guide. (Yes, I have already filed an internal bug 3845962 at NVIDIA to have the documentation improved in this respect.)

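    To see that compile error for yourself, a minimal probe like the following is enough. This is just a sketch; the unsigned short overload of atomicCAS() is only provided for __CUDA_ARCH__ >= 700, so building it with, e.g., -arch=sm_52 fails:

    __global__ void probe_16bit_cas(unsigned short *p){
      // the unsigned short overload of atomicCAS() exists only for cc7.0 and newer,
      // so this line does not compile for sm_52 and similar targets
      atomicCAS(p, (unsigned short)0, (unsigned short)1);
    }
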
    The programming guide does illustrate the general formula for atomicCAS()-based custom atomics, and we will use that recipe. However, the other "ingredient" is that we are going to have to realize this with a 32-bit atomic. Generally speaking, it is possible to use a larger atomic on a smaller data type - you just don't modify anything outside of the data region of interest.

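    For reference, the CAS-loop recipe from the programming guide (its double-precision atomicAdd() example for older devices) looks roughly like this; it is renamed here so it does not collide with the built-in:

    __device__ double atomicAdd_double(double *address, double val){
      unsigned long long int *address_as_ull = (unsigned long long int *)address;
      unsigned long long int old = *address_as_ull, assumed;
      do {
        assumed = old;
        // reinterpret the bits, do the add in the desired type, then attempt the swap
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
      } while (assumed != old); // retry if another thread changed the value in between
      return __longlong_as_double(old);
    }

    The __half version below follows the same pattern, except that the CAS operates on the aligned 32-bit word that contains the 16-bit value of interest.
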
    But one of the requirements that comes out of this approach is that you must make sure that the atomic access will be legal. This means that you must allocate in units of 32 bits (for the 32-bit atomic) even though the type of interest, __half, is only 16 bits.

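    For example, if you were allocating an array of N __half values that will be updated this way, you could round the allocation up to a multiple of 4 bytes. The helper below is hypothetical, purely to illustrate the sizing rule:

    #include <cuda_fp16.h>
    #include <cstddef>

    // hypothetical helper: allocate n __half values, padded so that the aligned 32-bit
    // word containing the last element lies entirely within the allocation
    __half *alloc_half_array_for_word_atomics(size_t n){
      __half *p = nullptr;
      size_t bytes = (n * sizeof(__half) + 3) & ~(size_t)3; // round up to a multiple of 4 bytes
      cudaMalloc(&p, bytes);
      return p;
    }
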
    With that proviso, the general methodology is the same as what is already covered in the programming guide and in other SO questions.

    The following is one possible approach:

    $ cat t2173.cu
    #include <cuda_fp16.h>
    #include <iostream>
    #include <cstdio>
    
    // this requires a full 32-bit allocation at the atomic address
    __device__ float my_float_half_atomicAdd(__half *a, float b){
    
      bool uplo = ((unsigned long long)a)&2;  // check if the atomic is for the upper or lower 16-bit quantity in the aligned 32-bit item
      unsigned *addr = reinterpret_cast<unsigned *>(((unsigned long long)a)&0xFFFFFFFFFFFFFFFCULL); // get the 32-bit aligned address
      unsigned old = *addr;
      unsigned val;
      do {
        val = old;
        // extract the 16-bit half of interest, convert to float, and add b
        float newval = __half2float(__ushort_as_half(uplo?((unsigned short)(val>>16)):((unsigned short)(val))))+b;
        unsigned short newval_s = __half_as_ushort(__float2half(newval));
        // splice the updated 16 bits back into the 32-bit word, leaving the other half untouched
        unsigned newval_u = val&(uplo?(0x0FFFFU):(0xFFFF0000U));
        newval_u |= uplo?(((unsigned)newval_s)<<16):(newval_s);
        // attempt the swap; retry if another thread modified the word in the meantime
        old = atomicCAS(addr, old, newval_u);
      } while (old != val);
      return __half2float(__ushort_as_half(uplo?(old>>16):(old)));
    }
    
    
    __device__ float f_h_atomic_add(__half* a, float b) {
        #if __CUDA_ARCH__ >= 700 // compute capability 7.0+ supports fp16 atomicAdd
            return __half2float(atomicAdd(a, __float2half(b)));
        #else
            return my_float_half_atomicAdd(a, b);
        #endif
    }
    
    __global__ void k(__half *a, float b){
      printf("%f\n", f_h_atomic_add(a, b));
    }
    
    
    int main(){
    
      __half *a;
      cudaMallocManaged(&a, 4); // must allocate 32-bit quantities
      memset(a, 0, 4);
      k<<<2,64>>>(a, 1.0f);
      cudaDeviceSynchronize();
      float val = __half2float(a[0]);
      std::cout << val << std::endl;
    }
    $ nvcc -arch=sm_35 -o t2173 t2173.cu
    nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
    $ CUDA_VISIBLE_DEVICES="1" cuda-memcheck ./t2173
    ========= CUDA-MEMCHECK
    0.000000
    1.000000
    2.000000
    3.000000
    8.000000
    9.000000
    10.000000
    11.000000
    16.000000
    17.000000
    18.000000
    19.000000
    24.000000
    25.000000
    26.000000
    27.000000
    4.000000
    5.000000
    6.000000
    7.000000
    12.000000
    13.000000
    14.000000
    15.000000
    20.000000
    21.000000
    22.000000
    23.000000
    28.000000
    29.000000
    30.000000
    31.000000
    32.000000
    33.000000
    34.000000
    35.000000
    40.000000
    41.000000
    42.000000
    43.000000
    48.000000
    49.000000
    50.000000
    51.000000
    57.000000
    58.000000
    59.000000
    60.000000
    36.000000
    37.000000
    38.000000
    39.000000
    44.000000
    45.000000
    46.000000
    47.000000
    52.000000
    53.000000
    54.000000
    56.000000
    61.000000
    62.000000
    63.000000
    64.000000
    89.000000
    90.000000
    91.000000
    55.000000
    65.000000
    66.000000
    67.000000
    68.000000
    73.000000
    74.000000
    75.000000
    76.000000
    81.000000
    82.000000
    83.000000
    84.000000
    92.000000
    93.000000
    94.000000
    95.000000
    69.000000
    70.000000
    71.000000
    72.000000
    77.000000
    78.000000
    79.000000
    80.000000
    85.000000
    86.000000
    87.000000
    88.000000
    123.000000
    124.000000
    125.000000
    126.000000
    99.000000
    100.000000
    101.000000
    102.000000
    107.000000
    108.000000
    109.000000
    110.000000
    115.000000
    116.000000
    117.000000
    118.000000
    127.000000
    96.000000
    97.000000
    98.000000
    103.000000
    104.000000
    105.000000
    106.000000
    111.000000
    112.000000
    113.000000
    114.000000
    119.000000
    120.000000
    121.000000
    122.000000
    128
    ========= ERROR SUMMARY: 0 errors
    $
    

    (With CUDA 11.4 at least, this methodology can work on devices all the way back to cc3.5, which is what is demonstrated above.)

    FP16 has fairly limited range and precision compared to FP32, so that is something to keep in mind when adding float quantities to __half values.
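
    As a quick illustration (using only the conversion intrinsics, so it also compiles for pre-cc5.3 targets): half has a 10-bit significand and a maximum finite value of 65504, so sums can stop growing or overflow sooner than you might expect:

    #include <cuda_fp16.h>
    #include <cstdio>

    __global__ void fp16_range_demo(){
      // integers above 2048 are no longer exactly representable in fp16,
      // so 2048 + 1 rounds back down to 2048 ...
      printf("%f\n", __half2float(__float2half(2048.0f + 1.0f)));  // prints 2048.000000
      // ... and anything much beyond 65504 overflows to infinity
      printf("%f\n", __half2float(__float2half(70000.0f)));        // prints inf
    }

    In the demonstration above the sum only reaches 128, but with many more additions of 1.0f the accumulated __half value would stop changing once it reached 2048.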