Tags: c++, cuda, atomic, half-precision-float

atomicAdd half-precision floating-point (FP16) on CUDA Compute Capability 5.2


I am trying to atomically add a float value to a __half on a GPU with CUDA compute capability 5.2. This architecture does support the __half data type and its conversion functions, but it does not include any arithmetic or atomic operations for halves, such as atomicAdd().

I created the following atomicAdd() function wrapper with a special case for when half-precision arithmetic is unsupported (full example code):

__device__ void atomic_add(__half* a, float b) {
    #if __CUDA_ARCH__ >= 700 // compute capability 7.0+ supports fp16 atomicAdd
        atomicAdd(a, __float2half(b));
    #else
        atomicAdd(&__half2float(a), b); // Error: expression must be an lvalue
    #endif
}

atomicAdd(&__half2float(a), b); does not work, because __half2float(a) is not an lvalue. I could get an lvalue by creating a copy:

float a_tmp = __half2float(a);
atomicAdd(&a_tmp , b);
*a = __float2half(a_tmp);

But now the atomic function doesn't serve any purpose because I'm working on a copy of the value I actually want to modify atomically.

Is there another way that I haven't thought of in which I could perform this operation?


Solution

  • As it happens, compute capability 5.2 devices basically don't support 16-bit atomics of any type. There is some evidence of this in the programming guide, and furthermore, if you try to use a 16-bit (unsigned short) atomicCAS() on an architecture less than cc7.0, you will get a compile error - it's not supported, although that's not obvious from the programming guide. (Yes, I have already filed an internal bug 3845962 at NVIDIA to have the documentation improved in this respect.)

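    To see that compile error for yourself, a minimal probe like the following is enough. This is just a sketch; the unsigned short overload of atomicCAS() is only provided for __CUDA_ARCH__ >= 700, so building it with, e.g., -arch=sm_52 fails:

    __global__ void probe_16bit_cas(unsigned short *p){
      // the unsigned short overload of atomicCAS() exists only for cc7.0 and newer,
      // so this line does not compile for sm_52 and similar targets
      atomicCAS(p, (unsigned short)0, (unsigned short)1);
    }
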
    The programming guide does illustrate the general formula for atomicCAS()-based custom atomics, and we will use that recipe. However, the other "ingredient" is that we are going to have to realize this with a 32-bit atomic. Generally speaking, it is possible to use a larger atomic on a smaller data type - you just don't modify anything outside of the data region of interest.

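    For reference, the CAS-loop recipe from the programming guide (its double-precision atomicAdd() example for older devices) looks roughly like this; it is renamed here so it does not collide with the built-in:

    __device__ double atomicAdd_double(double *address, double val){
      unsigned long long int *address_as_ull = (unsigned long long int *)address;
      unsigned long long int old = *address_as_ull, assumed;
      do {
        assumed = old;
        // reinterpret the bits, do the add in the desired type, then attempt the swap
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
      } while (assumed != old); // retry if another thread changed the value in between
      return __longlong_as_double(old);
    }

    The __half version below follows the same pattern, except that the CAS operates on the aligned 32-bit word that contains the 16-bit value of interest.
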
    But one of the requirements that comes out of this approach is that you must make sure that the atomic access will be legal. This means that you must allocate in units of 32 bits (for the 32-bit atomic) even though the type of interest, __half, is only 16 bits.

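    For example, if you were allocating an array of N __half values that will be updated this way, you could round the allocation up to a multiple of 4 bytes. The helper below is hypothetical, purely to illustrate the sizing rule:

    #include <cuda_fp16.h>
    #include <cstddef>

    // hypothetical helper: allocate n __half values, padded so that the aligned 32-bit
    // word containing the last element lies entirely within the allocation
    __half *alloc_half_array_for_word_atomics(size_t n){
      __half *p = nullptr;
      size_t bytes = (n * sizeof(__half) + 3) & ~(size_t)3; // round up to a multiple of 4 bytes
      cudaMalloc(&p, bytes);
      return p;
    }
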
    With that proviso, the general methodology is the same as what is already covered in the programming guide and in other SO questions.

    The following is one possible approach:

    $ cat t2173.cu
    #include <cuda_fp16.h>
    #include <iostream>
    #include <cstdio>
    
    // this requires a full 32-bit allocation at the atomic address
    __device__ float my_float_half_atomicAdd(__half *a, float b){
    
      bool uplo = ((unsigned long long)a)&2;  // check if the atomic is for the upper or lower 16-bit quantity in the aligned 32-bit item
      unsigned *addr = reinterpret_cast<unsigned *>(((unsigned long long)a)&0xFFFFFFFFFFFFFFFCULL); // get the 32-bit aligned address
      unsigned old = *addr;
      unsigned val;
      do {
        val = old;
        // extract the 16-bit half of interest, convert to float, and add b
        float newval = __half2float(__ushort_as_half(uplo?((unsigned short)(val>>16)):((unsigned short)(val))))+b;
        unsigned short newval_s = __half_as_ushort(__float2half(newval));
        // splice the updated 16 bits back into the 32-bit word, leaving the other half untouched
        unsigned newval_u = val&(uplo?(0x0FFFFU):(0xFFFF0000U));
        newval_u |= uplo?(((unsigned)newval_s)<<16):(newval_s);
        // attempt the swap; retry if another thread modified the word in the meantime
        old = atomicCAS(addr, old, newval_u);
      } while (old != val);
      return __half2float(__ushort_as_half(uplo?(old>>16):(old)));
    }
    
    
    __device__ float f_h_atomic_add(__half* a, float b) {
        #if __CUDA_ARCH__ >= 700 // compute capability 7.0+ supports fp16 atomicAdd
            return __half2float(atomicAdd(a, __float2half(b)));
        #else
            return my_float_half_atomicAdd(a, b);
        #endif
    }
    
    __global__ void k(__half *a, float b){
      printf("%f\n", f_h_atomic_add(a, b));
    }
    
    
    int main(){
    
      __half *a;
      cudaMallocManaged(&a, 4); // must allocate 32-bit quantities
      memset(a, 0, 4);
      k<<<2,64>>>(a, 1.0f);
      cudaDeviceSynchronize();
      float val = __half2float(a[0]);
      std::cout << val << std::endl;
    }
    $ nvcc -arch=sm_35 -o t2173 t2173.cu
    nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
    $ CUDA_VISIBLE_DEVICES="1" cuda-memcheck ./t2173
    ========= CUDA-MEMCHECK
    0.000000
    1.000000
    2.000000
    3.000000
    8.000000
    9.000000
    10.000000
    11.000000
    16.000000
    17.000000
    18.000000
    19.000000
    24.000000
    25.000000
    26.000000
    27.000000
    4.000000
    5.000000
    6.000000
    7.000000
    12.000000
    13.000000
    14.000000
    15.000000
    20.000000
    21.000000
    22.000000
    23.000000
    28.000000
    29.000000
    30.000000
    31.000000
    32.000000
    33.000000
    34.000000
    35.000000
    40.000000
    41.000000
    42.000000
    43.000000
    48.000000
    49.000000
    50.000000
    51.000000
    57.000000
    58.000000
    59.000000
    60.000000
    36.000000
    37.000000
    38.000000
    39.000000
    44.000000
    45.000000
    46.000000
    47.000000
    52.000000
    53.000000
    54.000000
    56.000000
    61.000000
    62.000000
    63.000000
    64.000000
    89.000000
    90.000000
    91.000000
    55.000000
    65.000000
    66.000000
    67.000000
    68.000000
    73.000000
    74.000000
    75.000000
    76.000000
    81.000000
    82.000000
    83.000000
    84.000000
    92.000000
    93.000000
    94.000000
    95.000000
    69.000000
    70.000000
    71.000000
    72.000000
    77.000000
    78.000000
    79.000000
    80.000000
    85.000000
    86.000000
    87.000000
    88.000000
    123.000000
    124.000000
    125.000000
    126.000000
    99.000000
    100.000000
    101.000000
    102.000000
    107.000000
    108.000000
    109.000000
    110.000000
    115.000000
    116.000000
    117.000000
    118.000000
    127.000000
    96.000000
    97.000000
    98.000000
    103.000000
    104.000000
    105.000000
    106.000000
    111.000000
    112.000000
    113.000000
    114.000000
    119.000000
    120.000000
    121.000000
    122.000000
    128
    ========= ERROR SUMMARY: 0 errors
    $
    

    (With CUDA 11.4 at least, this methodology can work on devices all the way back to cc3.5, which is what is demonstrated above.)

    FP16 has fairly limited range and precision compared to FP32, so that is something to keep in mind when adding float quantities to __half values.
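
    As a quick illustration (using only the conversion intrinsics, so it also compiles for pre-cc5.3 targets): half has a 10-bit significand and a maximum finite value of 65504, so sums can stop growing or overflow sooner than you might expect:

    #include <cuda_fp16.h>
    #include <cstdio>

    __global__ void fp16_range_demo(){
      // integers above 2048 are no longer exactly representable in fp16,
      // so 2048 + 1 rounds back down to 2048 ...
      printf("%f\n", __half2float(__float2half(2048.0f + 1.0f)));  // prints 2048.000000
      // ... and anything much beyond 65504 overflows to infinity
      printf("%f\n", __half2float(__float2half(70000.0f)));        // prints inf
    }

    In the demonstration above the sum only reaches 128, but with many more additions of 1.0f the accumulated __half value would stop changing once it reached 2048.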