Tags: c++, cuda, thrust

fp16 support in cuda thrust


I am not able to find anything about fp16 support in the thrust CUDA template library. Even the roadmap page has nothing about it: https://github.com/thrust/thrust/wiki/Roadmap

But I assume somebody has probably figured out how to overcome this problem, since fp16 support in CUDA has been around for more than 6 months.

As of today, I rely heavily on thrust in my code and have templated nearly every class I use in order to ease fp16 integration. Unfortunately, absolutely nothing works out of the box for the half type, not even this simple sample code:

//STL
#include <iostream>
#include <cstdlib>

//Cuda
#include <cuda_runtime_api.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_fp16.h>
#define T half // works when float is used instead

int main(int argc, char* argv[])
{
        thrust::device_vector<T> a(10,1.0f);
        float t = thrust::reduce( a.cbegin(),a.cend(),(float)0);
        std::cout<<"test = "<<t<<std::endl;
        return EXIT_SUCCESS;
}

This code does not compile because there seems to be no implicit conversion from float to half or from half to float. However, there are intrinsics in CUDA that allow for an explicit conversion.
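For example, something like the following seems possible in device code (scale_half is just a made-up kernel for illustration; __half2float and __float2half are the actual intrinsics from cuda_fp16.h):

//Cuda
#include <cuda_fp16.h>

// Made-up example kernel: explicit half <-> float conversion with the
// cuda_fp16.h intrinsics inside device code.
__global__ void scale_half(half* data, float factor, int n)
{
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
                float f = __half2float(data[i]);    // half -> float
                data[i] = __float2half(f * factor); // float -> half
        }
}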

Why can't I simply overload the half and float constructors in some CUDA header file to wire up those intrinsics, like this:

float::float( half a )
{
  return  __half2float( a ) ;
}

half::half( float a )
{
  return  __float2half( a ) ;
}

My question may seem basic, but I don't understand why I haven't found much documentation about it.

Thank you in advance


Solution

The very short answer is that what you are looking for doesn't exist.

The slightly longer answer is that thrust is intended to work on fundamental and POD types only, and the CUDA fp16 half is not a POD type. It might be possible to make two custom classes (one for the host and one for the device) which implement all the required object semantics and arithmetic operators to work correctly with thrust, but it would be a non-trivial effort to do so (and it would require writing or adapting an existing FP16 host library); a rough sketch of the idea is given below.

Note also that the current FP16 support is only in device code, and only on compute capability 5.3 and newer devices. So unless you have a Tegra TX1, you can't use the FP16 library in device code anyway.
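For illustration only, here is a minimal sketch of what such a wrapper could look like, under two loud assumptions: my_half is a made-up type (not part of CUDA or thrust), and float_to_half_host / half_to_float_host are simplified, truncating placeholders standing in for a real FP16 host library (no rounding, subnormals flushed to zero). On the device side it uses the __float2half/__half2float intrinsics from cuda_fp16.h.

//STL
#include <cstring>
#include <iostream>

//Cuda
#include <cuda_fp16.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Hypothetical host-side conversions standing in for a real FP16 host
// library: truncating, no rounding, subnormals flushed to zero.
inline unsigned short float_to_half_host(float f)
{
        unsigned int x;
        std::memcpy(&x, &f, sizeof(x));
        unsigned short sign = (x >> 16) & 0x8000;
        int exp = ((x >> 23) & 0xff) - 127 + 15;
        if (exp <= 0)  return sign;          // underflow -> signed zero
        if (exp >= 31) return sign | 0x7c00; // overflow  -> infinity
        return sign | (exp << 10) | ((x & 0x7fffff) >> 13);
}

inline float half_to_float_host(unsigned short h)
{
        unsigned int sign = (unsigned int)(h & 0x8000) << 16;
        int exp = (h >> 10) & 0x1f;
        unsigned int mant = h & 0x3ff;
        unsigned int x;
        if (exp == 0)       x = sign;                             // zero (subnormals flushed)
        else if (exp == 31) x = sign | 0x7f800000 | (mant << 13); // inf/NaN
        else                x = sign | ((unsigned int)(exp - 15 + 127) << 23) | (mant << 13);
        float f;
        std::memcpy(&f, &x, sizeof(f));
        return f;
}

// Made-up wrapper: stores the raw 16 bits and converts through float
// on demand, using the device intrinsics when compiled for the device.
struct my_half
{
        unsigned short bits;

        my_half() = default;

        __host__ __device__ my_half(float f)
        {
#ifdef __CUDA_ARCH__
                half h = __float2half(f);     // device intrinsic
                memcpy(&bits, &h, sizeof(bits));
#else
                bits = float_to_half_host(f); // hypothetical host helper
#endif
        }

        __host__ __device__ operator float() const
        {
#ifdef __CUDA_ARCH__
                half h;
                memcpy(&h, &bits, sizeof(h));
                return __half2float(h);          // device intrinsic
#else
                return half_to_float_host(bits); // hypothetical host helper
#endif
        }
};

int main()
{
        // Accumulate in float so the reduction itself never performs
        // half arithmetic (and hence avoids the compute 5.3 requirement).
        thrust::device_vector<my_half> a(10, my_half(1.0f));
        float t = thrust::reduce(a.cbegin(), a.cend(), 0.0f, thrust::plus<float>());
        std::cout << "test = " << t << std::endl;
        return 0;
}

The key design choice in this sketch is to accumulate in float via thrust::plus<float>, so the wrapper only ever needs conversions, not half arithmetic operators; a full drop-in half replacement would need far more work, as noted above.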