Tags: c++, templates, gcc, cuda, explicit-instantiation

Using function-templated code across the g++-nvcc boundary (including kernels)


Suppose I compile the following with NVIDIA CUDA's nvcc compiler:

template<typename T, typename Operator>
__global__ void fooKernel(T t1, T t2)  {
    Operator op;
    doSomethingWith(t1, t2);
}

template<typename T>
__device__ __host__ T bar(T t1, T t2)  {
    return t1 + t2;
}

template<typename T, typename Operator>
void foo(T t1, T t2)  {
    fooKernel<T, Operator><<<2, 2>>>(t1, t2);
}

// explicit instantiation
template decltype(foo<int, bar<int>>) foo<int, bar<int>>;

Now, I want my gcc-compiled, non-nvcc code to call foo():

...

template<typename T, typename Operator> void foo(T t1, T t2);


foo<int, bar<int>> (123, 456);
...

I have the appropriate (?) instantiation in the .o/.a/.so file I compile with CUDA.

Can I make that happen?


Solution

  • The problem here is that templated code is typically instantiated at the point of use, which doesn't work here because foo() contains a kernel launch that g++ cannot parse. Your approach of explicitly instantiating the template and forward-declaring it for the host compiler is the right one. Here is how to do it. I slightly fixed up your code and split it into three files:

    1. gpu.cu
    2. gpu.cuh
    3. cpu.cpp

    gpu.cuh

    This file contains the templated code for use by gpu.cu. I gave your foo() function some real work to do so that we can verify it works.

    #pragma once
    #include <cuda_runtime.h>
    
    template <typename T>
    struct bar {
        __device__ __host__ T operator()(T t1, T t2)
        {
            return t1 + t2;
        }
    };
    
    template <template <typename> class Operator, typename T>
    __global__ void fooKernel(T t1, T t2, T* t3)
    {
        Operator<T> op;
        *t3 = op(t1, t2);
    }
    
    template <template <typename> class Operator, typename T>
    T foo(T t1, T t2)
    {
        // Allocate device storage for the result, run the kernel on a
        // single thread, copy the result back, and free the device memory.
        T* t3_d;
        T t3_h;
        cudaMalloc(&t3_d, sizeof(*t3_d));
        fooKernel<Operator><<<1, 1>>>(t1, t2, t3_d);
        cudaMemcpy(&t3_h, t3_d, sizeof(*t3_d), cudaMemcpyDeviceToHost);
        cudaFree(t3_d);
        return t3_h;
    }
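
    For brevity, the example above ignores the return codes of the CUDA runtime calls. In real code you would want to check them; below is a minimal sketch of such a check, where the CUDA_CHECK macro is my own helper, not part of the CUDA API:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    
    // Hypothetical helper: abort with a readable message if a CUDA
    // runtime call returns anything other than cudaSuccess.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)
    
    // Usage inside foo():
    //   CUDA_CHECK(cudaMalloc(&t3_d, sizeof(*t3_d)));
    //   CUDA_CHECK(cudaMemcpy(&t3_h, t3_d, sizeof(*t3_d), cudaMemcpyDeviceToHost));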
    

    gpu.cu

    This file only instantiates the foo() function to make sure it will be available for linking:

    #include "gpu.cuh"
    
    template int foo<bar>(int, int);
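
    Note that every combination of template arguments you want to call from plain C++ needs its own explicit instantiation here. For example, to also make the float version available to the host linker (my addition, not in the original code):

    // Hypothetical extra instantiation: exposes foo<bar>(float, float)
    // for linking as well; T is deduced as float from the signature.
    template float foo<bar>(float, float);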
    

    cpu.cpp

    In this plain C++ source file, we must make sure we do not trigger any template instantiations, since g++ cannot compile the kernel launch inside foo() and that would be a compile error. Instead we only forward-declare the struct bar and the function foo. The code looks like this:

    #include <cstdio>
    
    template <template <typename> class Operator, typename T>
    T foo(T t1, T t2);
    
    template <typename T>
    struct bar;
    
    int main()
    {
        printf("%d \n", foo<bar>(3, 4));
    }
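
    If several host source files need these declarations, it may be cleaner to move them into a small header that both sides share; a sketch, where the file name gpu_api.h is my own invention:

    // gpu_api.h -- hypothetical host-side header. It declares the GPU
    // entry points without their definitions, so g++ never sees any
    // CUDA syntax.
    #pragma once
    
    template <typename T>
    struct bar;
    
    template <template <typename> class Operator, typename T>
    T foo(T t1, T t2);

    cpu.cpp would then simply #include "gpu_api.h" instead of repeating the declarations.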
    

    Makefile

    This will put the code all together into an executable:

    .PHONY: clean all
    all: main
    
    clean:
            rm -f *.o main
    
    main: gpu.o cpu.o
            g++ -L/usr/local/cuda/lib64 $^ -lcudart -o $@
    
    gpu.o: gpu.cu
            nvcc -c -arch=sm_20 $< -o $@
    
    cpu.o: cpu.cpp
            g++ -c $< -o $@
    

    Device code is compiled by nvcc, host code by g++, and everything is linked together by g++. Upon running you see the beautiful result:

    7
    

    The key thing to remember here is that kernel launches and kernel definitions have to be in the .cu files that are compiled by nvcc. For future reference, I will also leave this link here, on separation of linking and compilation with CUDA.
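
    If your device code ever spans multiple .cu files (for example, kernels in one file calling __device__ functions defined in another), you will additionally need nvcc's relocatable device code mode and a device-link step. A rough sketch of how the Makefile rules would change, not tested against this example:

    gpu.o: gpu.cu
            nvcc -dc -arch=sm_20 $< -o $@
    
    # Extra device-link step required by relocatable device code.
    dlink.o: gpu.o
            nvcc -dlink -arch=sm_20 $< -o $@
    
    main: gpu.o dlink.o cpu.o
            g++ -L/usr/local/cuda/lib64 $^ -lcudart -o $@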