
Getting CUDA error "declaration is incompatible with previous "variable_name"


I'm trying to compile a program with MSVS 2012 and CUDA, including a kernel that uses shared memory. Unlike in this question about the same error, I only use the variable name for this kernel's shared memory once, so redefinition isn't the issue. With code like this:

template<typename T>
__global__ void mykernel(
    const T* __restrict__ data,
    T*       __restrict__ results) 
{
    extern __shared__ T warp_partial_results[];
    /* ... */
    warp_partial_results[lane_id] = something;
    /* ... */
    results[something_else] = warp_partial_results[something_else];
    /* ... */
}

which is instantiated for several types (e.g. float, int, unsigned int), I get the dreaded

declaration is incompatible with previous "warp_partial_results"

message. What could cause this?


Solution

  • CUDA doesn't directly support dynamically-allocated shared memory arrays in templated functions, because the compiler generates an actual definition for each such extern declaration. If you instantiate the template for multiple types, those definitions conflict: the same symbol ends up defined with different types.

    A workaround is available in the form of template specialization via classes. You can use either NVIDIA's implementation or a more convenient one, described below.

    The NVIDIA implementation

    See:

    http://www.ecse.rpi.edu/~wrf/wiki/ParallelComputingSpring2015/cuda/nvidia/samples/0_Simple/simpleTemplates/sharedmem.cuh

    You use the workaround as follows:

    template <class T>
    __global__ void foo(T* g_idata, T* g_odata)
    {
        // Shared memory; the size is determined by the host application
        // via the kernel launch configuration.
        SharedMem<T> shared;
        T* sdata = shared.getPointer();

        // ... the rest of the code remains unchanged!
    }
    

    getPointer() has(*) a specialized implementation for each type, and each specialization returns a pointer to a differently-named array, e.g. extern __shared__ float shared_mem_float[] or extern __shared__ int shared_mem_int[] etc.
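    The mechanism can be sketched in plain, host-compilable C++ (a sketch: the names follow the pattern above, and static buffers stand in for the extern __shared__ arrays the real CUDA header declares):

    ```cpp
    #include <cassert>

    // Primary template: deliberately left undefined, so that an
    // unsupported type fails at compile time instead of silently
    // aliasing another type's buffer.
    template <typename T>
    struct SharedMem;

    // One specialization per supported type. Each returns a pointer to
    // a distinctly-named buffer, so no two instantiations of a templated
    // function ever define the same symbol with different types.
    template <>
    struct SharedMem<float> {
        float* getPointer() {
            static float shared_mem_float[256];  // stand-in for extern __shared__
            return shared_mem_float;
        }
    };

    template <>
    struct SharedMem<int> {
        int* getPointer() {
            static int shared_mem_int[256];  // stand-in for extern __shared__
            return shared_mem_int;
        }
    };

    // A templated function using the workaround, analogous to a kernel.
    template <typename T>
    T store_and_load(T value) {
        SharedMem<T> shared;
        T* sdata = shared.getPointer();
        sdata[0] = value;
        return sdata[0];
    }

    int main() {
        // Both instantiations coexist: each uses its own named buffer.
        assert(store_and_load(3.5f) == 3.5f);
        assert(store_and_load(42) == 42);
        return 0;
    }
    ```

    The key design point is that the type appears only in the specialization's buffer name, not in an extern declaration inside the templated function itself, so multiple instantiations never clash.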

    A nicer implementation

    In my own cuda-kat library, there's a facility for that. You just write:

    auto foo = kat::shared_memory::dynamic::proxy<T>();
    

    and foo is a T* to your shared memory. You can also write:

    auto n = kat::shared_memory::dynamic::size<T>();
    

    which gets you the number of elements of type T fitting into the allocated dynamic shared memory.
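    With that, the kernel from the question might look like this (a sketch, assuming only the two library calls shown above; the rest is the question's own placeholder code):

    ```cuda
    template <typename T>
    __global__ void mykernel(
        const T* __restrict__ data,
        T*       __restrict__ results)
    {
        // One proxy call replaces the extern __shared__ declaration;
        // each instantiation gets a correctly-typed T* without any
        // conflicting symbol definitions.
        T* warp_partial_results = kat::shared_memory::dynamic::proxy<T>();
        /* ... */
        warp_partial_results[lane_id] = something;
        /* ... */
        results[something_else] = warp_partial_results[something_else];
        /* ... */
    }
    ```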

    Naturally, I'm partial to my own solution, so choose whatever works for you.


    (*) Not really: in NVIDIA's supplied header file, they specialize for some basic types and that's that.