c++ templates cuda nvcc

NVIDIA NVCC changes compile-time constant when using template trait types


I see strange behaviour in NVIDIA NVCC (tested with CUDA 4.0 and 4.1) when using C++ templates. I boiled it down to a simple example that demonstrates the behaviour.

This has already been written up as a bug report. However, I am also posting it here, since this site is becoming a reliable source for bugs and fixes, and I will keep this page updated.

Code:

#include <stdio.h>

#define PETE_DEVICE __device__

template<class T, int N>  class ILattice;
template<class T>         class IScalar;
template<class T, int IL> struct AddILattice {};

template<class T>
PETE_DEVICE
void printType() {
  printf("%s\n",__PRETTY_FUNCTION__);
}

template<class T> class IScalar {
  T F;
};

template<class T, int N> class ILattice {
  T F[N];
};

template<class T, int N>
struct AddILattice<IScalar<T> , N> {
  typedef ILattice< T , N > Type_t;
};

#define IL 16

__global__ void kernel()
{
  printf("IL=%d\n",IL);  // Here IL==16

  typedef typename AddILattice<IScalar<float> ,IL>::Type_t Tnew;

  // This still works fine. Output:
  // void printType() [with T = ILattice<float, 16>]
  //
  printType<Tnew>();

  // Now problems begin: Output:
  // T=4 Tnew=0 IL=64
  // Here IL should still be 16
  // sizeof(Tnew) should be 16*sizeof(float)
  //
  printf("T=%d Tnew=%d IL=%d\n",sizeof(IScalar<float> ),sizeof(Tnew),IL);   
}   

int main()
{
    dim3  blocksPerGrid( 1 , 1 , 1 );
    dim3  threadsPerBlock( 1 , 1, 1);
    kernel<<< blocksPerGrid , threadsPerBlock , 48*1024 >>>( );

    cudaDeviceSynchronize();
    cudaError_t kernel_call = cudaGetLastError();
    printf("call: %s\n",cudaGetErrorString(kernel_call));

}

Any ideas why the compiler changes IL from 16 to 64?


Solution

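  • Most likely because you use the wrong printf conversion. %d tells printf to expect an int, but sizeof yields a size_t, not an int. Use the size_t length modifier with an unsigned conversion instead, i.e. replace %d with %zu for the two sizeof arguments.

    Because the arguments are passed through a variable argument list, printf cannot know what types were really passed, and no type conversion happens; the only type information it has comes from the format string. So the format string MUST match the arguments. On a system where size_t has the same size as int (e.g. many 32-bit systems), your code happens to work, but you can't rely on that, and using the right conversions avoids the problem.

    The output you see is consistent with an 8-byte size_t being read in 4-byte pieces: the %d conversions for T and Tnew consume the low and high halves of sizeof(IScalar<float>) (4 and 0), and the third %d then picks up the low half of sizeof(Tnew), which is 16*sizeof(float) = 64.

    (So it's not the compiler changing your constant; you are just printing it wrong.)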