Tags: cuda, nvcc

Invalid device symbol error depending on nvcc flags


Toy program:

#include <iostream>
#include <vector>

// Matrix side size (they are square).
const int N = 3;
const int num_mats = 14;

// Rotation matrices.
__constant__ float rot_mats_device[num_mats*N*N];

int main() {
  std::vector<float> rot_mats_host(num_mats*N*N);
  for (size_t i = 0; i < rot_mats_host.size(); i++)
    rot_mats_host[i] = static_cast<float>(i);

  auto errMemcpyToSymbol = cudaMemcpyToSymbol(rot_mats_device,
                                              rot_mats_host.data(),
                                              sizeof(rot_mats_device));

  if (errMemcpyToSymbol != cudaSuccess) {
    std::cout << "MemcpyToSymbol error: " <<
      cudaGetErrorString(errMemcpyToSymbol) << std::endl;
  }
}

Compiled with

nvcc -arch=sm_52 -std=c++11 cuda_invalid_symbol_error.cu -o cuda_invalid_symbol_error

runs without any error at runtime. However, with

nvcc -gencode arch=compute_52,code=sm_52 -std=c++11 cuda_invalid_symbol_error.cu -o cuda_invalid_symbol_error

it fails at runtime with the message "MemcpyToSymbol error: invalid device symbol".

Why do the latter instructions for compilation give the runtime error?

Specs: CUDA 8.0, Ubuntu 16.04, GeForce GTX 1060 (I know the compute capability of this card is 6.1).


Solution

  • Why do the latter instructions for compilation give the runtime error?

    -arch=sm_xx is shorthand for:

    -gencode arch=compute_xx,code=sm_xx -gencode arch=compute_xx,code=compute_xx
    

    In your case, where xx is 52, this command embeds both cc 5.2 PTX code (the second -gencode instance) and cc 5.2 SASS code (the first -gencode instance). The SASS code for cc 5.2 will not run on your cc 6.1 device, so the runtime JIT-compiles the cc 5.2 PTX code into an object compatible with your cc 6.1 architecture. All is happy and everything works.
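    As a sketch (build lines only, not verified on your system), the shorthand expands to the two explicit -gencode clauses below; alternatively, you could target your cc 6.1 device directly so no JIT step is needed at load time:

    ```shell
    # Explicit equivalent of: nvcc -arch=sm_52 ...
    # (embeds cc 5.2 SASS *and* cc 5.2 PTX for forward JIT compilation)
    nvcc -gencode arch=compute_52,code=sm_52 \
         -gencode arch=compute_52,code=compute_52 \
         -std=c++11 cuda_invalid_symbol_error.cu -o cuda_invalid_symbol_error

    # Or embed SASS for the actual cc 6.1 device directly:
    nvcc -gencode arch=compute_61,code=sm_61 \
         -std=c++11 cuda_invalid_symbol_error.cu -o cuda_invalid_symbol_error
    ```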

    When you instead compile with:

    nvcc -gencode arch=compute_52,code=sm_52 ...
    

    you are omitting the PTX code from the compiled object. Only the cc 5.2 SASS code is present. This code will not run on your cc 6.1 device, and the runtime has no other options, so a "hidden" error of NO_BINARY_FOR_GPU occurs when the runtime attempts to load the GPU image for your program. Since no image gets loaded, no device symbol is present/usable. Since it is not present/usable, you get the invalid device symbol error when you attempt to refer to it using the CUDA runtime API.

    If you had performed another CUDA runtime API call prior to this which forced a sufficient or equivalent level of initialization of the CUDA runtime (and checked the returned error code), you would have received a NO_BINARY_FOR_GPU error or similar. Certainly, for example, if you had attempted to launch a GPU kernel, you would receive that error. There may be other CUDA runtime API calls that would force an equivalent or sufficient level of lazy initialization, but I don't have a list of those.
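    As an illustrative sketch (untested here, since it requires a CUDA toolchain and a GPU), launching a trivial kernel before the symbol copy is one way to surface the hidden image-load failure with a more descriptive error:

    ```cuda
    #include <cstdio>

    // Trivial kernel; launching it forces the runtime to load the GPU image.
    __global__ void probe() {}

    int main() {
      probe<<<1, 1>>>();
      // With SASS-only cc 5.2 code on a cc 6.1 device, the launch is where
      // the NO_BINARY_FOR_GPU-style error is reported, rather than the less
      // obvious "invalid device symbol" from cudaMemcpyToSymbol.
      cudaError_t err = cudaGetLastError();
      if (err != cudaSuccess) {
        printf("Launch error: %s\n", cudaGetErrorString(err));
        return 1;
      }
      printf("GPU image loaded OK\n");
      return 0;
    }
    ```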