Runtime cudaErrorInsufficientDriver error from cudaGetDeviceCount when compiling with nvcc, icpc

PROBLEM

I have an FFT-based application that uses FFTW3. I am working on porting the application to a CUDA-based implementation using CUFFT. Compiling and running the FFT core of the application standalone within Nsight works fine. I have moved from there to integrating the device code into my application.

When I run with the CUFFT core code integrated into my application, cudaGetDeviceCount returns a cudaErrorInsufficientDriver error, although I did not get it with the Nsight standalone run. This call is made at the beginning of the run, when I'm initializing the GPU.

BACKGROUND

I am running on CentOS 6, using CUDA 7.0 on a GeForce GTX 750, and icpc 12.1.5. I have also successfully tested a small example using a GT 610. Both cards work in Nsight (and I've also compiled and run command-line without problems, though not as extensively as from within Nsight).

To integrate the CUFFT implementation of the FFT core into my application, I compiled and device-linked with nvcc and then used icpc (the Intel C++ Compiler) to compile the host code and to link the device and host code to create a .so. I finally completed that step without errors or warnings (relying on this tutorial).

(The reasoning as to why I'm using a .so has a fair amount of history and additional background. Suffice it to say that making a .so is required for my application.)

The tutorial points out that the compilation steps differ between generating a standalone executable (as I do in Nsight) and generating a device-linked library for inclusion in a .so. To get through the compilation, I had to add -lcudart, as described in the tutorial, as well as -lcuda to my icpc linking call. I also added -L flags for .../cuda-7.0/lib64 and .../cuda-7.0/lib64/stubs, the paths to those libraries.

NOTE: nvcc links in libcudart by default. I'm assuming it does the same for libcuda, since Nsight doesn't include either of these libraries in any of its compile and linking steps. As an aside, I do find it strange that although nvcc links them in by default, they don't show up in the output of ldd on the executable.
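To see which CUDA libraries the loader actually resolves for a given binary, something like the sketch below can help. The function name cuda_libs_of and the file name my_library.so are placeholders made up for illustration. (One likely explanation for the empty ldd output is that nvcc links the static runtime, libcudart_static.a, by default, so it would not appear as a dynamic dependency, but treat that as an assumption to verify on your own install.)

```shell
# Hypothetical helper: list the CUDA-related shared libraries that the
# dynamic loader would resolve for a given executable or .so.
cuda_libs_of() {
    # ldd prints one line per dynamic dependency. Keep only lines that
    # mention libcuda (this also matches libcudart) or libcufft.
    # "|| true" keeps the function from failing when nothing matches.
    ldd "$1" 2>/dev/null | grep -E 'libcuda|libcufft' || true
}

# Example call; my_library.so is a placeholder name, not from my build.
cuda_libs_of ./my_library.so
```

If a line shows libcuda resolving to a path under a stubs directory, the loader is not using the library installed by the driver.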

I also had to add --compiler-options '-fPIC' to my nvcc commands to avoid errors described here.

I have seen some chatter (for one example, see this post) about Intel/NVCC incompatibilities, but they appear to arise at compile time with older versions of NVCC, so I think I'm OK on that account.

Here is the compile command I use for each of the three .cu files (the commands are identical except for the name of the .cu file and the name of the .o file):

nvcc
-ccbin g++
-Iinc
-I/path/to/cuda/samples/common/inc
-m64
-O3
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_37,code=sm_37
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_52,code=compute_52
--relocatable-device-code=true
--compile
--compiler-options '-fPIC'
-o my_object_file1.o
-c my_source_code_file1.cu

And here are the flags I pass to the device linking step:

nvcc
-ccbin g++
-Iinc
-I/path/to/cuda/samples/common/inc
-m64
-O3
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_37,code=sm_37
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_52,code=compute_52
--compiler-options '-fPIC'
--device-link
my_object_file1.o
my_object_file2.o
my_object_file3.o
-o my_device_linked_object_file.o

I probably don't need the -gencode flags for 30, 37, and 52, at least currently, but they shouldn't cause any problems, and eventually, I will likely compile that way.

And here are the compile flags (minus the -o flag and all my -I flags) that I use for the .cc file that calls my CUDA library:

-c
-fpic
-D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64
-fno-operator-names
-D_REENTRANT
-D_POSIX_PTHREAD_SEMANTICS
-DM2KLITE -DGCC_
-std=gnu++98
-O2
-fp-model source
-gcc
-wd1881
-vec-report0

Finally, here are my linking flags:

-pthread
-shared

Any ideas on how to fix this problem?

SOLUTION

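Don't add .../cuda-7.0/lib64/stubs to LD_LIBRARY_PATH. If you do, the loader picks up the stub libcuda.so from that directory at run time instead of the real libcuda.so installed by the driver. (See this post.) The stubs directory is only meant to satisfy the linker at build time, so passing it with -L is fine.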
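As a rough sketch of how to clean up an environment that already has a stubs directory on the path, assuming a POSIX-style shell: the helper name strip_stubs is made up for illustration, and it simply drops every entry that ends in /stubs.

```shell
# Hypothetical helper: print a colon-separated path list with every
# entry that ends in /stubs (or /stubs/) removed.
strip_stubs() {
    old_ifs=$IFS
    IFS=:
    result=
    # Unquoted on purpose so the shell splits $1 on colons.
    for entry in $1; do
        case "$entry" in
            */stubs|*/stubs/) ;;   # skip stub directories
            '') ;;                 # skip empty entries
            *) result="${result:+$result:}$entry" ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Apply it to the current environment before launching the application.
LD_LIBRARY_PATH=$(strip_stubs "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
```

After this, running ldd on the .so should no longer show libcuda resolving to a path under stubs.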