CUDA Runtime difference between release mode and debug mode

I am running Visual Studio 2013 with CUDA 7.0.28.

I can toggle the runtime difference just by checking or unchecking the CUDA option "Generate GPU Debug Information".

I launch the device kernel with <<<1,1>>>, and the error occurs even then.

My questions are:

Why would it give me different results in the release and debug mode?
What kinds of things should I look for to track down why this is occurring?
Is there a way to set a breakpoint within the kernel function? It does not appear so. Besides adding printf statements, what other means can I use to trace the problem?

Thank you.

Solution

Why would it give me different results in the release and debug mode?

Under the hood, the machine code generated from CUDA C/C++ source code looks very different in debug mode. The list of differences is too long to cover here, but most of them come down to one thing: compiler optimizations are turned off in debug mode. That changes instruction ordering and timing, so a latent bug such as a race condition may be evident in debug but not in release, or vice versa.
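As an illustration of that class of bug, here is a minimal sketch of a shared-memory race. The kernel `reverse`, the helper `run_reverse`, and the size `N` are hypothetical names made up for this example, not taken from the question. It needs more than one thread, so it shows the general idea rather than the <<<1,1>>> case specifically.

```cuda
#include <cuda_runtime.h>

const int N = 64;  // two warps, so the threads do not all run in lockstep

// Hypothetical kernel: reverses N ints in place through shared memory.
__global__ void reverse(int *data, bool use_barrier)
{
    __shared__ int tile[N];
    int t = threadIdx.x;
    tile[t] = data[t];
    // use_barrier has the same value in every thread, so the conditional
    // barrier is legal. Without it, the read below races with another
    // thread's write above and the result is undefined.
    if (use_barrier) __syncthreads();
    data[t] = tile[N - 1 - t];
}

// Hypothetical host helper: runs the kernel once and counts wrong elements.
// Returns -1 if a CUDA call failed.
int run_reverse(bool use_barrier)
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = 0;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return -1;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse<<<1, N>>>(d, use_barrier);
    // The blocking copy also surfaces any error from the kernel itself.
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < N; ++i) wrong += (h[i] != N - 1 - i);
    return wrong;
}
```

Calling `run_reverse(true)` from `main` should report zero wrong elements. With `run_reverse(false)` the count is not defined: it may be zero in one build configuration and nonzero in another, which is the kind of debug/release discrepancy described above.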

What kinds of things should I look for to track down why this is occurring?

I would start with the simplest tools. First run cuda-memcheck by itself, with no subtool options, to confirm that the kernel runs without generating basic errors. If cuda-memcheck reports that your kernel is failing, follow the method here to isolate the failure to a single line of source code. After fixing any errors reported this way, run the cuda-memcheck subtools, including racecheck, synccheck, and initcheck, to see whether any of them catch further problems.
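The workflow above might look roughly like the sketch below. The kernel `fill`, the helper `run_fill`, and the file name `app.cu` are assumptions for illustration only. The host-side error checks are a complement to cuda-memcheck, not a replacement for it, and `-lineinfo` is one nvcc option that lets cuda-memcheck attach source line numbers to its reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to check.
// Dropping the `i < n` guard would let the extra threads write out of
// bounds, which is the kind of error plain cuda-memcheck reports.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t run_fill(int n)
{
    int *d = 0;
    cudaError_t err = cudaMalloc(&d, n * sizeof(int));
    if (err != cudaSuccess) return err;

    fill<<<(n + 127) / 128, 128>>>(d, n);
    err = cudaGetLastError();                               // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors

    cudaFree(d);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err;
}

// Possible command lines, assuming this file is app.cu and main() calls
// run_fill (on Windows the executable would be app.exe):
//   nvcc -lineinfo -o app app.cu
//   cuda-memcheck app
//   cuda-memcheck --tool racecheck app
//   cuda-memcheck --tool synccheck app
//   cuda-memcheck --tool initcheck app
```

Running the plain `cuda-memcheck` pass first keeps the output short; the subtool passes are slower and are easier to interpret once the basic memory errors are gone.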

Is there a way to set a breakpoint within the kernel function?

Yes, there are debuggers available on both Windows and Linux. On Windows, the debugger is integrated into Visual Studio. There is documentation available, as well as walkthroughs and even YouTube videos demonstrating how to perform various operations, such as setting a breakpoint. I wouldn't go down this path before using cuda-memcheck, however.