I have a kernel that may execute asm("trap;"). When that happens, the CUDA error code is set to a launch failure, and I cannot reset it.
In the CUDA Runtime API, we can use cudaGetLastError to get the last error and, at the same time, reset it to cudaSuccess.
Is there a way to do that with the Driver API?
This type of error cannot be reset with the CUDA Runtime API function cudaGetLastError().
There are two types of CUDA runtime errors: "sticky" and "non-sticky". Non-sticky errors are those that do not corrupt the context. For example, a cudaMalloc request that asks for more than the available memory will fail, but it will not corrupt the context. Such an error is non-sticky.
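As a rough illustration of the non-sticky case, the sketch below forces an allocation failure and then queries the error state. The 1 EiB request size is an arbitrary value chosen only to exceed any realistic device memory, and a 64-bit size_t is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;

    // Assumption: 1 EiB is far more than any GPU can provide, so this fails.
    cudaError_t err = cudaMalloc(&p, size_t(1) << 60);
    printf("oversized cudaMalloc: %s\n", cudaGetErrorString(err));

    // The first query returns the recorded error and resets the error state.
    printf("cudaGetLastError #1:  %s\n", cudaGetErrorString(cudaGetLastError()));
    // The second query reports whatever state remains after that reset.
    printf("cudaGetLastError #2:  %s\n", cudaGetErrorString(cudaGetLastError()));

    // The context was not corrupted, so a modest allocation can still be tried.
    err = cudaMalloc(&p, 1024);
    printf("small cudaMalloc:     %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

If the error really is non-sticky, the second query and the small allocation should both come back clean.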
Errors that involve unexpected termination of a CUDA kernel are "sticky". This includes your trap example, in-kernel assert() failures, and runtime-detected execution errors such as out-of-bounds accesses. You cannot clear sticky errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset(), which eliminates all device allocations and wipes out the context.
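A minimal sketch of the sticky case follows. The kernel name trap_kernel is made up for this example, and the program only prints what each call reports rather than assuming exact error codes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread aborts via the PTX trap instruction.
__global__ void trap_kernel() {
    asm volatile("trap;");
}

int main() {
    trap_kernel<<<1, 1>>>();

    // The failure surfaces when the host synchronizes with the device.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));

    // Querying the error does not repair the corrupted context.
    printf("cudaGetLastError:      %s\n", cudaGetErrorString(cudaGetLastError()));

    // Any later meaningful use of the runtime API is expected to fail as well.
    void *p = nullptr;
    err = cudaMalloc(&p, 1024);
    printf("cudaMalloc afterwards: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```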
The corresponding driver API function is cuDevicePrimaryCtxReset().
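A hedged sketch of how that call might be used with the driver API is shown below. The helper names check and name_of are invented for this example, device 0 is assumed, and the placeholder comment marks where real module loading and kernel launches would go. It needs to be linked against the driver library (for example with -lcuda).

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical helper: turn a CUresult into a printable name.
static const char *name_of(CUresult r) {
    const char *s = nullptr;
    cuGetErrorName(r, &s);
    return s ? s : "unknown";
}

// Hypothetical helper: report a failing driver call and pass the result on.
static CUresult check(CUresult r, const char *what) {
    if (r != CUDA_SUCCESS) fprintf(stderr, "%s failed: %s\n", what, name_of(r));
    return r;
}

int main() {
    CUdevice dev;
    CUcontext ctx;
    if (check(cuInit(0), "cuInit") != CUDA_SUCCESS) return 1;
    if (check(cuDeviceGet(&dev, 0), "cuDeviceGet") != CUDA_SUCCESS) return 1;
    if (check(cuDevicePrimaryCtxRetain(&ctx, dev), "retain") != CUDA_SUCCESS) return 1;
    if (check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent") != CUDA_SUCCESS) return 1;

    // ... load a module and launch kernels here (omitted in this sketch) ...

    // A kernel fault would normally be reported by the next synchronization.
    if (check(cuCtxSynchronize(), "cuCtxSynchronize") != CUDA_SUCCESS) {
        // Destroys all resources of the primary context on this device.
        check(cuDevicePrimaryCtxReset(dev), "cuDevicePrimaryCtxReset");
    }

    // Resetting does not release the context, so release it explicitly.
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

As with cudaDeviceReset(), every allocation and module in the primary context is lost, so the application would have to retain the context again and rebuild its state afterwards.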
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
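One way to live with that constraint is to run the risky work in a child process, so that the process owning the corrupted context simply exits. The sketch below assumes a POSIX system (fork and waitpid) and assumes the parent makes no CUDA call before forking. The names trap_kernel and run_risky_work are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical kernel that always faults.
__global__ void trap_kernel() {
    asm volatile("trap;");
}

// Hypothetical worker: returns 0 on success, 1 if the kernel faulted.
static int run_risky_work() {
    trap_kernel<<<1, 1>>>();
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

int main() {
    // Fork before any CUDA call so the parent owns no context yet.
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        // Child: owns the context that may become corrupted, then terminates.
        _exit(run_risky_work());
    }

    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exit code: %d\n", WEXITSTATUS(status));

    // Parent: creates its own context only now, after the child is gone.
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1024);
    printf("parent cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```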
Note that in recent versions of CUDA, there may be situations where multiple calls to cudaGetLastError() cause the reported error to change back to cudaSuccess, even in the presence of a sticky error. However, any further attempt to make meaningful use of the runtime API at that point will again report the sticky error, until the context is destroyed, e.g. with cudaDeviceReset() or termination of the owning process.