
CUDA synchronize function fails during long running kernel


I'm using PyCUDA to run a kernel that is expected to take at least two hours to complete, but it fails after about one hour with only this error:

pycuda._driver.Error: cuCtxSynchronize failed: unknown error

I'm on Windows, and I added the registry key TdrDelay and set it to 120000000 so that Windows does not time out my kernel.

This error doesn't happen when I adjust the parameters of the kernel so it is expected to complete in about 30 minutes. Why could the synchronize call be failing after the kernel has run for a long time?

Could my graphics card be overheating and preemptively terminating the kernel? Is there a CUDA setting that terminates a kernel if it runs for too long? Would running the kernel in the NVIDIA Visual Profiler help identify the problem?


Solution

  • I was able to get my long-running kernel to complete without error by adding the registry key "TdrLevel" alongside "TdrDelay" and setting "TdrLevel" to 0, which turns off Windows timeout detection and recovery (TDR) altogether.
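
For anyone who would rather script the change than edit the registry by hand, here is a minimal Python sketch of the fix above. It assumes the TDR values live where Microsoft documents them, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, as REG_DWORD values. The helper names tdr_settings, apply_tdr_settings and read_tdr_settings are made up for illustration and are not part of PyCUDA or CUDA. Writing the values needs an elevated (administrator) Python process, and a reboot is generally required before the change takes effect.

```python
# Registry location of the TDR values (an assumption based on
# Microsoft's TDR registry key documentation).
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_settings(level=0, delay_seconds=None):
    """Build the REG_DWORD values to write (hypothetical helper).

    level 0 turns timeout detection off; 3 is the Windows default
    (recover on timeout). delay_seconds is optional and maps to TdrDelay.
    """
    if level not in (0, 1, 2, 3):
        raise ValueError("TdrLevel must be 0, 1, 2 or 3")
    settings = {"TdrLevel": level}
    if delay_seconds is not None:
        if not 0 <= delay_seconds <= 0xFFFFFFFF:
            raise ValueError("TdrDelay must fit in a 32-bit DWORD")
        settings["TdrDelay"] = delay_seconds
    return settings


def apply_tdr_settings(settings):
    """Write the values to the registry. Windows only, needs admin rights."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, TDR_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        for name, value in settings.items():
            winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)


def read_tdr_settings(names=("TdrLevel", "TdrDelay")):
    """Read the values back; names that are not set come back as None."""
    import winreg

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
        for name in names:
            try:
                result[name], _type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                result[name] = None
    return result
```

On the Windows machine you would call apply_tdr_settings(tdr_settings(level=0)), reboot, and then use read_tdr_settings() to confirm that the value is present. Note that with TDR off, a kernel that really hangs will freeze the display until the machine is reset, so it is probably worth restoring "TdrLevel" to 3 once the long run is finished.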