release, mxnet, cudnn

CuDNN code gives CUDNN_STATUS_EXECUTION_FAILED status only in release


I am compiling a git version of the MXNet framework, which uses CuDNN internally. When MXNet is compiled in debug mode, my example test runs fine and my neural network trains. However, when I switch to release mode, a check fails during execution and I get the following error: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED.

Note: I don't see any release- or debug-specific code that could explain the different behaviour. I also didn't have any problem with either the release or the debug build until I enabled CuDNN, so I believe it is the culprit.

The symptoms:

  • The code doesn't necessarily crash at the same location, but it always fails during a CUDNN_CALL (a macro that calls a CuDNN function and checks the returned status).
  • No memory is allocated on my GPU, which in any case has enough memory for such a network, so running out of memory shouldn't be the problem.
  • It happens only in release; in debug, it runs just fine.
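For readers unfamiliar with this kind of wrapper, here is a minimal sketch of a status-checking macro in the spirit of CUDNN_CALL. It is not MXNet's actual macro: the names StatusSketch, StatusName, DescribeFailure and CUDNN_CALL_SKETCH are hypothetical, and a stand-in enum replaces cudnnStatus_t so the sketch compiles without cuDNN installed. The numeric value 8 for CUDNN_STATUS_EXECUTION_FAILED matches the "(8 vs. 0)" in the error above, but is assumed to hold only for cuDNN 7-era headers.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for cudnnStatus_t (assumption: cuDNN 7-era numeric values;
// only the two codes that appear in the error message are listed).
enum StatusSketch {
  kSuccess = 0,
  kExecutionFailed = 8
};

// Maps a status code to a readable name. Real code would call
// cudnnGetErrorString() rather than maintain a table like this.
inline const char* StatusName(int status) {
  switch (status) {
    case 0: return "CUDNN_STATUS_SUCCESS";
    case 8: return "CUDNN_STATUS_EXECUTION_FAILED";
    default: return "CUDNN_STATUS_(other)";
  }
}

// Builds a message that records which call failed and where.
inline std::string DescribeFailure(int status, const char* expr,
                                   const char* file, int line) {
  std::ostringstream msg;
  msg << file << ":" << line << ": " << expr << " returned " << status
      << " (" << StatusName(status) << ")";
  return msg.str();
}

// Evaluates expr exactly once and throws if the status is not success.
#define CUDNN_CALL_SKETCH(expr)                                   \
  do {                                                            \
    const int status_ = static_cast<int>(expr);                   \
    if (status_ != 0) {                                           \
      throw std::runtime_error(                                   \
          DescribeFailure(status_, #expr, __FILE__, __LINE__));   \
    }                                                             \
  } while (0)
```

Because GPU work is launched asynchronously, a failure may be reported by a later call than the one that caused it, which could explain a crash location that moves around. In a real build, calling cudaDeviceSynchronize() right after the wrapped call, or running with the environment variable CUDA_LAUNCH_BLOCKING=1, may help pin the error to a single call.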

Here is an example of where I get the error:

CUDNN_CALL(cudnnAddTensor(s->dnn_handle_,
                          &alpha,
                          bias_desc_,
                          bias.dptr_ + bias_offset_ * g,
                          &beta_add,
                          out_desc_,
                          out_ptr + out_offset_ * g));

So, what could be causing such a problem?


Solution

  • For some reason, updating CuDNN to version 7.4 did the trick for me, so I guess it really was a problem with CuDNN on my side. I can only hypothesize that a bug fix in the newer release solved my problem, or that the version I was using was not fully compatible with my GPU.
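Since the fix came down to the CuDNN version, it can help to confirm which version a binary was built against and which one it loads at run time. The sketch below only decodes and compares version numbers. The names CudnnVersionSketch, DecodeCudnnVersion and SameMajorMinor are hypothetical, and the encoding major * 1000 + minor * 100 + patch is assumed to apply to the 7.x series discussed here. In a real program the two inputs would come from the CUDNN_VERSION macro in cudnn.h (compile time) and from cudnnGetVersion() (run time).

```cpp
#include <cstddef>

// Decoded cuDNN version triple (hypothetical helper type).
struct CudnnVersionSketch {
  int major;
  int minor;
  int patch;
};

// Splits an encoded version such as 7401 into 7, 4 and 1.
// Assumption: the 7.x encoding, major * 1000 + minor * 100 + patch.
constexpr CudnnVersionSketch DecodeCudnnVersion(std::size_t v) {
  return CudnnVersionSketch{static_cast<int>(v / 1000),
                            static_cast<int>((v % 1000) / 100),
                            static_cast<int>(v % 100)};
}

// True when the headers used at build time and the library loaded at
// run time agree on major and minor version (patch level is ignored).
constexpr bool SameMajorMinor(std::size_t compiled, std::size_t loaded) {
  return compiled / 100 == loaded / 100;
}
```

For example, SameMajorMinor(CUDNN_VERSION, cudnnGetVersion()) returning false would indicate that the headers and the loaded library come from different releases, which is worth ruling out before suspecting the GPU.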