UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR

I'm trying to train a model with Yolov8. Everything was good but today I suddenly notice getting this warning apparently related to PyTorch and cuDNN. In spite the warning, the training seems to be progressing though. I'm not sure if it has any negative effects on the training progress.

site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

What is the problem and how to address this?

Here is the output of collect_env:

Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31
Python version: 3.9.7 | packaged by conda-forge | (default, Sep  2 2021, 17:58:34)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 515.105.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                    x86_64

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.17.3
[pip3] onnxruntime-gpu==1.17.1
[pip3] onnxsim==0.4.36
[pip3] optree==0.11.0
[pip3] torch==2.3.0+cu118
[pip3] torchaudio==2.3.0+cu118
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] pytorch-quantization      2.2.1                    pypi_0    pypi
[conda] torch                     2.1.1+cu118              pypi_0    pypi
[conda] torchaudio                2.1.1+cu118              pypi_0    pypi
[conda] torchmetrics              0.8.0                    pypi_0    pypi
[conda] torchvision               0.16.1+cu118             pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi

Solution

In version 2.3.0 of pytorch, it prints this unwanted warning even if no exception is thrown: see https://github.com/pytorch/pytorch/pull/125790

As you mentioned, though, the training is processing correctly. If you want to get rid of this warning, you should revert to torch 2.2.2 (you then also have to revert torchvision to 0.17.2):

pip3 install torchvision==0.17.2
pip3 install torch==2.2.2