
What problems can lead to a CuDNNError with ConvolutionND


I am using three-dimensional convolution links (with ConvolutionND) in my chain.

The forward computation runs smoothly (I checked the intermediate result shapes to make sure I understood the meaning of the convolution_nd parameters correctly), but during the backward pass a CuDNNError is raised with the message CUDNN_STATUS_NOT_SUPPORTED.

The cover_all parameter of ConvolutionND has its default value of False, so from the documentation I don't see what could be causing the error.

Here is how I defined one of the convolution layers:

self.conv1 = chainer.links.ConvolutionND(3, 1, 4, (3, 3, 3)).to_gpu(self.GPU_1_ID)

And the call stack is:

File "chainer/function_node.py", line 548, in backward_accumulate
    gxs = self.backward(target_input_indexes, grad_outputs)
File "chainer/functions/connection/convolution_nd.py", line 118, in backward
    gy, W, stride=self.stride, pad=self.pad, outsize=x_shape)
File "chainer/functions/connection/deconvolution_nd.py", line 310, in deconvolution_nd
    y, = func.apply(args)
File "chainer/function_node.py", line 258, in apply
    outputs = self.forward(in_data)
File "chainer/functions/connection/deconvolution_nd.py", line 128, in forward
    return self._forward_cudnn(x, W, b)
File "chainer/functions/connection/deconvolution_nd.py", line 105, in _forward_cudnn
    tensor_core=tensor_core)
File "cupy/cudnn.pyx", line 881, in cupy.cudnn.convolution_backward_data
File "cupy/cuda/cudnn.pyx", line 975, in cupy.cuda.cudnn.convolutionBackwardData_v3
File "cupy/cuda/cudnn.pyx", line 461, in cupy.cuda.cudnn.check_status
cupy.cuda.cudnn.CuDNNError: CUDNN_STATUS_NOT_SUPPORTED

So are there any special points to take care of when using ConvolutionND?

For instance, here is a failing example:

import chainer
from chainer import functions as F
from chainer import links as L
from chainer.backends import cuda

import numpy as np
import cupy as cp

chainer.global_config.cudnn_deterministic = False

NB_MASKS = 60
NB_FCN = 3
NB_CLASS = 17

class MFEChain(chainer.Chain):
    """docstring for Wavelphasenet."""
    def __init__(self,
                 FCN_Dim,
                 gpu_ids=None):
        super(MFEChain, self).__init__()

        self.GPU_0_ID, self.GPU_1_ID = (0, 1) if gpu_ids is None else gpu_ids
        with self.init_scope():
            self.conv1 = chainer.links.ConvolutionND(3, 1, 4, (3, 3, 3)).to_gpu(
                self.GPU_1_ID
            )

    def __call__(self, inputs):
        ### Pad input ###
        processed_sequences = []
        for convolved in inputs:
            # Transform to sequences
            copy = convolved if self.GPU_0_ID == self.GPU_1_ID else F.copy(convolved, self.GPU_1_ID)
            processed_sequences.append(copy)

        reprocessed_sequences = []
        with cuda.get_device(self.GPU_1_ID):
            for convolved in processed_sequences:
                convolved = F.expand_dims(convolved, 0)
                convolved = F.expand_dims(convolved, 0)
                convolved = self.conv1(convolved)

                reprocessed_sequences.append(convolved)

            states = F.vstack(reprocessed_sequences)

            logits = states

            ret_logits = logits if self.GPU_0_ID == self.GPU_1_ID else F.copy(logits, self.GPU_0_ID)
        return ret_logits

def mfe_test():
    mfe = MFEChain(150)
    inputs = list(
        chainer.Variable(
            cp.random.randn(
                NB_MASKS,
                11,
                in_len,
                dtype=cp.float32
            )
        ) for in_len in [53248]
    )
    val = mfe(inputs)
    grad = cp.ones(val.shape, dtype=cp.float32)
    val.grad = grad
    val.backward()
    for i in inputs:
        print(i.grad)

if __name__ == "__main__":
    mfe_test()

Solution

  • cupy.cuda.cudnn.convolutionBackwardData_v3 is incompatible with some specific parameters, as described in an issue on the official GitHub repository.

    Unfortunately, the issue only dealt with deconvolution_2d.py (not deconvolution_nd.py), so I guess the decision about whether cuDNN should be used fails in your case.

    You can check your parameters by confirming the following (a minimal sketch follows the list):

    1. Check whether a dilation parameter (!= 1) or a group parameter (!= 1) is passed to the convolution.
    2. Print chainer.config.cudnn_deterministic, chainer.config.autotune, and chainer.config.use_cudnn_tensor_core.
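
    A minimal sketch of these checks, assuming a Chainer version (v4 or later) in which the autotune and use_cudnn_tensor_core flags exist:

    import chainer

    # Print the cuDNN-related configuration flags mentioned above.
    print(chainer.config.cudnn_deterministic)
    print(chainer.config.autotune)
    print(chainer.config.use_cudnn_tensor_core)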

    Further support may be obtained by raising an issue on the official GitHub repository.
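
    In the meantime, a possible workaround (my own suggestion, not something stated in the linked issue) is to bypass cuDNN entirely so that the pure-CuPy implementation is used:

    import chainer
    import cupy as cp

    # `mfe` and `inputs` are the MFEChain instance and the input list
    # from the question's mfe_test().
    with chainer.using_config('use_cudnn', 'never'):
        val = mfe(inputs)
        val.grad = cp.ones(val.shape, dtype=cp.float32)
        val.backward()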

    The code you showed is rather complicated.

    To isolate the problem, a minimal example like the code below would help.

    from chainer import Variable, Chain
    from chainer import links as L
    from chainer import functions as F
    
    import numpy as np
    
    batch_size = 1
    in_channel = 1
    out_channel = 1
    
    class MyLink(Chain):
        def __init__(self):
            super(MyLink, self).__init__()
            with self.init_scope():
                # Weight layout for ConvolutionND is (out_channel, in_channel, k1, k2, k3).
                self.conv = L.ConvolutionND(
                    3, in_channel, out_channel, (3, 3, 3), nobias=True,
                    initialW=np.ones((out_channel, in_channel, 3, 3, 3), dtype=np.float32))
    
        def __call__(self, x):
            return F.sum(self.conv(x))
    
    if __name__ == "__main__":
        my_link = MyLink()
        my_link.to_gpu(0)
        # Input layout is (batch, channel, dim1, dim2, dim3); use float32 on the GPU.
        batch = Variable(np.ones((batch_size, in_channel, 3, 3, 3), dtype=np.float32))
        batch.to_gpu(0)
        loss = my_link(batch)
        loss.backward()
        print(batch.grad)
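
    If this minimal example backpropagates without error on your machine, the difference most likely lies in your input shapes (for example, the very long 53248 axis in your snippet) or in the configuration flags above, which you can then bisect.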