I'm following this guide without changing anything. I'm using an aws server with deep learning ami: Deep Learning AMI (Ubuntu 18.04) Version 40.0
I've tried to change my custom dataset to the coco dataset and to a small subset of the custom one. batch size doesn't seems to matter, CUDA and other drivers seems to work.
The exception is thrown when the batch starts the training process. This is the full stack trace:
Logging results to runs/train/exp66
Starting training for 5 epochs...
Epoch gpu_mem box obj cls total targets img_size
0%| | 0/22 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 533, in <module>
train(hyp, opt, device, tb_writer, wandb)
File "train.py", line 298, in train
pred = model(imgs) # forward
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/yolo.py", line 121, in forward
return self.forward_once(x, profile) # single-scale inference, train
File "/home/ubuntu/yolov5/models/yolo.py", line 137, in forward_once
x = m(x) # run
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/common.py", line 113, in forward
return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/common.py", line 38, in forward
return self.act(self.bn(self.conv(x)))
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
I fixed it using conda, I cloned the pytorch environment one which came with the image and it works perfectly. I still don't know the cause though.