Tags: python, pytorch

Pytorch CUDA out of memory despite plenty of memory left


I am training a Hugging Face model using their Trainer Python module. To be fair, I have refactored my code a bit, but a very similar version was working perfectly with much larger datasets than the one I am supplying right now, and with a higher per_device_train_batch_size (now 8 and still crashing; 16 used to work).

However, I am getting an out-of-memory error, which is pretty weird...

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 1.54 GiB already allocated; 5.06 GiB free; 1.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

That error is what got me wondering: it's trying to allocate only 20.00 MiB while 5.06 GiB is seemingly free, so why does it crash?
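One thing the error message itself suggests trying: with a fragmented caching allocator, even a 20 MiB allocation can fail despite gigabytes being nominally free. A minimal sketch of the suggested workaround (the 128 MiB value is an arbitrary starting point, not a recommendation from the error message):

```python
import os

# Must be set before torch initializes CUDA (ideally before `import torch`).
# max_split_size_mb caps the size of cached blocks the allocator will split,
# which can reduce fragmentation; 128 here is an assumed example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same setting can be exported in the shell before launching the training script instead of being set from Python.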

My PyTorch version is '1.12.1+cu113', and running torch.version.cuda reports 11.3.
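For anyone debugging something similar, a quick diagnostic sketch that collects the version info above in one place (the attributes are standard torch ones; the try/except just lets the snippet degrade gracefully if torch isn't installed):

```python
def cuda_report():
    """Gather PyTorch/CUDA/cuDNN version info into a dict."""
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__              # e.g. '1.12.1+cu113'
        info["cuda"] = torch.version.cuda              # e.g. '11.3'
        info["cuda_available"] = torch.cuda.is_available()
        info["cudnn"] = torch.backends.cudnn.version()
    except ImportError:
        info["error"] = "torch not installed"
    return info

print(cuda_report())
```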

Thanks for all the help


Solution

  • My guess is that your CUDA driver is not set up correctly.

    As a prerequisite, you also need to have these two installed correctly: NVIDIA CUDA 11.0 or above, and NVIDIA cuDNN v7 or above.
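One way to sanity-check the driver side of that setup: `nvidia-smi` reports the installed driver version and the maximum CUDA version it supports. A hedged sketch (the guards make it return None on machines without the NVIDIA tools on PATH, rather than crashing):

```python
import shutil
import subprocess

def driver_check():
    """Return the `nvidia-smi` line reporting driver and CUDA versions,
    or None if the NVIDIA driver tools are not on PATH."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    out = subprocess.run([exe], capture_output=True, text=True).stdout
    for line in out.splitlines():
        # The header line contains both "Driver Version" and "CUDA Version".
        if "CUDA Version" in line:
            return line.strip()
    return None

print(driver_check())
```

If the CUDA version printed here is lower than what the installed PyTorch wheel was built for (cu113 here), that mismatch is a plausible cause of odd runtime failures.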