Tags: gpu, object-detection-api, tensorflow2.x

TensorFlow Object Detection API GPU memory issue


I'm currently trying to train a model from the detection model zoo for object detection. Running the setup on the CPU works as expected, but running the same on my GPU results in the following error.

2021-03-10 11:46:54.286051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:46:54.751423: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764147: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764233: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support

Monitoring the GPU in Task Manager, it seems that TensorFlow (as expected, as far as I understand it) tries to allocate the whole GPU memory. Shortly after reaching a peak of roughly 7.3 GB of the 8 GB, TF crashes with the error shown in the snippet above.

Solutions for this specific error found on the internet / Stack Overflow suggest that the problem can be solved by allowing dynamic memory growth. Doing this seems to work and TF manages to create at least one new checkpoint, but it eventually crashes with an error of a similar category, in this case CUDA_OUT_OF_MEMORY.
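
For reference, this is roughly how I enabled memory growth (a minimal sketch using the standard TF 2.x config API; it has to run before any GPU operation creates a context):

import tensorflow as tf

# Has to run before the first GPU op, e.g. at the top of the training script
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    # Allocate VRAM on demand instead of reserving (almost) all 8 GB up front
    tf.config.experimental.set_memory_growth(gpu, True)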

System Information:

  • Ryzen 5
  • 16 GB RAM
  • RTX 2060 Super with 8 GB VRAM

Training Setup:

  • TensorFlow 2.4
  • CUDA 11.0 (also tried several combinations of CUDA cuDNN versions)
  • cuDNN 8.0.4

Originally I wanted to use the pretrained EfficientDet D6 model, but I also tried several others such as EfficientDet D4, CenterNet HourGlass 512x512 and SSD MobileNet V2 FPNLite. All of these models were started with different batch sizes, but even with a batch size of 1 the problem still occurs. The training images aren't large either (on average 600 x 800). Currently there are a total of 30 images, 15 per class, for training (I'm aware that the training data set should be bigger, but it's just to test the setup).
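
(For completeness, a sketch of how I lowered the batch size via the Object Detection API's config_util helpers instead of editing pipeline.config by hand; the paths below are placeholders for my local setup:)

from object_detection.utils import config_util

# Placeholder paths for the local model directory and its pipeline.config
MODEL_DIR = 'models/efficientdet_d6'
PIPELINE_CONFIG = MODEL_DIR + '/pipeline.config'

# Load the config, drop the batch size to the minimum and write it back
configs = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG)
configs['train_config'].batch_size = 1
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, MODEL_DIR)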

Now my question would be whether anybody has an educated guess or another approach for finding the cause of this error, as I cannot imagine that my 2060 isn't capable of at least training an SSD with a batch size of 1 and rather small images. Could it be a hardware fault? If so, is there a way to check that?
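
For example, would a minimal test like the following (just a sketch, sizes chosen arbitrarily) be enough to rule out a hardware problem, given that cuBLAS initialization is exactly where the training run fails?

import tensorflow as tf

# Smoke test outside the Object Detection API: a matmul forces cuBLAS to
# initialize, which is where the training run reports the failure
print(tf.config.list_physical_devices('GPU'))
with tf.device('/GPU:0'):
    a = tf.random.normal([2048, 2048])
    b = tf.random.normal([2048, 2048])
    print(tf.reduce_sum(tf.matmul(a, b)).numpy())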


Solution

  • I've done a complete reinstallation of every component involved. I might have done something differently this time, but I can't say what. At least I'm now able to utilize the GPU for training.