TensorFlow recognizes my GPU, but when I train a model on it, RAM usage explodes and the process crashes

The code I'm working with is below. Everything up to the final model.fit call runs without problems:

import tensorflow as tf
data = tf.keras.datasets.fashion_mnist

class myCallback(tf.keras.callbacks.Callback):
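  def on_epoch_end(self, epoch, logs=None):
    # logs can be None or lack 'accuracy'; treat both as "threshold not reached"
    accuracy = (logs or {}).get('accuracy')
    if accuracy is not None and accuracy > 0.99: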
      print("\nReached 99% accuracy so cancelling training!")
      self.model.stop_training = True


callbacks = myCallback()


(training_images, training_labels), (test_images, test_labels) = data.load_data()
print(type(training_images))
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

model.fit(training_images, training_labels, epochs=50, callbacks=[callbacks], verbose=1)

This is example code from the book "AI and Machine Learning for Coders". The model builds and compiles fine, but when I call the fit method:

model.fit(training_images, training_labels, epochs=50, callbacks=[callbacks], verbose=1)

it prints "Epoch 1/50" and then freezes there, making no progress. The last log message shown is:

I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401

RAM usage explodes, and after a while the environment crashes with this error:

Process finished with exit code -1073740791 (0xC0000409)
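
For reference, the negative decimal exit code and the hexadecimal value in parentheses are the same 32-bit number. The short sketch below shows the conversion (the helper name is just for illustration). 0xC0000409 is the Windows status code STATUS_STACK_BUFFER_OVERRUN, which only indicates that native code aborted the process and does not identify the cause on its own.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073740791  # value reported by the IDE above
print(hex(to_unsigned32(exit_code)))
```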

I'm using:

tensorflow==2.9.0
tensorflow-gpu==2.1.0
CUDA   --> v11.7
CUDNN  --> v8.4.1.50
GPU    --> NVidia GeForce GTX 960 4GB
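
Since the list above shows two TensorFlow packages with different versions, it may be worth confirming what the active environment really contains. A minimal sketch using only the standard library; the helper name installed_versions is made up for this example:

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    packages = ["tensorflow", "tensorflow-gpu"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```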

From TensorFlow I can see my GPU with the command:

tf.config.list_logical_devices('GPU')

which gives the following output:

2022-09-10 23:30:04.052968: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


2022-09-10 23:30:04.496787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2810 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 960, pci bus id: 0000:06:00.0, compute capability: 5.2
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]

Please I need help!

Solution

I was able to understand and solve the issue.

The problem was not with the TensorFlow libraries, but with the CUDA and cuDNN versions.

The software requirements for GPU support in the TensorFlow documentation specify the drivers and libraries you need to install. The versions listed there are not suggestions but requirements: in my experience, combinations other than those specified in this document (Software Requirements paragraph) do not work:

https://www.tensorflow.org/install/pip#windows-native

In my case I had to strictly set:

CUDA --> v11.2

CUDNN --> v8.1.0
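
As a rough way to check a setup against those requirements, the sketch below compares an installed version string with the required one, and prints what the installed TensorFlow binary reports about its own build. The helpers parse_version and matches_build are made up for this example. tf.sysconfig.get_build_info() is a real TensorFlow call, but the format of the values it returns varies by platform, so the sketch only prints them and does not try to parse them.

```python
def parse_version(text):
    """Turn '11.2' or 'v8.1.0' into a tuple of ints, e.g. (11, 2)."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def matches_build(installed, required):
    """True if `installed` agrees with `required` on every component `required` gives."""
    want = parse_version(required)
    have = parse_version(installed)
    return have[:len(want)] == want


if __name__ == "__main__":
    # Versions from this post: what was installed vs. what the docs required.
    print("CUDA ok: ", matches_build("11.7", "11.2"))
    print("cuDNN ok:", matches_build("8.4.1.50", "8.1"))

    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        info = tf.sysconfig.get_build_info()
        print("built against CUDA: ", info.get("cuda_version"))
        print("built against cuDNN:", info.get("cudnn_version"))
```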

After reinstalling the libraries, updating the PATH system variable, and restarting the computer, I was able to complete the training process successfully.
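
To check the PATH step, a small sketch like the following lists which PATH directories contain the CUDA and cuDNN DLLs. The DLL names used here (cudart64_110.dll, cudnn64_8.dll) are my assumption for CUDA 11.x and cuDNN 8.x on Windows, and find_on_path is a helper made up for this example; adjust the names if your installation differs.

```python
import os


def find_on_path(filename, path_value):
    """Return every directory listed in `path_value` that contains `filename`."""
    hits = []
    for directory in path_value.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Assumed DLL names for CUDA 11.x / cuDNN 8.x on Windows.
    for dll in ("cudart64_110.dll", "cudnn64_8.dll"):
        hits = find_on_path(dll, os.environ.get("PATH", ""))
        print(dll, "->", hits or "not found on PATH")
```

If a DLL shows up in more than one directory, an older or newer installation may still be on PATH alongside the intended one.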

Hopefully this will help somebody else in my position.

Please also refer to Jeff Heaton's tutorial on this topic:

https://www.youtube.com/watch?v=qrkEYf-YDyI