I am getting a resource exhausted error when initiating training for my object detection Tensorflow 2.5 GPU model. I am using 18 training images and 3 test images. The pre-trained model I am using is the Faster R-CNN ResNet101 V1 640x640 model from the Tensorflow 2.2 model zoo. I am using an Nvidia RTX 2070 with 8 GB of dedicated memory to train my model.
The thing I am confused about is why the training process is taking up so much memory from my GPU when the training set is so small. This is the summary of GPU memory I get along with the error:
Limit: 6269894656
InUse: 6103403264
MaxInUse: 6154866944
NumAllocs: 4276
MaxAllocSize: 5786902272
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
I also decreased the batch size of the training data to 6, and of the testing data to 1.
Max memory usage during training is impacted by several factors, and reducing the batch size is typically the first way to address memory constraints. Alexandre Leobons Souza's recommendation may help as well by giving Tensorflow more flexibility in allocating memory, but if you continue to see OOM errors, then I would recommend reducing the batch size further. Alternatively, you could try limiting the trainable variables in the model, which will also result in lower memory usage during training.
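If it helps, here is a minimal sketch of the allocation-flexibility idea, assuming the recommendation referenced above is Tensorflow's memory growth setting; it has to run before anything is placed on the GPU:

```python
import tensorflow as tf

# Allocate GPU memory as needed instead of reserving (almost) all of it
# up front. This must run before any tensors are placed on the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPU has already been initialized at this point.
        print(e)
```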
You mentioned, "The thing I am confused about is why the training process is taking up so much memory from my GPU when the training set is so small.". Something to keep in mind is that during training, your training data will be used in a forward pass through the model and then you will calculate gradients for each trainable variable in a backwards pass. Even if your training data is small, the intermediary calculations (including the gradients) require memory. These calculations scale linearly with respect to your batch size and the model size. By reducing batch size or by reducing the number of trainable variables, training will require less memory.
One other suggestion: if the shape of your input tensor changes across training examples (e.g. if the number of ground truth bounding boxes goes from 1 to 2 and you are not padding the input tensor), this can cause Tensorflow to retrace the computation graph during training and you will see warnings. I'm not certain of the impact on memory in this case, but I suspect that each retrace effectively requires a duplicate model in memory. If this is the case, you can try using @tf.function(experimental_relax_shapes=True).
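As a rough sketch of what that decorator looks like on a custom train step (the model, loss, and step body below are placeholders, not the Object Detection API's actual training loop):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(4)])  # placeholder model
optimizer = tf.keras.optimizers.Adam()

# experimental_relax_shapes=True lets TensorFlow reuse a more general traced
# function when only tensor shapes change (e.g. a varying number of ground
# truth boxes), instead of retracing the graph for every new shape.
@tf.function(experimental_relax_shapes=True)
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = tf.reduce_mean(tf.square(predictions - labels))  # placeholder loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```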