Tags: python, tensorflow, keras, object-detection

Odd shape in tensor while training & ResourceExhaustedError: OOM when allocating tensor


I'm trying to run object detection using this GitHub repo, which leverages a simple 7-layer Single Shot MultiBox Detector. I ran it on Google Colab with keras==2.2.4 and tensorflow-gpu==1.13.1.

Eventually I ran into the error below while training. What also puzzles me is the shape of the tensor that caused the crash, [2, 1232, 1640, 48], where...

  • 2 is the batch size
  • 1232 is half of the width (oddly enough)
  • 1640 is half of the height (oddly enough)
  • I'm not sure where the 48 comes from at all
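
For reference, here is my own back-of-envelope check (not part of the repo) of how much memory a single float32 tensor of that shape needs:

    import numpy as np

    # One float32 tensor with the shape reported in the error message.
    shape = (2, 1232, 1640, 48)
    size_gib = np.prod(shape) * 4 / 1024**3   # 4 bytes per float32 element
    print(round(size_gib, 2))                 # ~0.72 GiB for this one tensor

Here is the full traceback: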
Epoch 1/5

---------------------------------------------------------------------------

ResourceExhaustedError                    Traceback (most recent call last)

<ipython-input-28-3fbd9e60a593> in <module>()
     19 
     20                               max_queue_size=1,
---> 21                               workers=0)

7 frames

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419 
   1420     @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215                 outs = model.train_on_batch(x, y,
    216                                             sample_weight=sample_weight,
--> 217                                             class_weight=class_weight)
    218 
    219                 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219 

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1437           ret = tf_session.TF_SessionRunCallable(
   1438               self._session._session, self._handle, args, status,
-> 1439               run_metadata_ptr)
   1440         if run_metadata:
   1441           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

ResourceExhaustedError: OOM when allocating tensor with shape[2,1232,1640,48] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training/Adam/gradients/zeros_22-0-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node loss/add_14}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Could someone clarify what's happening and how to get around it? I can also share more details about the model structure if that would help pin down the problem.


Solution

  • If you have data with shape

    (2, 1232, 1640, 3)

    then after it passes through a convolutional layer with 48 filters and "same" padding, it will have shape

    (2, 1232, 1640, 48)

    and there is simply no room for a tensor that size on your GPU.

    I looked in the repo; there are several layers with 48 filters. conv2, for example, is the likely culprit: the 48 in your tensor shape is its filter count, and the halved height and width you noticed come from the max-pooling layer (pool1) that feeds it.

    # conv2 keeps the spatial size (padding="same") and outputs 48 channels;
    # with batch size 2, that matches the [2, 1232, 1640, 48] in the error.
    conv2 = Conv2D(48, (3, 3), strides=(1, 1), padding="same",
                   kernel_initializer='he_normal',
                   kernel_regularizer=l2(l2_reg), name='conv2')(pool1)
    conv2 = BatchNormalization(axis=3, momentum=0.99, name='bn2')(conv2)
    conv2 = ELU(name='elu2')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2), name='pool2')(conv2)
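
    A quick way to see this for yourself is to rebuild just the front of the network and print the shapes (a minimal sketch with plain Keras 2.x, matching the versions you listed; the 2464×3280 input size and conv1's exact settings are my assumptions, reconstructed from the halved dimensions in your error):

    from keras.models import Model
    from keras.layers import Input, Conv2D, MaxPooling2D

    # Front of the network only, to check shapes. The 2464 x 3280 input size
    # and conv1's filter count are guesses; the point is what "same" padding
    # and 2x2 pooling do to the tensor shape.
    x_in = Input(shape=(2464, 3280, 3))
    conv1 = Conv2D(32, (5, 5), padding='same', name='conv1')(x_in)
    pool1 = MaxPooling2D(pool_size=(2, 2), name='pool1')(conv1)
    conv2 = Conv2D(48, (3, 3), padding='same', name='conv2')(pool1)
    Model(x_in, conv2).summary()   # conv2 output shape: (None, 1232, 1640, 48)

    With batch size 2, that one activation is already about 0.72 GiB of float32, and training keeps many tensors of that order alive at once (every layer's activations, their gradients, Adam's moment estimates), which is what exhausts the GPU. Since the batch size is already down to 2, the practical way around this is to downscale the images before feeding them to the network.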