Tags: tensorflow, keras, object-detection, tensorflow-ssd

Training in SSD implementation in Keras halts after a few iterations without any output or error


After a few iterations of the first epoch, the training process halts without any output or error message. The SSD implementation in Keras used here is from https://github.com/rykov8/ssd_keras

    base_lr = 3e-4
    # optim = keras.optimizers.Adam(lr=base_lr)
    optim = keras.optimizers.RMSprop(lr=base_lr)
    # optim = keras.optimizers.SGD(lr=base_lr, momentum=0.9, decay=decay, nesterov=True)
    model.compile(optimizer=optim,
                  loss=MultiboxLoss(NUM_CLASSES+1, neg_pos_ratio=2.0).compute_loss)

    nb_epoch = 10
    history = model.fit_generator(gen.generate(True), gen.train_batches,
                                  nb_epoch, verbose=1,
                                  callbacks=None,
                                  validation_data=gen.generate(False),
                                  nb_val_samples=gen.val_batches,
                                  nb_worker=1)

The output of the program is as follows:

    Epoch 1/10
    /home/deepesh/Documents/ssd_traffic/ssd_utils.py:119: RuntimeWarning: divide by zero encountered in log
      assigned_priors_wh)
    2017-10-15 18:00:53.763886: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:02.602807: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:03.831092: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:03.831138: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:04.774444: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:05.897872: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:05.897923: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:09.133494: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:09.133541: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    2017-10-15 18:01:11.266114: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
    13/14 [==========================>...] - ETA: 9s - loss: 2.9617

There is no output or error message after that.


Solution

  • You are running out of GPU memory. Things you can do to address the problem:

    • reduce the batch size (see the sketch after this list)
    • reduce the size of the training data
    • train your model in the cloud (AWS, Google Cloud, etc.)
    • use another GPU card with more memory
    • or train on the CPU instead
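
For the first suggestion, here is a minimal sketch. It assumes the setup from the rykov8/ssd_keras training notebook, where the batch size is the third positional argument of the repo's `Generator`; treat that signature and the variable names (`gt`, `bbox_util`, `path_prefix`, `train_keys`, `val_keys`, `input_shape`) as assumptions and adapt them to your own script.

    # Hypothetical sketch: lower the batch size fed to fit_generator.
    # The Generator call mirrors the rykov8/ssd_keras training notebook;
    # the exact signature and variable names are assumptions, not verified.
    batch_size = 4  # was e.g. 16 or 32; smaller batches need less GPU memory
    gen = Generator(gt, bbox_util, batch_size,
                    path_prefix, train_keys, val_keys,
                    (input_shape[0], input_shape[1]))

    history = model.fit_generator(gen.generate(True), gen.train_batches,
                                  nb_epoch, verbose=1,
                                  validation_data=gen.generate(False),
                                  nb_val_samples=gen.val_batches,
                                  nb_worker=1)

If a batch size of 4 fits in memory, you can raise it gradually until the out-of-memory warnings reappear.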