python, tensorflow, keras, deep-learning, efficientnet

transfer learning - trying to retrain EfficientNet-B7 on an RTX 2070 - out of memory


This is the training code I am trying to run. It works on a machine with 64 GB of RAM (CPU only), but crashes on an RTX 2070:

# imports (omitted in the original post, inferred from the code and traceback below)
import tensorflow as tf
import efficientnet.keras as efn
from keras.layers import Dense
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
tf.keras.backend.set_session(tf.Session(config=config))

model = efn.EfficientNetB7()
model.summary()

# create new output layer
output_layer = Dense(5, activation='sigmoid', name="retrain_output")(model.get_layer('top_dropout').output)
new_model = Model(model.input, outputs=output_layer)
new_model.summary()
# freeze the pre-trained weights (layers 0-227)
for i, l in enumerate(new_model.layers):
    if i < 228:
        l.trainable = False

new_model.compile(loss='mean_squared_error', optimizer='adam')

batch_size = 5
samples_per_epoch = 30
epochs = 20

# generate train data
train_datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0)

train_generator = train_datagen.flow_from_directory(
    train_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019,
    subset='training')

validation_generator = train_datagen.flow_from_directory(
    validation_data_input_folder,
    target_size=(input_dim, input_dim),
    batch_size=batch_size,
    class_mode='categorical',
    seed=2019,
    subset='validation')

new_model.fit_generator(
    train_generator,
    samples_per_epoch=samples_per_epoch,
    epochs=epochs,
    validation_steps=20,
    validation_data=validation_generator,
    nb_worker=24)

new_model.save(model_output_path)



exception:

2019-11-17 08:52:52.903583: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
.... ...
2019-11-17 08:53:24.713020: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 110 Chunks of size 27724800 totalling 2.84GiB
2019-11-17 08:53:24.713024: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 6 Chunks of size 38814720 totalling 222.10MiB
2019-11-17 08:53:24.713027: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 23 Chunks of size 54000128 totalling 1.16GiB
2019-11-17 08:53:24.713031: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 73760000 totalling 70.34MiB
2019-11-17 08:53:24.713034: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.45GiB
2019-11-17 08:53:24.713040: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:        5856749158
InUse:        5848048896
MaxInUse:     5848061440
NumAllocs:          6140
MaxAllocSize: 3259170816

2019-11-17 08:53:24.713214: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-11-17 08:53:24.713232: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/naort/Desktop/deep-learning-data-preparation-tools/EfficientNet-Transfer-Learning-Boiler-Plate/model_retrain.py", line 76, in <module>
    nb_worker=24)
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/naort/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5,1344,38,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/AddN_387-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Solution

  • Despite the EfficientNet models having lower parameter counts than comparable ResNe(X)t models, they still consume significant amounts of GPU memory. What you're seeing is an out-of-memory error on your GPU (8 GB for an RTX 2070), not in system RAM (64 GB).
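
    To confirm that it is the GPU running out of memory, you can list the devices TensorFlow sees and the memory each one was granted. This is a TF 1.x-style check matching the ConfigProto/Session usage in the question, added here purely as an illustration:

    from tensorflow.python.client import device_lib

    for d in device_lib.list_local_devices():
        # memory_limit is reported in bytes; the RTX 2070 shows a bit under 8 GiB,
        # while the 64 GB of system RAM belongs to the CPU device
        print(d.name, d.device_type, round(d.memory_limit / 1024 ** 3, 2), "GiB")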

    A B7 model, especially at full resolution, is beyond what you'd want to use for training on a single RTX 2070, even if you freeze most of the layers.
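
    To see what freezing actually buys you, you can compare the trainable and frozen parameter counts of the model built in the question (a small illustrative check, not part of the original answer). Freezing removes the weight gradients and optimizer state for those layers, but the per-layer activations needed for the forward and backward passes still have to fit on the GPU:

    import keras.backend as K

    trainable = sum(K.count_params(w) for w in new_model.trainable_weights)
    frozen = sum(K.count_params(w) for w in new_model.non_trainable_weights)
    print("trainable params:", trainable, "frozen params:", frozen)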

    Something that may help is running the model in FP16, which will also leverage the Tensor Cores of your RTX card. Following https://medium.com/@noel_kennedy/how-to-use-half-precision-float16-when-training-on-rtx-cards-with-tensorflow-keras-d4033d59f9e4, try this:

    import keras.backend as K
    
    dtype='float16'
    K.set_floatx(dtype)
    
    # The default epsilon of 1e-7 is too small for float16. Without adjusting it,
    # we will get NaN predictions because of divide-by-zero problems.
    K.set_epsilon(1e-4)
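
    Note that the float16 switch has to run before the model is built; otherwise the layers have already been created with float32 weights. A minimal sketch of how it would slot into the training script above (assuming the same standalone keras and efficientnet packages as in the question):

    import keras.backend as K
    import efficientnet.keras as efn
    from keras.layers import Dense
    from keras.models import Model

    K.set_floatx('float16')  # must happen before any layer is instantiated
    K.set_epsilon(1e-4)      # larger epsilon to avoid NaN from divide-by-zero in float16

    base = efn.EfficientNetB7()  # weights are now created as float16
    output_layer = Dense(5, activation='sigmoid', name='retrain_output')(
        base.get_layer('top_dropout').output)
    new_model = Model(base.input, outputs=output_layer)
    new_model.compile(loss='mean_squared_error', optimizer='adam')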