
Is it normal to get an ETA of 6:43:26 to complete the first epoch?


I have created the VGG16-based CNN below and I want to train it for 50 epochs, but Keras reports an ETA of nearly 7 hours (ETA: 6:43:26) for the first epoch. Could anyone please tell me whether this is normal with 209222 training images and 40000 validation images (DeepFashion dataset)? Or is there an issue with my steps_per_epoch? I use an HPC with 16 workers to train this model.

  train_gen = ImageDataGenerator(rescale=1./255)

  val_gen = ImageDataGenerator(rescale=1./255)

  train_batches = train_gen.flow_from_directory(train_path,
          target_size=(img_r, img_c),
          batch_size=batch_size,
          class_mode='categorical',
          shuffle=True)
          
  val_batches = val_gen.flow_from_directory(validation_path,
          target_size=(img_r, img_c),
          batch_size=batch_size_val,
          class_mode='categorical',
          shuffle=False)
  
  return train_batches, val_batches



def fit_model(model, batches, val_batches):

    print("started model training")
    history = model.fit(train_batches,
                                  steps_per_epoch = 209222/32,
                                  epochs = 50,
                                  validation_data= val_batches,
                                  validation_steps=40000/32,
                                  verbose=1,
                                  use_multiprocessing=True,
                                  workers=16
                                  )

This is the model part:

def create_model(input_shape, output_classes):
    logging.debug('input_shape {}'.format(input_shape))
    logging.debug('input_shape {}'.format(type(input_shape)))
    
    #optimizer_mod = keras.optimizers.SGD(lr=0.001, momentum=momentum, decay=decay, nesterov=False)
    
    vgg16 = VGG16(weights='imagenet',include_top=False)
  
    for layer in vgg16.layers[:15]:
        layer.trainable = False
    
    x= vgg16.get_layer('block4_conv3').input
    x = vgg16.get_layer('block4_conv3')(x)
  
    if True:
        x = Reshape([28*28,512])(x)
        att = MultiHeadsAttModel(l=28*28, d=512 , dv=64, dout=512, nv = 8 )
        x = att([x,x,x])
        x = Reshape([28,28,512])(x)   
        x = BatchNormalization()(x)
        
    #x = vgg16.get_layer('block5_conv1')(x)
    #x = vgg16.get_layer('block5_conv2')(x)
    #x = vgg16.get_layer('block5_conv3')(x)
    #x = vgg16.get_layer('block5_pool')(x)
    
    x = Flatten()(x)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(output_classes, activation='softmax')(x)
    
    
    model =tf.keras.Model(inputs=vgg16.input, outputs=outputs)
    
    top3_acc = functools.partial(keras.metrics.top_k_categorical_accuracy, k=3)
    top3_acc.__name__ = 'top3_acc' 
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    
    model.compile(
                  optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy',top3_acc]) 

    return model

Solution

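  • If you are using VGG16 with the ImageNet weights, the images should be
    preprocessed the same way they were when the model was trained. For the
    Keras VGG16 that means converting RGB to BGR and subtracting the ImageNet
    channel means, with no scaling, so instead of rescale=1./255 use

    preprocessing_function=tf.keras.applications.vgg16.preprocess_input

    That will not solve your long first epoch, however.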
    For steps_per_epoch and validation_steps, pass integers rounded up so
    that the final partial batch is included:

    steps_per_epoch = math.ceil(209222 / 32)
    validation_steps = math.ceil(40000 / 32)

    I suspect that will not solve the problem either.
    Each training epoch requires 6539 steps and each validation pass
    requires 1250 steps. That is a lot.
    Now the processing time will be greatly dependent on the image size. 
    What values did you use?
    Also, VGG16 is computationally intensive to begin with: the full model
    has about 138 million parameters, and the convolutional base alone has
    about 14.7 million. I would recommend using the MobileNet model, which
    has on the order of 4 million parameters and is about as accurate.
    As noted by Edwin Cheong above, you need to check whether your GPU is
    being used, for example by checking that
    tf.config.list_physical_devices('GPU') returns a non-empty list.
    I suspect it is not.
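
To put numbers on the step counts and the reported ETA, here is a minimal sketch in plain Python. It assumes a batch size of 32 for both generators (the value hard-coded in the question's fit() call) and treats the ETA of 6:43:26 as roughly the time for the whole training part of the epoch. The variable names are only illustrative.

```python
import math

# Figures taken from the question.
num_train_images = 209222
num_val_images = 40000
batch_size = 32  # assumption: same batch size for training and validation

# Round up so the final partial batch is counted, without adding an
# extra step when the division happens to be exact.
steps_per_epoch = math.ceil(num_train_images / batch_size)
validation_steps = math.ceil(num_val_images / batch_size)

# Convert the reported ETA (6:43:26) to seconds, then divide by the
# number of training steps to estimate the time spent per batch.
eta_seconds = 6 * 3600 + 43 * 60 + 26
seconds_per_step = eta_seconds / steps_per_epoch
seconds_per_image = seconds_per_step / batch_size

print("steps_per_epoch:", steps_per_epoch)
print("validation_steps:", validation_steps)
print("seconds per step:", round(seconds_per_step, 2))
print("seconds per image:", round(seconds_per_image, 3))
```

Under these assumptions the estimate comes out at roughly 3.7 seconds per batch of 32 images. That figure is only a rough indicator, but it gives you a baseline to compare against once you have confirmed whether the GPU is actually being used.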