I have created the VGG16-based CNN below and I want to train it for 50 epochs, but Keras estimates nearly 7 hours (ETA: 6:43:26) to complete the first epoch. Could anyone please tell me whether this is normal with 209,222 training images and 40,000 validation images (the DeepFashion dataset), or is there an issue with my steps_per_epoch? I am training this model on an HPC with 16 workers.
train_gen = ImageDataGenerator(rescale=1./255)
val_gen = ImageDataGenerator(rescale=1./255)
train_batches = train_gen.flow_from_directory(train_path,
                                              target_size=(img_r, img_c),
                                              batch_size=batch_size,
                                              class_mode='categorical',
                                              shuffle=True)
val_batches = val_gen.flow_from_directory(validation_path,
                                          target_size=(img_r, img_c),
                                          batch_size=batch_size_val,
                                          class_mode='categorical',
                                          shuffle=False)
return train_batches, val_batches
def fit_model(model, batches, val_batches):
    print("started model training")
    history = model.fit(train_batches,
                        steps_per_epoch=209222/32,
                        epochs=50,
                        validation_data=val_batches,
                        validation_steps=40000/32,
                        verbose=1,
                        use_multiprocessing=True,
                        workers=16)
This is the model part:
def create_model(input_shape, output_classes):
    logging.debug('input_shape {}'.format(input_shape))
    logging.debug('input_shape {}'.format(type(input_shape)))
    # optimizer_mod = keras.optimizers.SGD(lr=0.001, momentum=momentum, decay=decay, nesterov=False)

    vgg16 = VGG16(weights='imagenet', include_top=False)
    for layer in vgg16.layers[:15]:
        layer.trainable = False

    x = vgg16.get_layer('block4_conv3').input
    x = vgg16.get_layer('block4_conv3')(x)

    if True:
        x = Reshape([28*28, 512])(x)
        att = MultiHeadsAttModel(l=28*28, d=512, dv=64, dout=512, nv=8)
        x = att([x, x, x])
        x = Reshape([28, 28, 512])(x)
        x = BatchNormalization()(x)

    # x = vgg16.get_layer('block5_conv1')(x)
    # x = vgg16.get_layer('block5_conv2')(x)
    # x = vgg16.get_layer('block5_conv3')(x)
    # x = vgg16.get_layer('block5_pool')(x)

    x = Flatten()(x)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(output_classes, activation='softmax')(x)

    model = tf.keras.Model(inputs=vgg16.input, outputs=outputs)

    top3_acc = functools.partial(keras.metrics.top_k_categorical_accuracy, k=3)
    top3_acc.__name__ = 'top3_acc'

    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(
        optimizer=opt,
        loss='categorical_crossentropy',
        metrics=['accuracy', top3_acc])
    return model
If you are using VGG then you should rescale the pixel values to between -1 and +1, as that is how it was trained. Note that ImageDataGenerator's rescale argument can only multiply, so instead of rescale=1./255 use a preprocessing function that maps each pixel value to x / 127.5 - 1, as in the sketch below.
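A minimal sketch, assuming you keep the ImageDataGenerator setup from your question (preprocessing_function replaces rescale; train_gen and val_gen are your generator names):

```
# Map pixel values from [0, 255] to [-1, 1]; rescale alone can only multiply,
# so a preprocessing function is used instead.
def scale_to_pm1(img):
    return img / 127.5 - 1.0

train_gen = ImageDataGenerator(preprocessing_function=scale_to_pm1)
val_gen = ImageDataGenerator(preprocessing_function=scale_to_pm1)
```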
That will not solve your long first-epoch problem, however.
For steps_per_epoch and validation_steps use steps_per_epoch = 209222//32 + 1 and validation_steps = 40000//32 + 1. That will also not solve the problem, I suspect.
Each training epoch will require 6539 steps and each validation pass will require 1251 steps, which is really rather large.
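As a quick sanity check of those step counts (a sketch; math.ceil is an alternative that avoids one extra step when the image count divides evenly by the batch size, as 40000 does by 32):

```
import math

num_train, num_val, batch_size = 209222, 40000, 32

steps_per_epoch = num_train // batch_size + 1    # 6539
validation_steps = num_val // batch_size + 1     # 1251

# ceil-based rounding for comparison:
print(math.ceil(num_train / batch_size))   # 6539
print(math.ceil(num_val / batch_size))     # 1250 (40000 divides evenly by 32)
```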
Now the processing time will depend greatly on the image size. What values did you use for img_r and img_c?
Also, the VGG model has on the order of 40 million trainable parameters, so it is computationally intensive to begin with. I would recommend using the MobileNet model, which has on the order of 4 million parameters and is about as accurate. As noted by Edwin Cheong above, you need to check whether your GPU is actually being used. I suspect it is not.
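If you want to verify the GPU and try the lighter backbone, here is a rough sketch (assumptions: TensorFlow 2.x, MobileNetV2 as the Keras-packaged MobileNet variant, and img_r, img_c, output_classes taken from your code):

```
import tensorflow as tf

# 1) Confirm TensorFlow actually sees a GPU on the HPC node.
gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)   # an empty list means training runs on CPU

# 2) The same kind of classification head on a much smaller backbone.
base = tf.keras.applications.MobileNetV2(weights='imagenet',
                                         include_top=False,
                                         input_shape=(img_r, img_c, 3),
                                         pooling='avg')
base.trainable = False   # start with the pretrained weights frozen

x = tf.keras.layers.Dense(256, activation='relu')(base.output)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(output_classes, activation='softmax')(x)
model = tf.keras.Model(inputs=base.input, outputs=outputs)
```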