Tags: keras, conv-neural-network, transfer-learning

Keras predict_generator outputs a different number of samples


I'm trying to improve the performance of a transfer-learning model that uses Xception as the pre-trained base by applying data augmentation. The goal is to classify dog breeds. train_tensors and valid_tensors hold the training and validation images, respectively, as NumPy arrays.

from keras.applications.xception import Xception
from keras.preprocessing.image import ImageDataGenerator

model = Xception(include_top=False, weights="imagenet")


datagen = ImageDataGenerator(zoom_range=0.2, 
                             horizontal_flip=True, 
                             width_shift_range = 0.2, 
                             height_shift_range = 0.2,
                             fill_mode = 'nearest',
                             rotation_range = 45)
batch_size = 32

bottleneck_train = model.predict_generator(datagen.flow(train_tensors,
                                                        train_targets,
                                                        batch_size=batch_size),
                                           train_tensors.shape[0] // batch_size)

bottleneck_valid = model.predict_generator(datagen.flow(valid_tensors,
                                                        valid_targets,
                                                        batch_size=batch_size),
                                           valid_tensors.shape[0] // batch_size)



print(train_tensors.shape)
print(bottleneck_train.shape)

print(valid_tensors.shape)
print(bottleneck_valid.shape)

However, the output of the last four print statements is:

(6680, 224, 224, 3)
(6656, 7, 7, 2048)
(835, 224, 224, 3)
(832, 7, 7, 2048)

The predict_generator function returns a different number of samples than was provided to it. Are samples being skipped or left out?


Solution

  • Yes, some samples are being left out. This happens because 6680 and 835 are not exactly divisible by 32 (your batch size): with a step count of samples // batch_size, predict_generator processes only that many full batches and the remainder is dropped. You could choose a batch size that divides both sample counts exactly.

    Or you can adjust the code to include one additional (slightly smaller) final batch by computing the step count with Python's math.ceil function:

    import math

    bottleneck_train = model.predict_generator(datagen.flow(train_tensors,
                                                            train_targets,
                                                            batch_size=batch_size),
                                               math.ceil(train_tensors.shape[0] / batch_size))

    bottleneck_valid = model.predict_generator(datagen.flow(valid_tensors,
                                                            valid_targets,
                                                            batch_size=batch_size),
                                               math.ceil(valid_tensors.shape[0] / batch_size))
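To see why the ceil-based step count recovers every sample, here is a small arithmetic sketch using the question's numbers (6680 training samples, batch size 32); the variable names are illustrative, not from the original code:

```python
import math

n_samples = 6680   # training set size from the question
batch_size = 32

# Floor division keeps only full batches: 208 batches of 32 = 6656 samples,
# which matches the 6656 reported by bottleneck_train.shape.
floor_steps = n_samples // batch_size
print(floor_steps, floor_steps * batch_size)   # 208 6656 -> 24 samples skipped

# math.ceil adds one extra step; its final batch holds only the remainder,
# so every sample is seen exactly once.
ceil_steps = math.ceil(n_samples / batch_size)
full = (ceil_steps - 1) * batch_size           # 208 full batches
last = n_samples - full                        # final batch of 24
print(ceil_steps, full + last)                 # 209 6680 -> nothing skipped
```

This works because datagen.flow cycles through the data per epoch and yields a shorter final batch, so asking for ceil(samples / batch_size) steps covers the whole array.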