python-3.x scikit-learn tensorflow2.0 tf.keras k-fold

Converting ImageDataGenerator.flow_from_directory output to a dataset that can be used with KFold


I am trying to use cross-validation for a model that classifies images into 3 classes. I use the following code to import the images:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
data = train_datagen.flow_from_directory(directory=train_path,
                                         target_size=(300, 205), batch_size=8,
                                         color_mode='grayscale', class_mode='categorical')

Training and testing the model worked fine until I tried to use KFold from sklearn.model_selection. All the examples I can find online use plain NumPy arrays, whereas my data is a DirectoryIterator (the return type of flow_from_directory) that yields images together with their class labels. I could not find a way to convert it into arrays that can be passed to the kfold.split function.

I tried the following approaches (please bear in mind that I am new to classification models):

from sklearn.model_selection import KFold

np_data = data.next()  # one batch: a tuple (images, labels)

num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True)
for train, test in kfold.split(np_data):
    ...

Then I get: ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=2.

I believe I get this ValueError because np_data holds two nested arrays, the first with the images and the second with their classes, so KFold counts only 2 samples.
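
The behaviour can be reproduced without any image files. The following is a minimal sketch, assuming zero-filled stand-in arrays with the shapes implied by target_size=(300, 205), batch_size=8 and 3 classes; the names tuple_error, train_idx and test_idx are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for one batch from the iterator: 8 grayscale images and
# one-hot labels for 3 classes (shapes are assumptions, not real data).
x = np.zeros((8, 300, 205, 1), dtype=np.float32)
y = np.eye(3, dtype=np.float32)[np.arange(8) % 3]
np_data = (x, y)  # a 2-element tuple, like the result of data.next()

kfold = KFold(n_splits=5, shuffle=True)

# Splitting the tuple: KFold takes its length (2) as the number of samples.
try:
    next(kfold.split(np_data))
    tuple_error = None
except ValueError as err:
    tuple_error = str(err)
print(tuple_error)

# Splitting the image array itself: 8 samples, so 5 folds are possible.
train_idx, test_idx = next(kfold.split(x, y))
print(len(train_idx), len(test_idx))
```

The split only returns index arrays, so the same indices can be applied to both the images and the labels.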

I could shuffle and split only the images, but without knowing which class each image belongs to I cannot train my model properly. I tried following the guide in this link, but their training and test data seem to be imported differently from mine. I then also came across this, but it did not help with my situation either.

I have no idea what I am missing; any additional help will be much appreciated.

Lastly, I tried:

x, y = data.next()
for train, test in kfold.split(x, y):
    ...

This gives me the following error when it begins the first epoch of the first fold:

ValueError: No gradients provided for any variable: ['conv2d/kernel:0', 'conv2d/bias:0', 'conv2d_1/kernel:0', 'conv2d_1/bias:0', 'conv2d_2/kernel:0', 'conv2d_2/bias:0', 'conv2d_3/kernel:0', 'conv2d_3/bias:0', 'dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0'].


Solution

  • I got the last ValueError because I did not pass the labels (y[train]) to model.fit(). The following worked fine for me.

    After importing the images with ImageDataGenerator.flow_from_directory(...), x, y = data.next() returns one batch of images and their one-hot labels as the arrays x and y, so batch_size has to be large enough to cover every image you want to split. Then:

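    kfold = KFold(n_splits=num_folds, shuffle=True)

    fold_no = 1
    for train, test in kfold.split(x, y):
        # Build a fresh model for every fold so no weights carry over
        model = keras.models.Sequential(.....)
        # Pass the labels together with the images; leaving them out caused the error above
        model.fit(x[train], y[train], epochs=epochs)
        ...
        # Evaluate on the held-out fold: scores[0] is the loss, scores[1] the accuracy
        scores = model.evaluate(x[test], y[test], verbose=0)
        ...
        fold_no = fold_no + 1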
    

    I also used this print statement, placed inside the loop before fold_no is incremented, to keep track of the scores:

    print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    

    Additionally, the loss and accuracy of each fold can be stored in two separate lists and averaged once all folds are done.

    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])
    

    The two lines above go inside the for loop (for train, test in kfold.split(x, y):), with acc_per_fold and loss_per_fold created as empty lists before it; the lines below go after the loop.

    print("\n\n Overall accuracy: " + str(np.average(acc_per_fold)))
    print("Overall loss: " + str(np.average(loss_per_fold)))
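
Because data.next() returns a single batch, x and y above contain at most batch_size images. One way around this is to gather every batch before splitting. The following is a rough sketch, assuming the iterator supports len() and integer indexing (as Keras Sequence-style iterators generally do); collect_batches is a hypothetical helper, and the random fake_batches stand in for the real DirectoryIterator.

```python
import numpy as np
from sklearn.model_selection import KFold


def collect_batches(batches):
    """Concatenate every (images, labels) batch into two arrays.

    Hypothetical helper: assumes `batches` supports len() and integer
    indexing, and that each item is an (images, labels) pair.
    """
    xs, ys = [], []
    for i in range(len(batches)):
        x_batch, y_batch = batches[i]
        xs.append(x_batch)
        ys.append(y_batch)
    return np.concatenate(xs), np.concatenate(ys)


# Fake data standing in for the iterator: 3 batches (8 + 8 + 4 images)
# of small grayscale images with one-hot labels for 3 classes.
rng = np.random.default_rng(0)
fake_batches = [
    (rng.random((n, 30, 20, 1)), np.eye(3)[rng.integers(0, 3, size=n)])
    for n in (8, 8, 4)
]

x, y = collect_batches(fake_batches)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train, test in kfold.split(x, y):
    # x[train], y[train] would go to model.fit;
    # x[test], y[test] would go to model.evaluate.
    fold_sizes.append((len(train), len(test)))
print(x.shape, y.shape, fold_sizes)
```

With the real iterator the call would be collect_batches(data). This loads every image into memory at once, so it is only practical for datasets that fit in RAM.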