Tags: python, validation, tensorflow, keras, training-data

Keras ImageDataGenerator validation split not selected from shuffled dataset


How can I randomly split my image dataset into training and validation datasets? More specifically, the validation_split argument in Keras' ImageDataGenerator is not randomly splitting my images into training and validation sets, but is slicing the validation sample from an unshuffled dataset.


Solution

  • When you specify the validation_split argument to Keras' ImageDataGenerator, the split is performed before the data is shuffled, so the validation set is simply the last x fraction of samples. Because files are read in sorted order, those last samples may not be representative of the training data, and training can fail as a result. This is an especially common dead end when your image data is stored in a common directory with each sub-folder named by class: the validation slice then contains only the final class(es). This has been noted in several posts:

    Choose random validation data set

    As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.

    The training accuracy is very high, while the validation accuracy is very low?

    Please check if you have shuffled the data before training. Because validation splitting in Keras is performed before shuffling, you may have chosen an unbalanced dataset as your validation set, hence the low accuracy.

    Does 'validation split' randomly choose validation sample?

    The validation data is picked as the last 10% (for instance, if validation_split=0.1) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously; it has to be the same set from epoch to epoch.
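    As the first quote suggests, one workaround is to shuffle the data yourself before handing it to Keras. A minimal sketch for in-memory arrays (the toy data and tiny model below are hypothetical stand-ins, not from the question):

    ```python
    import numpy as np
    import tensorflow as tf

    # Hypothetical toy dataset: 100 samples, 4 classes, stored sorted by
    # class -- the same ordering a class-per-subfolder layout would produce.
    x = np.random.rand(100, 8).astype("float32")
    y = np.repeat(np.arange(4), 25)

    # Shuffle ONCE before fit: validation_split always slices off the LAST
    # fraction, so without this the validation set would be all one class.
    rng = np.random.default_rng(seed=42)
    idx = rng.permutation(len(x))
    x, y = x[idx], y[idx]

    # A throwaway model just to show the fit call.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # The last 20% of the (now shuffled) arrays becomes the validation set.
    model.fit(x, y, validation_split=0.2, epochs=1, verbose=0)
    ```

    This only helps when your data fits in memory as arrays; for a directory of images, reorganizing the folders (below) is the cleaner fix.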

    That answer points to sklearn's train_test_split() as a solution, but I want to propose a different solution that keeps consistency in the Keras workflow.
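    For completeness, the sklearn route looks roughly like this; a minimal sketch with hypothetical in-memory data, where stratify=y additionally keeps the class proportions equal in both splits:

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical class-sorted data, like files read from class subfolders.
    x = np.arange(100).reshape(-1, 1)
    y = np.repeat(np.arange(4), 25)

    # Random, stratified split: each class appears in the same proportion
    # in the training and validation sets.
    x_train, x_val, y_train, y_val = train_test_split(
        x, y, test_size=0.2, random_state=0, stratify=y)

    print(np.bincount(y_val))  # → [5 5 5 5]
    ```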

    With the split-folders package you can randomly split your main data directory into training, validation, and testing (or just training and validation) directories. The class-specific subfolders are automatically copied.

    The input folder should have the following format:

    input/
        class1/
            img1.jpg
            img2.jpg
            ...
        class2/
            imgWhatever.jpg
            ...
        ...
    

    In order to give you this:

    output/
        train/
            class1/
                img1.jpg
                ...
            class2/
                imga.jpg
                ...
        val/
            class1/
                img2.jpg
                ...
            class2/
                imgb.jpg
                ...
        test/            # optional
            class1/
                img3.jpg
                ...
            class2/
                imgc.jpg
                ...
    

    From the documentation:

    import split_folders

    # Split with a ratio.
    # To split into only training and validation sets, pass a two-element
    # tuple to `ratio`, e.g. `(.8, .2)`.
    split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1))  # default values

    # Split val/test with a fixed number of items, e.g. 100 for each set.
    # To split into only training and validation sets, pass a single number
    # to `fixed`, e.g. `10`.
    split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False)  # default values
    

    With this new folder arrangement you can easily use Keras data generators to read the training and validation sets separately and then train your model.

    import tensorflow as tf
    import split_folders
    import os
    
    main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
    output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'
    
    split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))
    
    train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1./255)  # normalize pixel values to [0, 1]
    
    train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                        class_mode='categorical',
                                                        batch_size=32,
                                                        target_size=(224,224),
                                                        shuffle=True)
    
    validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                            target_size=(224, 224),
                                                            batch_size=32,
                                                            class_mode='categorical',
                                                            shuffle=True) # set as validation data
    
    IMG_SHAPE = (224, 224, 3)  # must match target_size, plus channels

    base_model = tf.keras.applications.ResNet50V2(
        input_shape=IMG_SHAPE,
        include_top=False,
        weights=None)
    
    maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
    prediction_layer = tf.keras.layers.Dense(4, activation='softmax')
    
    model = tf.keras.Sequential([
        base_model,
        maxpool_layer,
        prediction_layer
    ])
    
    opt = tf.keras.optimizers.Adam(learning_rate=0.004)
    model.compile(optimizer=opt,
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=['accuracy'])
    
    model.fit(
        train_generator,
        steps_per_epoch = train_generator.samples // 32,
        validation_data = validation_generator,
        validation_steps = validation_generator.samples // 32,
        epochs = 20)