python · database · tensorflow · keras · data-augmentation

Tensorflow does not apply data augmentation properly


I'm trying to apply data augmentation to a dataset. I use the following code:

train_generator = keras.utils.image_dataset_from_directory(
    directory = train_dir,
    subset = "training",
    image_size = (50,50),
    batch_size = 32,
    validation_split = 0.3,
    seed = 1337,
    labels = "inferred",
    label_mode = 'binary'
)

validation_generator = keras.utils.image_dataset_from_directory(
    subset = "validation",
    directory = validation_dir,
    image_size = (50,50),
    batch_size = 40,
    seed = 1337,
    validation_split = 0.3,
    labels = "inferred",
    label_mode = 'binary'
)

data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
])

train_dataset = train_generator.map(lambda x, y: (data_augmentation(x, training=True), y))

But when I try to run the training process using this method, I get an "insufficient data" warning:

6/100 [>.............................] - ETA: 21s - loss: 0.7602 - accuracy: 0.5200WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 10 batches). You may need to use the repeat() function when building your dataset.

Yes, the original dataset is insufficient on its own, but the data augmentation should provide more than enough data for the training. Does anyone know what's going on?

EDIT:

fit call:

history = model.fit(
    train_dataset,
    epochs = 20,
    steps_per_epoch = 100,
    validation_data = validation_generator,
    validation_steps = 10,
    callbacks = callbacks_list
)
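
(For reference, with steps_per_epoch = 100 and epochs = 20 the fit call asks for 100 * 20 = 2000 training batches in total, which is the figure quoted in the first warning; the 10 batches in the second warning line up with validation_steps = 10.)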

This is the version I have using ImageDataGenerator:

train_datagen = keras.preprocessing.image.ImageDataGenerator(
    rescale = 1/255, rotation_range = 40, width_shift_range = 0.2,
    height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True)

train_generator = train_datagen.flow_from_directory(
    directory = train_dir, target_size = (50,50), batch_size = 32, class_mode = 'binary')

val_datagen = keras.preprocessing.image.ImageDataGenerator(rescale = 1/255)
validation_generator = val_datagen.flow_from_directory(
    directory = validation_dir, target_size = (50,50), batch_size = 40, class_mode = 'binary')

This specific code (with the same number of epochs, steps_per_epoch and batch size) was taken from the book Deep Learning with Python by François Chollet; it's the data augmentation example on page 141. As you may have guessed, it produces the same result as the method shown above.


Solution

  • When we say that data augmentation increases the number of instances, we usually mean that an altered version of each sample is created for the model to process; it is just image preprocessing with randomness. Mapping the augmentation over the dataset therefore does not add any batches to it (a short check after the sample code below makes this concrete).

    If you closely inspect your training log, you will find the explanation, shown below: the mapped dataset still yields only one pass over the original training split, so it cannot supply steps_per_epoch * epochs batches.

    WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.

    So, to solve this, we can use the .repeat() function. To understand what it does, you can check this answer. Here is sample code that should work for you.

    train_ds = keras.utils.image_dataset_from_directory(
        ...
    )
    train_ds = train_ds.map(
        lambda x, y: (data_augmentation(x, training=True), y)
    )
    val_ds = keras.utils.image_dataset_from_directory(
        ...
    )

    # read the number of batches *before* calling .repeat(),
    # because a repeated dataset reports an infinite cardinality
    train_steps = train_ds.cardinality().numpy()
    val_steps = val_ds.cardinality().numpy()

    # using the .repeat() function so the input never runs out of batches
    train_ds = train_ds.repeat().shuffle(8 * batch_size)
    train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

    val_ds = val_ds.repeat()
    val_ds = val_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

    # specify steps per epoch
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=...,
        steps_per_epoch=train_steps,
        validation_steps=val_steps,
    )
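
    To see concretely why the augmentation map does not add batches, and why the cardinality should be read before .repeat(), here is a minimal, self-contained check. It uses a small synthetic dataset as a stand-in for image_dataset_from_directory; the 70-sample size is a made-up placeholder, while the 50x50 image size and batch size of 32 mirror the question.

    import tensorflow as tf
    from tensorflow import keras

    # stand-in for image_dataset_from_directory: 70 fake 50x50 RGB images in batches of 32
    images = tf.random.uniform((70, 50, 50, 3))
    labels = tf.random.uniform((70, 1), maxval=2, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

    data_augmentation = keras.Sequential([
        keras.layers.RandomFlip("horizontal"),
        keras.layers.RandomRotation(0.1),
        keras.layers.RandomZoom(0.1),
    ])

    # mapping the augmentation does not change the number of batches per pass
    aug_ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y))
    print(ds.cardinality().numpy())        # 3
    print(aug_ds.cardinality().numpy())    # still 3

    # after .repeat() the cardinality becomes infinite, so read it first
    steps = aug_ds.cardinality().numpy()
    repeated = aug_ds.repeat()
    print(repeated.cardinality().numpy())  # -1, i.e. tf.data.INFINITE_CARDINALITY
    print(steps)                           # 3 -> this is the value to pass as steps_per_epoch

    With the real directory-based datasets the numbers will differ, but the pattern is the same: the augmented dataset still runs out after one pass unless it is repeated, and steps_per_epoch should come from the cardinality measured before the .repeat() call.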