tensorflow, image-processing, keras, sampling, data-augmentation

Sampling for the large class and augmentation for the small class in each batch


Let's say we have two classes: one is small and the second is large.

I would like to use data augmentation (similar to ImageDataGenerator) for the small class and sampling for the large class, in such a way that each batch would be balanced (augmentation for the minor class, sampling for the major class).

Also, I would like to continue using image_dataset_from_directory (since the dataset doesn't fit into RAM).


Solution

  • What about the sample_from_datasets function?

    import tensorflow as tf
    
    def augment(val):
        # Toy augmentation: subtract a small random offset from each value
        return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)
    
    big_dataset_size = 1000
    small_dataset_size = 10
    
    # Init some datasets
    dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
    dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(-tf.range(1, 1 + small_dataset_size, dtype=tf.float32))
    
    # Upsample and augment small dataset
    dataset_class_small_negative = dataset_class_small_negative \
        .repeat(big_dataset_size // small_dataset_size) \
        .map(augment)
    
    # sample_from_datasets lives under tf.data.experimental;
    # on TF >= 2.7, tf.data.Dataset.sample_from_datasets is the stable equivalent
    dataset = tf.data.experimental.sample_from_datasets(
        datasets=[dataset_class_large_positive, dataset_class_small_negative],
        weights=[0.5, 0.5]
    )
    
    dataset = dataset.shuffle(100)
    dataset = dataset.batch(6)
    
    iterator = dataset.as_numpy_iterator()
    for i in range(5):
        print(next(iterator))
    
    # [109.        -10.044552  136.        140.         -1.0505208  -5.0829906]
    # [122.        108.        141.         -4.0211563 126.        116.       ]
    # [ -4.085523  111.         -7.0003924  -7.027302   -8.0362625  -4.0226436]
    # [ -9.039093  118.         -1.0695585 110.        128.         -5.0553837]
    # [100.        -2.004463  -9.032592  -8.041705 127.       149.      ]
    

    Set up the desired balance between the classes in the weights parameter of sample_from_datasets.
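
    The question also asks to keep using image_dataset_from_directory. Continuing the example above, here is a minimal sketch of how the same pattern could be applied in that setting, assuming one folder per class; the paths, image size, label values and repeat ratio are placeholders, and the toy augment() above stands in for a real augmentation pipeline:

    def load_class(path, label):
        # Load one class folder without labels, unbatch so that
        # sample_from_datasets mixes individual images, then attach the label
        ds = tf.keras.utils.image_dataset_from_directory(
            path, labels=None, image_size=(224, 224), shuffle=True)
        return ds.unbatch().map(lambda img: (img, label))
    
    ds_large = load_class("data/large_class", 0)
    ds_small = load_class("data/small_class", 1)
    
    # Upsample and augment the small class, mirroring the toy example above
    ratio = 100  # placeholder for len(large) // len(small)
    ds_small = ds_small.repeat(ratio).map(lambda img, lbl: (augment(img), lbl))
    
    dataset = tf.data.experimental.sample_from_datasets(
        [ds_large, ds_small], weights=[0.5, 0.5])
    dataset = dataset.shuffle(100).batch(32).prefetch(tf.data.AUTOTUNE)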

    As Yaoshiang noticed, the last batches can be imbalanced because the two datasets have different lengths. This can be avoided by using

    # Repeat infinitely both datasets and augment the small one
    dataset_class_large_positive = dataset_class_large_positive.repeat()
    dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
    

    instead of

    # Upsample and augment small dataset
    dataset_class_small_negative = dataset_class_small_negative \
        .repeat(big_dataset_size // small_dataset_size) \
        .map(augment)
    

    In this case, however, the dataset is infinite and the number of batches per epoch has to be controlled separately, as sketched below.
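
    Continuing the example, a minimal sketch of bounding an "epoch" over the infinite mixed dataset; it assumes the pipeline above has been rebuilt from the infinitely repeated datasets, and the batch size and epoch length are placeholders:

    batch_size = 6
    # Roughly one pass over the large class at a 50/50 mixing ratio
    steps_per_epoch = (2 * big_dataset_size) // batch_size
    
    # Option 1: take a finite slice of the infinite dataset for one epoch
    finite_epoch = dataset.take(steps_per_epoch)
    print(sum(1 for _ in finite_epoch))  # -> steps_per_epoch batches
    
    # Option 2: keep the dataset infinite and let Keras stop each epoch, e.g.
    # model.fit(dataset, epochs=10, steps_per_epoch=steps_per_epoch)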