tensorflow, image-processing, keras, sampling, data-augmentation

Sampling for the large class and augmentation for the small class in each batch


Let's say we have two classes: one is small and the second is large.

I would like to use data augmentation (similar to ImageDataGenerator) for the small class and sampling for the large class, in such a way that each batch would be balanced (augmentation for the minor class, sampling for the major class).

Also, I would like to continue using image_dataset_from_directory (since the dataset doesn't fit into RAM).


Solution

  • What about the sample_from_datasets function?

    import tensorflow as tf
    
    def augment(val):
        # Toy augmentation: subtract a small random offset from each value
        return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)
    
    big_dataset_size = 1000
    small_dataset_size = 10
    
    # Init some datasets
    dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
    dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(-tf.range(1, 1 + small_dataset_size, dtype=tf.float32))
    
    # Upsample and augment small dataset
    dataset_class_small_negative = dataset_class_small_negative \
        .repeat(big_dataset_size // small_dataset_size) \
        .map(augment)
    
    # sample_from_datasets lives under tf.data.experimental;
    # on TF >= 2.7, tf.data.Dataset.sample_from_datasets is the stable equivalent
    dataset = tf.data.experimental.sample_from_datasets(
        datasets=[dataset_class_large_positive, dataset_class_small_negative],
        weights=[0.5, 0.5]
    )
    
    dataset = dataset.shuffle(100)
    dataset = dataset.batch(6)
    
    iterator = dataset.as_numpy_iterator()
    for i in range(5):
        print(next(iterator))
    
    # [109.        -10.044552  136.        140.         -1.0505208  -5.0829906]
    # [122.        108.        141.         -4.0211563 126.        116.       ]
    # [ -4.085523  111.         -7.0003924  -7.027302   -8.0362625  -4.0226436]
    # [ -9.039093  118.         -1.0695585 110.        128.         -5.0553837]
    # [100.        -2.004463  -9.032592  -8.041705 127.       149.      ]
    

    Set up the desired balance between the classes in the weights parameter of sample_from_datasets.
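
    The question also asks to keep using image_dataset_from_directory. Continuing the example above, here is a minimal sketch of how the same pattern could be applied in that setting, assuming one folder per class; the paths, image size, label values and repeat ratio are placeholders, and the toy augment() above stands in for a real augmentation pipeline:

    def load_class(path, label):
        # Load one class folder without labels, unbatch so that
        # sample_from_datasets mixes individual images, then attach the label
        ds = tf.keras.utils.image_dataset_from_directory(
            path, labels=None, image_size=(224, 224), shuffle=True)
        return ds.unbatch().map(lambda img: (img, label))
    
    ds_large = load_class("data/large_class", 0)
    ds_small = load_class("data/small_class", 1)
    
    # Upsample and augment the small class, mirroring the toy example above
    ratio = 100  # placeholder for len(large) // len(small)
    ds_small = ds_small.repeat(ratio).map(lambda img, lbl: (augment(img), lbl))
    
    dataset = tf.data.experimental.sample_from_datasets(
        [ds_large, ds_small], weights=[0.5, 0.5])
    dataset = dataset.shuffle(100).batch(32).prefetch(tf.data.AUTOTUNE)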

    As Yaoshiang noticed, the last batches can be imbalanced because the two datasets have different lengths. This can be avoided by using

    # Repeat infinitely both datasets and augment the small one
    dataset_class_large_positive = dataset_class_large_positive.repeat()
    dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
    

    instead of

    # Upsample and augment small dataset
    dataset_class_small_negative = dataset_class_small_negative \
        .repeat(big_dataset_size // small_dataset_size) \
        .map(augment)
    

    In this case, however, the dataset is infinite and the number of batches per epoch has to be controlled separately, as sketched below.
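
    Continuing the example, a minimal sketch of bounding an "epoch" over the infinite mixed dataset; it assumes the pipeline above has been rebuilt from the infinitely repeated datasets, and the batch size and epoch length are placeholders:

    batch_size = 6
    # Roughly one pass over the large class at a 50/50 mixing ratio
    steps_per_epoch = (2 * big_dataset_size) // batch_size
    
    # Option 1: take a finite slice of the infinite dataset for one epoch
    finite_epoch = dataset.take(steps_per_epoch)
    print(sum(1 for _ in finite_epoch))  # -> steps_per_epoch batches
    
    # Option 2: keep the dataset infinite and let Keras stop each epoch, e.g.
    # model.fit(dataset, epochs=10, steps_per_epoch=steps_per_epoch)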