Let's say we have two classes: one is small and the other is large.
I would like to apply data augmentation, similar to ImageDataGenerator,
to the small class and to sample from the large class, so that each batch is balanced (augmentation for the minority class, sampling for the majority class).
Also, I would like to continue using image_dataset_from_directory
(since the dataset doesn't fit into RAM).
What about the sample_from_datasets function?
import tensorflow as tf


def augment(val):
    # Example of an augmentation function: subtract a small random offset
    return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)


big_dataset_size = 1000
small_dataset_size = 10

# Init some datasets
dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(
    tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(
    -tf.range(1, 1 + small_dataset_size, dtype=tf.float32))

# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)

# Draw elements from both datasets with equal probability
dataset = tf.data.experimental.sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.5, 0.5]
)

dataset = dataset.shuffle(100)
dataset = dataset.batch(6)

iterator = dataset.as_numpy_iterator()
for i in range(5):
    print(next(iterator))
# [109. -10.044552 136. 140. -1.0505208 -5.0829906]
# [122. 108. 141. -4.0211563 126. 116. ]
# [ -4.085523 111. -7.0003924 -7.027302 -8.0362625 -4.0226436]
# [ -9.039093 118. -1.0695585 110. 128. -5.0553837]
# [100. -2.004463 -9.032592 -8.041705 127. 149. ]
Set the desired balance between the classes with the weights parameter of sample_from_datasets.
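To make the effect of weights concrete, here is a minimal sketch that measures the share of small-class elements actually drawn for two weight settings. The helper name negative_fraction, the sample count and the seed are assumptions made for illustration; both toy datasets are repeated so that neither runs out during the measurement.

```python
import tensorflow as tf


def negative_fraction(weights, num_samples=2000, seed=0):
    """Empirical share of small-class (negative) elements for the given weights."""
    # Toy datasets as above: positives are the large class, negatives the small one
    large = tf.data.Dataset.from_tensor_slices(tf.range(100.0, 1100.0)).repeat()
    small = tf.data.Dataset.from_tensor_slices(-tf.range(1.0, 11.0)).repeat()
    mixed = tf.data.experimental.sample_from_datasets(
        [large, small], weights=weights, seed=seed
    )
    # Collect num_samples elements into a single tensor
    values = next(iter(mixed.take(num_samples).batch(num_samples)))
    return float(tf.reduce_mean(tf.cast(values < 0, tf.float32)))


for weights in ([0.5, 0.5], [0.8, 0.2]):
    print(weights, round(negative_fraction(weights), 3))
```

The measured fractions should land close to the second weight in each pair, since sampling is random per element rather than an exact quota per batch.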
As Yaoshiang noticed, the last batches are imbalanced and the dataset lengths differ: once one dataset is exhausted, the remaining elements all come from the other. This can be avoided by using
# Repeat infinitely both datasets and augment the small one
dataset_class_large_positive = dataset_class_large_positive.repeat()
dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
instead of
# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)
In this case, however, the dataset is infinite, so the number of batches per epoch has to be controlled separately.
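As a rough sketch of how the infinite variant could be bounded and tied back to image_dataset_from_directory: the code below assumes each class lives in its own folder and reads every folder as a separate unlabeled dataset. The function names, the 64x64 image size, the label values, the augmentation ops and the steps_per_epoch default are all illustrative assumptions, not part of the original answer.

```python
import tensorflow as tf


def load_class(directory, label, image_size=(64, 64)):
    """Stream the images of one class folder from disk and attach a fixed label."""
    images = tf.keras.utils.image_dataset_from_directory(
        directory,
        labels=None,            # images only; the label is attached below
        image_size=image_size,
        batch_size=32,
    )
    # Unbatch so that sampling happens per image rather than per batch
    return images.unbatch().map(lambda img: (img, label))


def augment_image(img, label):
    # Hypothetical augmentation; pixel values are floats in the 0..255 range here
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_brightness(img, max_delta=20.0)
    return img, label


def make_balanced_dataset(large_dir, small_dir, batch_size=32, steps_per_epoch=100):
    # Repeat both classes infinitely and augment only the small one
    large = load_class(large_dir, label=0).repeat()
    small = load_class(small_dir, label=1).repeat().map(augment_image)
    mixed = tf.data.experimental.sample_from_datasets(
        [large, small], weights=[0.5, 0.5]
    )
    # take() bounds the otherwise infinite stream to a fixed number of batches
    return mixed.batch(batch_size).take(steps_per_epoch).prefetch(tf.data.AUTOTUNE)
```

Because take() makes the dataset finite, iterating it once yields exactly steps_per_epoch batches. An alternative is to drop take() and pass steps_per_epoch to model.fit instead.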