tensorflow tensorflow-datasets feature-selection

Preprocessing for TensorFlow Dataset 'cats_vs_dogs'

I am trying to write a preprocessing function so that training_dataset can be fed directly into a Keras Sequential model. The function should return features and labels.

def preprocessing_function(data):
    features = ...
    labels = ...
    return features, labels

dataset, info = tfds.load(name='cats_vs_dogs', split=tfds.Split.TRAIN, with_info=True)

training_dataset = dataset.map(preprocessing_function)

How should I write the preprocessing_function? I spent several hours researching and trying to make it happen, but to no avail. Hoping someone can assist.

Solution

Here are two preprocessing functions. The first is applied to both the training and validation data: it resizes each image to the input size the network expects and normalizes it. The second, augmentation, is applied to the training set only. Which augmentations make sense depends on your dataset and application; this one is just an example.

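# Fetching, preprocessing, and preparing the data pipeline
import tensorflow as tf
import tensorflow_datasets as tfds

# Placeholder values (assumptions, not part of the dataset); adjust to your network.
IMG_H, IMG_W, NUM_CLASSES, BATCH_SIZE = 224, 224, 2, 32
MEAN, STD = 127.5, 127.5  # maps [0, 255] pixel values to roughly [-1, 1]

def preprocess(ds):
    x = tf.image.resize_with_pad(ds['image'], IMG_H, IMG_W)  # arguments are (height, width)
    x = tf.cast(x, tf.float32)
    x = (x - MEAN) / STD  # standardize with the standard deviation, not the variance
    y = tf.one_hot(ds['label'], NUM_CLASSES)
    return x, y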

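def augmentation(image, label):
    # Mapped after .batch(), so image is a 4-D batch: [batch, height, width, channels].
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize_with_crop_or_pad(image, IMG_H + 4, IMG_W + 4)  # zero-pad by 4 pixels in total (2 per side)
    # Random crop back to IMG_H x IMG_W; one crop offset is shared by the whole batch.
    image = tf.image.random_crop(image, size=[tf.shape(image)[0], IMG_H, IMG_W, 3])
    return image, label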

To load the training and validation datasets, do something like this:

def get_dataset(dataset_name, shuffle_buff_size=1024, batch_size=BATCH_SIZE, augmented=True):
    train, info_train = tfds.load(dataset_name, split='train[:80%]', with_info=True)
    val, info_val = tfds.load(dataset_name, split='train[80%:]', with_info=True)

    # Approximate integer sizes of the 80/20 sub-splits. They are needed for
    # steps_per_epoch / validation_steps because both pipelines repeat forever.
    total = info_train.splits['train'].num_examples
    TRAIN_SIZE = int(total * 0.8)
    VAL_SIZE = total - TRAIN_SIZE

    # Shuffle before repeat so that examples from different epochs are not mixed.
    train = train.map(preprocess).cache().shuffle(shuffle_buff_size).repeat().batch(batch_size)
    if augmented:
        train = train.map(augmentation)
    train = train.prefetch(tf.data.experimental.AUTOTUNE)

    val = val.map(preprocess).cache().repeat().batch(batch_size)
    val = val.prefetch(tf.data.experimental.AUTOTUNE)

    return train, info_train, val, info_val, TRAIN_SIZE, VAL_SIZE
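
Because both pipelines call repeat(), they never run out of data, so Keras has to be told how many batches make up one epoch. Here is a minimal sketch of that arithmetic, assuming the sizes returned by get_dataset above. The helper name steps_per_epoch and the image counts are made up for illustration and are not the real size of cats_vs_dogs.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Smallest number of batches that covers every example at least once."""
    return math.ceil(num_examples / batch_size)

# Hypothetical sizes for illustration only (an 80/20 split of 20,000 images).
train_steps = steps_per_epoch(16000, 32)
val_steps = steps_per_epoch(4000, 32)
print(train_steps, val_steps)  # prints: 500 125
```

These values would then go into model.fit as steps_per_epoch=train_steps and validation_steps=val_steps, with train as the training data and val as validation_data.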