Tags: python, tensorflow, conv-neural-network, image-preprocessing

How to check preprocessing time/speed in Colab?


I am training a neural network on a Google Colab GPU. To do so, I synchronized the input images (180k in total: 105k for training, 76k for validation) with my Google Drive. I then mount the Drive in Colab and work from there: I load a CSV file with the image paths and labels and store it as a pandas DataFrame, from which I build a list of image paths and a list of labels.
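
A minimal sketch of that setup looks like this (the CSV path and the column names image_path and label are placeholders, not the actual names):

from google.colab import drive
import pandas as pd

# Mount Google Drive so the synced images and the CSV file are accessible.
drive.mount('/content/drive')

# Placeholder path and column names; adjust them to the actual file.
df = pd.read_csv('/content/drive/My Drive/data/labels.csv')
X = df['image_path'].tolist()   # list of image file paths
y = df['label'].tolist()        # list of label strings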

I use this function to one-hot encode my labels, because I need a special output shape of (7, 35) per label, which the existing default functions cannot produce:

# One-hot encoding of the labels; the target array has a shape of (7, 35)

def my_onehot_encoded(label):
    # define universe of possible input values
    characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
    # define a mapping of chars to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    int_to_char = dict((i, c) for i, c in enumerate(characters))
    # integer encode input data
    integer_encoded = [char_to_int[char] for char in label]
    # one hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)

    return onehot_encoded
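
For illustration, with a made-up 7-character label such as 'ABC1234' the function returns 7 one-hot vectors of length 35:

import numpy as np

encoded = my_onehot_encoded('ABC1234')   # 'ABC1234' is only an illustrative label
print(np.array(encoded).shape)           # (7, 35)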

After that, I use a custom DataGenerator to feed the data into my model in batches. x_set is a list of paths to my images and y_set are the corresponding one-hot encoded labels:

import math
import numpy as np
from skimage.io import imread            # assumption: imread/resize come from scikit-image
from skimage.transform import resize
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # read, resize, and scale one batch of images on the fly
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)

        return batch_x, batch_y

And with this code I apply the DataGenerator to my data:

training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
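
A quick sanity check of a single batch (not part of the original code) is to index the generator directly and inspect the shapes:

batch_x, batch_y = training_generator[0]
print(batch_x.shape)   # e.g. (32, 224, 224, 3) for RGB images
print(batch_y.shape)   # (32, 7, 35)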

When I now train my model, one epoch takes 25-40 minutes, which is very long.

model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch = num_train_samples // 16,
                    validation_steps = num_val_samples // 16,
                    epochs = 10, workers=6, use_multiprocessing=True)

I was now wondering how to measure the preprocessing time, because I don't think the slow training is due to the model size: I already experimented with models that have fewer parameters, and the training time did not decrease significantly. So I am suspicious of the preprocessing.


Solution

  • To measure time in Colab, you can use the ipython-autotime package:

    !pip install ipython-autotime
    
    %load_ext autotime
    

    Additionally for profiling, you can use %time as mentioned here.
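
    As a rough first check of the preprocessing cost itself, you can also time a single batch produced by the generator (a simple sketch, assuming the training_generator from the question):

    import time

    # Time how long the generator needs to read, resize, and scale one batch of 32 images.
    start = time.perf_counter()
    batch_x, batch_y = training_generator[0]
    print(f'one batch took {time.perf_counter() - start:.2f} s')

    # Or, in a notebook cell, simply:
    # %time batch_x, batch_y = training_generator[0]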

    In general, to make the generator run faster, I suggest copying the data from Google Drive to the local disk of the Colab instance; reading files directly from Drive is much slower.
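
    A rough sketch of that copy step (both paths are placeholders):

    import shutil

    # Copy the image folder from the mounted Drive to the Colab VM's local disk.
    shutil.copytree('/content/drive/My Drive/images', '/content/images')

    In practice it is usually much faster to keep the images in a single zip archive on Drive and unzip it to the local disk, rather than copying 180k individual files.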

    If you are using TensorFlow 2.0, the cause could be this bug.

    Workarounds are:

    • Call tf.compat.v1.disable_eager_execution() at the start of the code
    • Use model.fit rather than model.fit_generator. The former supports generators anyway (see the sketch after this list).
    • Downgrade to TF 1.14
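
    A minimal sketch of the first two workarounds, assuming model is the compiled Keras model from the question:

    import tensorflow as tf

    # Workaround 1: disable eager execution before building the model (TF 2.0).
    tf.compat.v1.disable_eager_execution()

    # Workaround 2: pass the Sequence objects directly to model.fit
    # instead of model.fit_generator.
    model.fit(training_generator,
              validation_data=validation_generator,
              epochs=10, workers=6, use_multiprocessing=True)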

    Regardless of the TensorFlow version, limit how much disk access you are doing, as this is often the bottleneck.
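
    One way to cut down repeated disk access (a sketch, not from the original answer) is to resize every image once, cache the result on the local disk, and let the generator load the cached arrays instead of the original files:

    import os
    import numpy as np
    from skimage.io import imread
    from skimage.transform import resize

    # One-off pass: resize each image and store it as uint8 on the local disk,
    # so later epochs read small local files instead of resizing Drive files again.
    os.makedirs('/content/cache', exist_ok=True)
    for path in X_train:                      # X_train: list of image paths
        out = os.path.join('/content/cache', os.path.basename(path) + '.npy')
        if not os.path.exists(out):
            img = resize(imread(path), (224, 224), preserve_range=True).astype(np.uint8)
            np.save(out, img)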

    Note that there does seem to be an issue with generators being slow in TF 1.13.2 and 2.0.1 (at least).