I am training a neural network on a Google Colab GPU, so I synced the input images (roughly 180k in total: 105k for training, 76k for validation) to my Google Drive. I then mount the Drive in Colab and work from there. I load a CSV file with image paths and labels into Colab, store it as a pandas DataFrame, and from that build a list of image paths and a list of labels.
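For reference, that loading step looks roughly like this (the CSV path and the column names image_path and label are illustrative placeholders, not my actual names):
import pandas as pd
# Placeholder path and column names
df = pd.read_csv('/content/drive/MyDrive/labels.csv')
X = df['image_path'].tolist()  # list of image file paths
y = df['label'].tolist()       # list of raw string labels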
I use the following function to one-hot encode my labels, because I need a special output shape of (7, 35) per label, which the existing default encoders cannot produce:
# One-hot encoding of the labels; the target array has a shape of (7, 35)
def my_onehot_encoded(label):
    # universe of possible input characters (35 in total, 'O' is omitted)
    characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
    # define a mapping of chars to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    # integer-encode the input label
    integer_encoded = [char_to_int[char] for char in label]
    # one-hot encode each character of the label
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)
    return onehot_encoded
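As a quick check, a seven-character label (here 'AB12345', a purely made-up example) produces the expected (7, 35) shape:
import numpy as np
# 'AB12345' is an illustrative 7-character label
encoded = np.array(my_onehot_encoded('AB12345'))
print(encoded.shape)  # (7, 35): one 35-way one-hot row per character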
After that I use a custom DataGenerator (a keras.utils.Sequence subclass) to feed the data to my model in batches. x_set is the list of image paths and y_set are the corresponding one-hot encoded labels:
import math
import numpy as np
from skimage.io import imread        # assuming skimage for imread/resize
from skimage.transform import resize
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # read and resize the images of the current batch
        batch_x = self.x[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1. / 255  # scale pixel values
        batch_y = self.y[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_y = np.array(batch_y)
        return batch_x, batch_y
And with this code I apply the DataGenerator
to my data:
training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
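For completeness, X_train, y_train, X_val and y_val come from splitting the path and label lists. A minimal sketch (the split function and ratio here are illustrative; my real split is the fixed 105k/76k partition mentioned above):
from sklearn.model_selection import train_test_split
# Illustrative random split; in my case the partition is fixed in advance
y_encoded = [my_onehot_encoded(label) for label in y]
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.42, random_state=42)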
When I now train my model, one epoch takes 25-40 minutes, which is very long.
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
steps_per_epoch = num_train_samples // 16,
validation_steps = num_val_samples // 16,
epochs = 10, workers=6, use_multiprocessing=True)
I was now wondering how to measure the preprocessing time, because I don't think the slowness is due to the model size: I already experimented with models with fewer parameters, and the training time did not decrease significantly... So I am suspicious of the preprocessing...
To measure time in Colab, you can use the autotime package:
!pip install ipython-autotime
%load_ext autotime
Additionally, for profiling you can use %time, as mentioned here.
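For example, you can time how long a single batch takes to produce, using the generator defined in the question; if this takes anywhere near a second, preprocessing is the bottleneck:
# Time one batch fetch from the question's generator
%time batch_x, batch_y = training_generator[0]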
In general, to ensure the generator runs faster, I suggest you copy the data from gdrive to the local disk of that Colab instance; otherwise every image read goes through the network-mounted Drive and can be slow.
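A sketch of that copy step, assuming the images sit on Drive as a single archive (the paths here are placeholders):
# Placeholders: adjust archive name and target directory to your setup
!cp "/content/drive/MyDrive/images.zip" /content/
!unzip -q /content/images.zip -d /content/images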
If you are using Tensorflow 2.0, the cause could be this bug. Workarounds are:
- Call tf.compat.v1.disable_eager_execution() at the start of the code
- Use model.fit rather than model.fit_generator. The former supports generators anyway (see the sketch below).
- Downgrade to TF 1.14
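Combining the first two workarounds would look roughly like this (a minimal sketch; model, the generators and the fit arguments are taken from the question):
import tensorflow as tf
# Fall back to graph mode to dodge the TF 2.0 generator slowdown
tf.compat.v1.disable_eager_execution()
# model.fit accepts Sequence objects directly, so fit_generator is not needed
model.fit(training_generator,
          validation_data=validation_generator,
          epochs=10, workers=6, use_multiprocessing=True)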
Regardless of the Tensorflow version, limit how much disk access you are doing; that is often a bottleneck.
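One generic way to cut that disk access down (a sketch, not something from the question): resize every image once up front and cache the small array locally, so each epoch loads preprocessed files instead of decoding full-size images:
import numpy as np
from skimage.io import imread
from skimage.transform import resize
# Hypothetical one-off caching pass: image_paths are the source files,
# cache_paths are matching local .npy destinations
for src, dst in zip(image_paths, cache_paths):
    np.save(dst, resize(imread(src), (224, 224)))
In the generator, np.load(file_name) would then replace the imread/resize pair.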
Note that there does seem to be an issue with generators being slow in TF 1.13.2 and 2.0.1 (at least).