I am training a neural network in TensorFlow. Loading my whole training set (input images and "ground truth" images) into memory at once was causing out-of-memory errors, so I am trying to stream the data with a generator so that only a few images are loaded at a time. My code takes each image and subdivides it into a set of many smaller images. This is the generator class I am using, based on a tutorial I found online:
class DataGenerator(keras.utils.all_utils.Sequence):
    'Generates data for Keras'
    def __init__(self,
                 channel,
                 pairs,
                 prediction_size=200,
                 input_normalizing_function_name='standardize',
                 label="",
                 batch_size=1):
        'Initialization'
        self.channel = channel
        self.prediction_size = prediction_size
        self.batch_size = batch_size
        self.pairs = pairs
        self.id_list = list(self.pairs.keys())
        self.input_normalizing_function_name = input_normalizing_function_name
        self.label = label
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.id_list) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        print("{} Indexes is {}".format(self.label, indexes))
        # Find list of IDs
        subset_pair_id_list = [self.id_list[k] for k in indexes]
        print("\t{} subset_pair_id_list is {}".format(self.label, subset_pair_id_list))
        # Generate data
        normalized_input_frames, normalized_gt_frames = self.__data_generation(subset_pair_id_list)
        print("in __getitem, returning data batch")
        return normalized_input_frames, normalized_gt_frames

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = list(range(len(self.id_list)))

    def __data_generation(self, subset_pair_id_list):
        'subdivides each image into an array of multiple images'
        # Initialization
        normalized_input_frames, normalized_gt_frames = get_normalized_input_and_gt_dataframes(
            channel=self.channel,
            pairs_for_training=self.pairs,
            pair_ids=subset_pair_id_list,
            input_normalizing_function_name=self.input_normalizing_function_name,
            prediction_size=self.prediction_size
        )
        print("\t\t\t~~~In data generation: input shape: {}, gt shape: {}".format(normalized_input_frames.shape, normalized_gt_frames.shape))
        # Return the arrays produced above
        return normalized_input_frames, normalized_gt_frames
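For reference, the index arithmetic in `__len__` and `__getitem__` can be sketched in plain Python, without Keras. This is only an illustration: `batch_slices` is a hypothetical helper that is not part of the code above, and it assumes the indexes are the unshuffled `range` that `on_epoch_end` builds.

```python
import math


def batch_slices(num_items, batch_size):
    """Return the index lists that successive __getitem__ calls would use.

    Hypothetical helper mirroring the generator's arithmetic:
    floor(num_items / batch_size) batches, each a contiguous slice.
    """
    num_batches = math.floor(num_items / batch_size)  # same as __len__
    indexes = list(range(num_items))                  # same as on_epoch_end
    return [
        indexes[i * batch_size:(i + 1) * batch_size]  # same slice as __getitem__
        for i in range(num_batches)
    ]


print(batch_slices(5, 2))  # [[0, 1], [2, 3]]
```

Because of the floor, any items left over after the last full batch are skipped for that epoch (item 4 in the example). With `batch_size=1`, as in the log output further down, every item is used.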
I use one instance of this generator for the training data and a second instance for the validation data. For example:
training_data_generator = DataGenerator(
    pairs=pairs_for_training,
    prediction_size=prediction_size,
    input_normalizing_function_name=input_normalizing_function_name,
    batch_size=batch_size,
    channel=channel,
    label="training generator"
)
Then I start training, which I am running with model.fit:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
learning_rate = 0.0001
opt = tf.keras.optimizers.Adam(learning_rate)
l = tf.keras.losses.MeanSquaredError()
print("Compiling model...")
model.compile(loss=l, optimizer=opt)
print('\tTraining model...')
with tf.device('/device:GPU:0'):
    model_history = model.fit(
        training_data_generator,
        validation_data=validation_data_generator,
        epochs=eps,
        callbacks=[callback]
    )
This is the last bit of the print outputs before the failure:
Epoch 1/1000
training generator Indexes is [0]
training generator subset_pair_id_list is ['A']
Loading batch of 1 pairs...
['A']
num data is 1
~~~In data generation: input shape: (5, 100, 100, 1), gt shape: (5, 100, 100, 1)
in __getitem, returning data batch
This step fails, however, with a strange error about tensor size mismatches. I assumed it was caused by my use of the generator, since it did not happen before I introduced the generators:
File "/root/micromamba/envs/training/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 3 must match. Input 1 has shape [5 25 25 32] and doesn't match input 0 with shape [5 24 24 64].
[[node gradient_tape/model/concatenate/ConcatOffset (defined at /bin/train.py:633) ]] [Op:__inference_train_function_1982]
I tried using breakpoints to dig into the TensorFlow code and figure out why it is generating these tensors, but I couldn't find the function that actually creates them and couldn't get to the bottom of what's going on. Each returned set of input and ground truth data has shape (5, 100, 100, 1), so I don't know where the 25, 24, 32, and 64 values in the error message come from. What might be going on here? I assumed that each batch was returned and used for training, then discarded before the generator fetched the next batch, but the error message suggests that some sort of concatenation operation is being attempted.
It turns out that there was nothing wrong with the way I was using the generator. The problem was the image size I was specifying. The model is a U-Net, so the image is downsampled as it progresses through the layers and then upsampled again. Because the image size was not a multiple of 16, the downsampling rounded the dimensions, and I ended up with layers of different sizes that could not be concatenated. This explains it: https://stackoverflow.com/questions/68266736/tensorflow-python-framework-errors-impl-invalidargumenterror-all-dimensions-exc#:~:text=It%20is%20likely%20originates%20from,summary()%20.
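The rounding can be reproduced with plain integer arithmetic. The sketch below is not the actual model: it assumes four downsampling steps (which is what "multiple of 16" implies), that each step halves the size and rounds down, and that each upsampling step doubles it exactly. The function names `first_mismatch` and `round_up_to_valid` are made up for this illustration.

```python
def first_mismatch(size, depth=4):
    """Trace one spatial dimension through an assumed U-Net layout.

    Returns (skip_size, upsampled_size) for the first skip connection
    whose two inputs disagree, or None if every concatenation lines up.
    """
    skips = []
    for _ in range(depth):
        skips.append(size)  # feature map kept for the skip connection
        size //= 2          # downsampling rounds odd sizes down
    for skip in reversed(skips):
        size *= 2           # upsampling doubles the size exactly
        if size != skip:
            return skip, size
    return None


def round_up_to_valid(size, depth=4):
    """Smallest size >= `size` that is a multiple of 2**depth."""
    multiple = 2 ** depth
    return -(-size // multiple) * multiple  # ceiling division


print(first_mismatch(100))     # (25, 24)
print(first_mismatch(96))      # None
print(round_up_to_valid(100))  # 112
```

For a 100-pixel input the sizes go 100, 50, 25, 12, 6 on the way down. On the way up, 6 doubles to 12 (which matches) and 12 doubles to 24, which has to be concatenated with the 25-pixel skip connection. That is the `[5 25 25 32]` versus `[5 24 24 64]` pair in the error message. Under these assumptions, sizes such as 96 or 112 avoid the problem.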