I'm working with very large text datasets, so I thought about using the model.fit_generator method instead of plain model.fit. I tried this generator:
def TrainGenerator(inp, out):
    for i, o in zip(inp, out):
        yield i, o
When I try to use it during training:

# inp_train, out_train are lists of sequences padded to 50 tokens
model.fit_generator(generator=TrainGenerator(inp_train, out_train),
                    steps_per_epoch=BATCH_SIZE * 100,
                    epochs=20,
                    use_multiprocessing=True)
I get:
ValueError: Error when checking input: expected embedding_input to have shape (50,) but got array with shape (1,)
Now, I tried the simple model.fit method instead, and it works fine. So I think the problem is in the generator, but since I'm new to generators I don't know how to solve it. The full model summary is:
Layer (type)                 Output Shape
===========================================
Embedding (Embedding)        (None, 50, 400)
___________________________________________
Bi_LSTM_1 (Bidirectional)    (None, 50, 1024)
___________________________________________
Bi_LSTM_2 (Bidirectional)    (None, 50, 1024)
___________________________________________
Output (Dense)               (None, 50, 153)
===========================================
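The mismatch can be seen without Keras at all (a minimal sketch with dummy zero-filled data standing in for the real padded sequences): the generator yields one sequence at a time, so each item has shape (50,), while the model expects a batch of shape (batch_size, 50).

```python
import numpy as np

def TrainGenerator(inp, out):
    for i, o in zip(inp, out):
        yield i, o

# Dummy data: 4 sequences padded to 50 tokens (illustrative values only)
inp_train = [np.zeros(50, dtype=int) for _ in range(4)]
out_train = [np.zeros(50, dtype=int) for _ in range(4)]

x, y = next(TrainGenerator(inp_train, out_train))
print(np.asarray(x).shape)  # (50,) -- a single sequence, not a batch (N, 50)
```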
At first I added the solution as an edit to the question; for the sake of clarity I post it here as an answer:

The first comment triggered me somehow: I realized that I had misunderstood how generators work. My generator was yielding a single sequence of shape (50,) at a time, not a batch of N sequences each of shape (50,). So I dug into the Keras documentation and found the tf.keras.utils.Sequence class. I changed my approach, and this is the class working as a generator:
import numpy as np
import tensorflow as tf

class BatchGenerator(tf.keras.utils.Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one full batch of inputs and one-hot encoded targets
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, to_categorical(batch_y, num_labels)
Where to_categorical is the function:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            # One-hot encode each token label
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)
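For what it's worth, the same one-hot encoding can be done without Python loops by indexing into an identity matrix. This is just an equivalent NumPy sketch (to_categorical_fast is my name for it, not part of the post), assuming sequences is an integer array of shape (N, seq_len) with labels in range(categories):

```python
import numpy as np

def to_categorical_fast(sequences, categories):
    # Each integer label k selects row k of the identity matrix,
    # which is exactly the one-hot vector for k.
    return np.eye(categories)[np.asarray(sequences)]

labels = np.array([[0, 2], [1, 0]])
one_hot = to_categorical_fast(labels, 3)
print(one_hot.shape)  # (2, 2, 3)
```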
What I notice now is a good performance boost: each epoch takes about half the time it did before.
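The slicing in __getitem__ can also be sanity-checked in isolation (a sketch with dummy data; the sizes here are arbitrary). Note that the last batch is allowed to be smaller than batch_size, which is why __len__ uses np.ceil:

```python
import numpy as np

# Same slicing as BatchGenerator.__getitem__, checked with dummy data
x_set = np.zeros((10, 50), dtype=int)   # 10 sequences of 50 tokens
batch_size = 4
n_batches = int(np.ceil(len(x_set) / float(batch_size)))
print(n_batches)  # 3 batches: sizes 4, 4 and 2

idx = 2
batch_x = x_set[idx * batch_size:(idx + 1) * batch_size]
print(batch_x.shape)  # (2, 50) -- the last, partial batch
```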