Tags: python, keras, neural-network

Problem using fit_generator during keras training


I'm working with very large text datasets. I thought about using the model.fit_generator method instead of plain model.fit, so I tried this generator:

def TrainGenerator(inp, out):
    # yields a single (sequence, labels) pair per step -- no batching
    for i, o in zip(inp, out):
        yield i, o

When I try to use it during training with:

# inp_train, out_train are lists of sequences padded to 50 tokens
model.fit_generator(generator=TrainGenerator(inp_train, out_train),
                    steps_per_epoch=BATCH_SIZE * 100,
                    epochs=20,
                    use_multiprocessing=True)

I get:

ValueError: Error when checking input: expected embedding_input to have shape (50,) but got array with shape (1,)

Now, I tried the plain model.fit method and it works fine, so I think the problem is in my generator. But since I'm new to generators, I don't know how to solve it. The full model summary is:

Layer (type)                 Output Shape            
===========================================
Embedding (Embedding)      (None, 50, 400)           
___________________________________________
Bi_LSTM_1 (Bidirectional)  (None, 50, 1024)          
___________________________________________
Bi_LSTM_2 (Bidirectional)  (None, 50, 1024)          
___________________________________________
Output (Dense)             (None, 50, 153)           
===========================================
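To see the mismatch directly, you can iterate the generator by hand. A minimal sketch (the zero arrays below are stand-ins for the real padded data):

```python
import numpy as np

def TrainGenerator(inp, out):
    for i, o in zip(inp, out):
        yield i, o

# stand-ins for the real data: 100 sequences, each padded to 50 tokens
inp_train = np.zeros((100, 50), dtype=int)
out_train = np.zeros((100, 50), dtype=int)

x, y = next(TrainGenerator(inp_train, out_train))
print(x.shape)  # (50,): one sequence, not a (batch_size, 50) batch
```

fit_generator expects each yield to be a whole batch of shape (batch_size, 50), which is why the Embedding layer complains.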

Solution

  • At first I added the solution as an edit to the question; for the sake of clarity, I post it here as an answer:

    The first comment tipped me off: I realized that I had misunderstood how generators work. My generator was yielding a single sequence of shape (50,), not a batch of N sequences of shape (50,). So I dug into the Keras documentation and found the tf.keras.utils.Sequence class. I changed my approach, and this is the class that now works as the generator:

    import numpy as np
    import tensorflow as tf

    class BatchGenerator(tf.keras.utils.Sequence):

        def __init__(self, x_set, y_set, batch_size):
            self.x, self.y = x_set, y_set
            self.batch_size = batch_size

        def __len__(self):
            # number of batches per epoch, rounding up so the tail isn't dropped
            return int(np.ceil(len(self.x) / float(self.batch_size)))

        def __getitem__(self, idx):
            # return the idx-th batch as (inputs, one-hot targets)
            batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
            batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]

            return batch_x, to_categorical(batch_y, num_labels)
    

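    The `__len__`/`__getitem__` slicing arithmetic can be checked without TensorFlow. A NumPy-only sketch with illustrative numbers (105 samples, batch size 32):

```python
import numpy as np

x = np.arange(105 * 50).reshape(105, 50)  # 105 sequences of 50 tokens
batch_size = 32

# __len__: number of batches, rounding up so the tail isn't dropped
n_batches = int(np.ceil(len(x) / float(batch_size)))
print(n_batches)  # 4

# __getitem__ for the last index: a partial batch of the remaining samples
last = x[3 * batch_size:(3 + 1) * batch_size]
print(last.shape)  # (9, 50)
```

    Each call thus delivers a proper (batch_size, 50) array, which is what the Embedding layer expects.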
    Here to_categorical is the following helper function:

    def to_categorical(sequences, categories):
        # one-hot encode token labels: (batch, timesteps) -> (batch, timesteps, categories)
        cat_sequences = []
        for s in sequences:
            cats = []
            for item in s:
                cats.append(np.zeros(categories))
                cats[-1][item] = 1.0
            cat_sequences.append(cats)
        return np.array(cat_sequences)
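    As a quick sanity check of the expected output shape, here is an equivalent np.eye-based version (redeclared so the snippet runs standalone; the tiny label sequences are made up):

```python
import numpy as np

def to_categorical(sequences, categories):
    # rows of the identity matrix are exactly the one-hot vectors
    return np.array([[np.eye(categories)[item] for item in s] for s in sequences])

encoded = to_categorical([[0, 2], [1, 1]], 3)
print(encoded.shape)  # (2, 2, 3): (batch, timesteps, categories)
print(encoded[0, 1])  # [0. 0. 1.]
```

    With 50-token sequences and 153 labels, this matches the (None, 50, 153) shape of the Dense output layer.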
    

    As a bonus, I also noticed a good performance boost: each epoch now takes half as long as before.