python pandas lstm recurrent-neural-network

Training an RNN/LSTM model raises a KeyError equal to the value of length


I am trying to train this model:

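# Imports assumed from context (the original post omits them)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)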

length = 60
n_features = X_train_s.shape[1]
batch_size = 1

early_stop = EarlyStopping(monitor = 'val_accuracy', mode = 'max', verbose = 1, patience = 5)

generator = TimeseriesGenerator(data = X_train_s, 
                                targets = Y_train[['TARGET_KEEP_LONG', 
                                                   'TARGET_KEEP_SHORT', 
                                                   'TARGET_STAY_FLAT']], 
                                length = length, 
                                batch_size = batch_size)


RNN_model = Sequential()
RNN_model.add(LSTM(180, activation = 'relu', input_shape = (length, n_features)))
RNN_model.add(Dense(3))
RNN_model.compile(optimizer = 'adam', loss = 'binary_crossentropy')

validation_generator = TimeseriesGenerator(data = X_test_s, 
                                           targets = Y_test[['TARGET_KEEP_LONG', 
                                                             'TARGET_KEEP_SHORT', 
                                                             'TARGET_STAY_FLAT']], 
                                           length = length, 
                                           batch_size = batch_size)


RNN_model.fit(generator, 
              epochs=20, 
              validation_data = validation_generator,
              callbacks = [early_stop])

I get the error "KeyError: 60", where 60 is the value of the variable "length" (if I change it, the number in the error changes accordingly).

The shape of the scaled test set is

X_test_s.shape
(114125, 89)

X_train_s.shape is the same, and n_features == 89.


Solution

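  • It was exhausting to find the cause because the error message is poor and misleading. The trouble was the form of the target data: TimeseriesGenerator does not handle a pandas DataFrame as targets, only NumPy arrays. The generator appears to index targets by integer position, starting at length, and on a DataFrame an integer in square brackets is a column lookup, hence "KeyError: 60". Therefore this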

    generator = TimeseriesGenerator(data = X_train_s,
                                    targets = Y_train[['TARGET_KEEP_LONG', 'TARGET_KEEP_SHORT', 'TARGET_STAY_FLAT']],
                                    length = length, batch_size = batch_size)
    

    should have been written as

    generator = TimeseriesGenerator(data = X_train_s,
                                    targets = Y_train[['TARGET_KEEP_LONG', 'TARGET_KEEP_SHORT', 'TARGET_STAY_FLAT']].to_numpy(),
                                    length = length, batch_size = batch_size)
    

    In the case of just one target, this was enough:

     generator = TimeseriesGenerator(data = X_train_s, targets = Y_train['TARGET_KEEP_LONG'], length = length, batch_size = batch_size) 
    

    with just one level of square brackets, not two, so that targets is a pandas Series rather than a DataFrame.
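
The difference between the three forms of targets can be reproduced without Keras. The following is a minimal sketch, assuming the generator looks each target up as targets[row] with row starting at length; the frame targets_df is a hypothetical stand-in for Y_train, filled with zeros purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Y_train: 100 rows, three target columns.
targets_df = pd.DataFrame(
    np.zeros((100, 3), dtype=int),
    columns=["TARGET_KEEP_LONG", "TARGET_KEEP_SHORT", "TARGET_STAY_FLAT"],
)
length = 60

# On a DataFrame, an integer in square brackets is a column lookup.
# There is no column named 60, so pandas raises KeyError.
try:
    targets_df[length]
    error_message = None
except KeyError as exc:
    error_message = repr(exc)

# On the NumPy array, the same index selects row 60 (all three targets).
targets_np = targets_df.to_numpy()
row = targets_np[length]

# A single column is a Series. With the default RangeIndex, the label 60
# exists and coincides with position 60, so the lookup succeeds.
single = targets_df["TARGET_KEEP_LONG"][length]

print(error_message)
print(row.shape)
```

Note that the single-column case relies on the Series index containing the integer labels the generator asks for. If Y_train has been shuffled, filtered, or carries a date index, the lookup may return the wrong row or fail, so converting with .to_numpy() is probably the safer choice in both cases.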