machine-learning neural-network lstm normalization recurrent-neural-network

In an LSTM, should normalization be done before or after the split into train and test sets?

Usually, when using a NN, I do the normalization in this form:

scaler = StandardScaler()
train_X = scaler.fit_transform( train_X )
test_X = scaler.transform( test_X )

That is, I normalize after the split, so that nothing leaks from the test set into the train set. But I am having doubts about this when using an LSTM.

Imagine that the last sequence in my train set is X = [x6, x7, x8], Y = [x9].

Then, my first sequence in the test set should be X = [x7, x8, x9], Y = [x10].

So, does it make sense to normalize the data after splitting if the X of the test set ends up mixing values from the two sets? Or should I normalize the entire dataset first, with

scaler = StandardScaler()
data = scaler.fit_transform( data )

and then do the split?

Solution

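The normalization procedure in your first snippet (fit the scaler on the training set only, then use that fitted scaler to transform the test set) is the only correct approach for every machine learning problem, and LSTMs are by no means an exception.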

When it comes to dilemmas like this one, there is a general rule of thumb that can help clear up the confusion:

During the whole model building process (including all necessary preprocessing), pretend that you have no access at all to any test set before it comes to using this test set to assess your model performance.

In other words, pretend that your test set comes only after having deployed your model and it starts receiving data completely new and unseen until then.

So conceptually, it may be helpful to move the third line of your first code snippet (the test-set transform) to the very end, i.e.:

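from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps chronological order, which sequence data needs
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
### FORGET X_test from this point on...

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)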

# further preprocessing, feature selection etc...

# model building & fitting...

model.fit(X_train, y_train)

# X_test just comes in:

X_test = scaler.transform(X_test)
model.predict(X_test)
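
For the windowed case in the question, a minimal sketch of the same idea might look like the following. It assumes a univariate toy series x1..x12, a window length of 3, and that x1..x9 are the training observations. The helper `make_windows` and the names `n_steps` and `n_train` are hypothetical and exist only for this illustration. The scaler is fit on the training observations alone, and those training statistics are then applied to the whole series before the windows are built:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def make_windows(series, n_steps):
    """Return (X, y): each row of X holds n_steps consecutive values, y is the next value."""
    X = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
    y = series[n_steps:]
    return X, y


# Hypothetical toy series x1..x12 (values chosen only for illustration).
series = np.arange(1.0, 13.0)
n_steps = 3   # window length, as in X = [x6, x7, x8]
n_train = 9   # x1..x9 are training observations; x10..x12 are test targets

# Fit the scaler on the training observations only.
scaler = StandardScaler()
scaler.fit(series[:n_train].reshape(-1, 1))

# Apply the training statistics to the whole series, then build the windows.
scaled = scaler.transform(series.reshape(-1, 1)).ravel()
X_all, y_all = make_windows(scaled, n_steps)

# Window i predicts series[i + n_steps]; it is a training window
# when that target lies inside the training observations.
n_train_windows = n_train - n_steps
X_train, y_train = X_all[:n_train_windows], y_all[:n_train_windows]
X_test, y_test = X_all[n_train_windows:], y_all[n_train_windows:]

# Recurrent layers usually expect input shaped (samples, timesteps, features).
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]
```

Under these assumptions, the last training window is [x6, x7, x8] -> x9 and the first test window is [x7, x8, x9] -> x10, exactly as in the question. The values x7, x8 and x9 in that test window are training observations scaled with training statistics, so nothing from the test set has influenced the scaler.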