keras, deep-learning, time-series, lstm

Training loss is NaN in Keras LSTM


I have run this code in Google Colab with a GPU to create a multilayer LSTM. It is for time series prediction.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.optimizers import SGD
model = Sequential()
model.add(LSTM(units=50, activation='relu', return_sequences=True,
               input_shape=(1, len(FeaturesDataFrame.columns))))
model.add(Dropout(0.2))
model.add(LSTM(3, return_sequences=False))
model.add(Dense(1))
opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=5.0)
model.compile(loss='mean_squared_error', optimizer=opt)

Note that I have used gradient clipping. But still, when I train this model, it returns nan as the training loss:

history = model.fit(X_t_reshaped, train_labels, epochs=20, batch_size=96, verbose=2)

This is the result:

Epoch 1/20
316/316 - 2s - loss: nan 
Epoch 2/20
316/316 - 1s - loss: nan 
Epoch 3/20
316/316 - 1s - loss: nan
Epoch 4/20
316/316 - 1s - loss: nan
Epoch 5/20
316/316 - 1s - loss: nan
Epoch 6/20
316/316 - 1s - loss: nan
Epoch 7/20
316/316 - 1s - loss: nan 
Epoch 8/20
316/316 - 1s - loss: nan 
Epoch 9/20
316/316 - 1s - loss: nan 
Epoch 10/20 
316/316 - 1s - loss: nan
Epoch 11/20
316/316 - 1s - loss: nan
Epoch 12/20
316/316 - 1s - loss: nan
Epoch 13/20
316/316 - 1s - loss: nan
Epoch 14/20
316/316 - 1s - loss: nan
Epoch 15/20
316/316 - 1s - loss: nan 
Epoch 16/20
316/316 - 1s - loss: nan
Epoch 17/20
316/316 - 1s - loss: nan
Epoch 18/20
316/316 - 1s - loss: nan
Epoch 19/20
316/316 - 1s - loss: nan
Epoch 20/20
316/316 - 1s - loss: nan

Solution

  • I'm more familiar with PyTorch than Keras. However, there are still a few things I would recommend (a quick sketch of these checks follows the list):

    1. Check your data. Ensure that there are no missing or null values in the data that you pass into your model. This is the most likely culprit. A single null value will cause the loss to be NaN.

    2. You could try lowering the learning rate (0.001 or something even smaller) and/or removing gradient clipping. I've actually had gradient clipping be the cause of NaN loss before.

    3. Try scaling your data (though unscaled data will usually cause infinite losses rather than NaN losses). Use StandardScaler or one of the other scalers in sklearn.

    If all of that fails, I'd try passing some very simple dummy data into the model and seeing whether the problem persists. Then you will know whether it is a code problem or a data problem. Hope this helps, and feel free to ask questions if you have them.
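
    Here is a minimal sketch of those three checks, assuming X_t_reshaped is a NumPy array with the 3-D shape (samples, 1, n_features) implied by the input_shape above, and train_labels is a NumPy array as well:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from tensorflow.keras.optimizers import SGD

    # 1. Look for NaN/inf in the inputs and labels -- a single bad value
    #    is enough to make every batch loss NaN.
    print(np.isnan(X_t_reshaped).any(), np.isinf(X_t_reshaped).any())
    print(np.isnan(train_labels).any(), np.isinf(train_labels).any())

    # 2. Scale the features. StandardScaler expects 2-D data, so flatten
    #    the (samples, 1, n_features) tensor, scale, and restore the shape.
    n_samples, timesteps, n_features = X_t_reshaped.shape
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_t_reshaped.reshape(-1, n_features))
    X_scaled = X_scaled.reshape(n_samples, timesteps, n_features)

    # 3. Recompile with a smaller learning rate and without gradient
    #    clipping to rule those out as the cause.
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(loss='mean_squared_error', optimizer=opt)
    history = model.fit(X_scaled, train_labels, epochs=20, batch_size=96, verbose=2)

    If the NaN checks print True, clean or impute those rows before training; if the loss only becomes finite after rescaling or lowering the learning rate, that tells you which of the three was the culprit.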