Tags: python, tensorflow, keras, lstm, gradient-exploding

LSTM network loss is nan for batch size bigger than one


I am trying to analyse EEG data using an LSTM network. I split the data into 4-second segments, which resulted in around 17000 data samples. To that end, I built the following network:
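For reference, the segmentation is done roughly like the sketch below. The variable names (raw_eeg, fs, segment_seconds) and the 256 Hz sampling rate are placeholders for illustration, not my exact preprocessing code:

import numpy as np

def segment_eeg(raw_eeg, fs=256, segment_seconds=4):
    """Split a (n_samples, n_channels) EEG recording into fixed-length windows."""
    window = fs * segment_seconds                     # samples per 4-second segment
    n_segments = raw_eeg.shape[0] // window           # discard the incomplete tail
    # Result shape: (n_segments, data_length, number_of_channels), i.e. the LSTM input shape
    return raw_eeg[:n_segments * window].reshape(n_segments, window, raw_eeg.shape[1])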

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

def load_model():
    model = Sequential()
    model.add(LSTM(5, recurrent_dropout=0.1, activation="relu",
                   input_shape=(data_length, number_of_channels), return_sequences=True,
                   kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.00001, l2=0.00001)))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=1, activation="sigmoid"))
    # F1_scores, Precision, Sensitivity and Specificity are custom metric functions defined elsewhere.
    model.compile(optimizer=Adam(learning_rate=0.00001, clipvalue=1.5), loss='binary_crossentropy',
                  metrics=['accuracy', F1_scores, Precision, Sensitivity, Specificity], run_eagerly=True)
    return model

When training, the loss goes to nan immediately within the first few batches. To avoid that, I tried adding recurrent dropout, l1/l2 regularizers, gradient clipping, and normal dropout. I also tried changing the learning rate and the batch size. The only thing that worked was setting the recurrent dropout to 0.9 and using low l1 and l2 factors (0.00001); I also had to lower the number of cells in the LSTM layer from the initial 30 to 5. Is there any other way to prevent the loss from becoming nan without dropping so many features and putting such a heavy penalty on the gradient?
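To make it easier to see where things fail, the training call can be sketched like this. X_train, y_train, the epoch count and the validation split are illustrative placeholders rather than my exact setup; the TerminateOnNaN callback just stops a run as soon as the loss turns nan:

from tensorflow.keras.callbacks import TerminateOnNaN

model = load_model()
history = model.fit(
    X_train, y_train,               # placeholders for the segmented EEG data and labels
    batch_size=1,                   # so far the loss only stays finite with batch size 1
    epochs=50,                      # placeholder value
    validation_split=0.2,           # placeholder value
    callbacks=[TerminateOnNaN()],   # abort the run as soon as the loss becomes nan
)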

I am using tensorflow-directml provided by Microsoft, with TensorFlow version 1.15.1 and Keras 2.7.0.


Solution

  • The problem was solved by initializing the kernel of the LSTM layer to small values (a full revised load_model() is sketched after the snippets below). This was accomplished by changing the following line:

    model.add(LSTM(5, recurrent_dropout=0.1, activation="relu",
                   input_shape=(data_length, number_of_channels), return_sequences=True,
                   kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.00001, l2=0.00001)))
    

    To:

    model.add(LSTM(5, recurrent_dropout=0.2,
                   kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.00001, seed=7),
                   activation="relu", input_shape=(data_length, number_of_channels),
                   return_sequences=True))
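
    For completeness, the full load_model() with the fix applied looks roughly like this; the custom metrics and the run_eagerly flag from the question are left out here because the metric definitions are not shown:

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout
    from tensorflow.keras.optimizers import Adam

    def load_model():
        model = Sequential()
        # Very small initial kernel weights keep the relu LSTM activations tiny at the
        # start of training, so the recurrent states cannot blow up into nan.
        model.add(LSTM(5, recurrent_dropout=0.2,
                       kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.00001, seed=7),
                       activation="relu", input_shape=(data_length, number_of_channels),
                       return_sequences=True))
        model.add(Dense(512, activation='relu'))
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(units=1, activation="sigmoid"))
        model.compile(optimizer=Adam(learning_rate=0.00001, clipvalue=1.5),
                      loss='binary_crossentropy', metrics=['accuracy'])
        return model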