I'm working on a prediction model in Keras using LSTM layers, but I'm having issues understanding how to format my input data (the model also returns nan depending on the input format).
I will try to give a clear explanation!
Let's call X my input data and y my output data. Both are arrays of the same shape:
X.shape = (20, 1001)
y.shape = (20, 1001)
So basically 20 samples of 1001 values each.
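To make this concrete, arrays of that shape can be mocked up with random placeholder values (these are only stand-ins for my real series):
import numpy as np

# placeholder data with the same shapes as my real arrays;
# the random values just stand in for the actual series
X = np.random.rand(20, 1001).astype('float32')
y = np.random.rand(20, 1001).astype('float32')
print(X.shape, y.shape)  # (20, 1001) (20, 1001)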
Now let's define a simple LSTM model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=XXX))
model.add(Dense(y.shape[1]))
The input shape (XXX) is left blank on purpose for now.
The Dense layer produces the output, so its size is the length of one sample (1001).
The input expected by an LSTM layer is 3D: (samples, timesteps, features)
Thus my input data can be reshaped, either considering each sample as 1 timestep with 1001 features
X1 = X.reshape((X.shape[0], 1, X.shape[1]))
which would give an input_shape in the layer definition of
input_shape=(1, X.shape[1])
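Putting the first option together end to end (a minimal sketch; the adam optimizer, mse loss, epochs and batch size are just placeholder choices, not requirements):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

X1 = X.reshape((X.shape[0], 1, X.shape[1]))  # (20, 1, 1001), as above
model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=(1, X.shape[1])))
model.add(Dense(y.shape[1]))
model.compile(optimizer='adam', loss='mse')
model.fit(X1, y, epochs=10, batch_size=4)  # this configuration trains without nan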
Or considering 1001 timesteps with 1 feature
X2 = X.reshape((X.shape[0], X.shape[1], 1))
and the corresponding input_shape
input_shape=(X.shape[1], 1)
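And the same sketch for the second option, which is the one that blows up for me:
X2 = X.reshape((X.shape[0], X.shape[1], 1))  # (20, 1001, 1), as above
model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1]))
model.compile(optimizer='adam', loss='mse')
model.fit(X2, y, epochs=10, batch_size=4)  # loss becomes nan within a few batches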
The first configuration (1 timestep, 1001 features) works.
The second configuration (1001 timesteps, 1 feature) does not: training very quickly returns nan values.
My question is: why does the second solution return nan values? (I'm guessing I might have misunderstood what timesteps mean.)
Conceptually, considering my input array as one feature over several timesteps makes more sense to me than the opposite.
In the best-case scenario, I would just try both and keep whichever gives the best results.
One more thing! An LSTM is supposed to keep some memory (treating the values as having a reading direction, which could then also be read in the opposite direction with a Bidirectional wrapper layer). If the data are presented as one timestep with several features, does this "directionality" of the array still have a meaning?
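For reference, this is the kind of Bidirectional wrapper I mean (only illustrating the API, not my actual model):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional

model = Sequential()
model.add(Bidirectional(LSTM(32, activation='relu'),
                        input_shape=(X.shape[1], 1)))  # the sequence is read in both directions
model.add(Dense(y.shape[1]))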
Sorry for the long post, hoping someone can bring some light onto this!
A time series of 1001 steps is too long: unrolling an LSTM over that many timesteps gives you either vanishing or exploding gradients. In your case it is exploding gradients, which make your values too big until they overflow to nan.
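If you still want to feed the full 1001-step sequences, one common mitigation (a standard remedy for exploding gradients, not something specific to your model) is to clip the gradients in the optimizer, for example with Keras' clipnorm:
from tensorflow.keras.optimizers import Adam

# clipnorm rescales any gradient whose L2 norm exceeds 1.0,
# which keeps exploding gradients from driving the weights to nan
model.compile(optimizer=Adam(clipnorm=1.0), loss='mse')
Alternatively, split the long series into shorter windows so the LSTM is unrolled over fewer timesteps.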