matrix machine-learning keras lstm dimension

Understanding input_shape parameter in LSTM with Keras

I'm trying to use the example described in the Keras documentation named "Stacked LSTM for sequence classification" (see code below) and can't figure out the input_shape parameter in the context of my data.

My input is a matrix of sequences over an alphabet of 25 possible characters, encoded as integers and padded to a maximum length of 31. As a result, my x_train has the shape (1085420, 31), meaning (n_observations, sequence_length).

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

data_dim = 16
timesteps = 8
num_classes = 10

# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32))  # returns a single vector of dimension 32
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, num_classes))

# Generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, num_classes))

model.fit(x_train, y_train,
          batch_size=64, epochs=5,
          validation_data=(x_val, y_val))

In this code, x_train has the shape (1000, 8, 16), i.e. an array of 1000 arrays of 8 arrays of 16 elements. This is where I get completely lost about what each dimension means and how my data can be brought into this shape.

Looking at the Keras docs and various tutorials and Q&As, it seems I'm missing something obvious. Can someone give me a hint about what to look for?

Thanks for your help!

Solution

The data fed to an LSTM should have shape (nb_of_samples, seq_len, features). In your case, since each feature vector consists of a single integer, you should reshape your x_train to (1085420, 31, 1), which corresponds to input_shape=(31, 1). However, raw integer codes are not a representation well suited to neural networks, so you should either:

Change your representation to one-hot encoding, in which case your input should have shape (1085420, 31, 25).
Use an Embedding layer and keep the (1085420, 31) shape.
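
To make the shapes concrete, here is a minimal NumPy sketch of the plain reshape and the one-hot option. It uses a tiny random stand-in for the real array, and it assumes the characters are encoded as integers 0..24; if 0 is reserved for padding and the characters are 1..25, num_chars would need to be 26 instead. The variable names are only illustrative.

```python
import numpy as np

# Tiny stand-in for the real (1085420, 31) array; these sizes are assumptions.
n_observations = 4
sequence_length = 31
num_chars = 25  # assumes integer codes 0..24

rng = np.random.default_rng(0)
x_train = rng.integers(0, num_chars, size=(n_observations, sequence_length))

# Raw integers as one feature per timestep: (samples, timesteps, 1),
# which would match input_shape=(sequence_length, 1).
x_reshaped = x_train.reshape(n_observations, sequence_length, 1)

# One-hot encoding: index an identity matrix with the integer codes,
# giving (samples, timesteps, num_chars), i.e. input_shape=(sequence_length, num_chars).
x_one_hot = np.eye(num_chars, dtype=np.float32)[x_train]

print(x_reshaped.shape)  # (4, 31, 1)
print(x_one_hot.shape)   # (4, 31, 25)
```

For the Embedding option no reshaping is needed: x_train stays (n_observations, 31), and the model would start with something like Embedding(input_dim=num_chars, output_dim=8) ahead of the LSTM layers, where the output size of 8 is an arbitrary choice.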