
Variable-length input for LSTM autoencoder - Keras


I am trying an autoencoder model with LSTM layers in Keras for text outlier detection. I have encoded every sentence into a sequence of numbers, with each number representing a letter.

So far I have already trained a model with fixed-length input, by zero-padding each of the 4000 sequences up to maxlength = 40, thus training the model on an array of shape [4000, 40, 1] ([batch_size, timesteps, features]).
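
For reference, that zero-padding step can be done with Keras's pad_sequences helper; a minimal sketch, where `sequences` stands in for the raw list of encoded sentences (illustrative name):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# `sequences`: list of 4000 integer-encoded sentences of varying length
padded = pad_sequences(sequences, maxlen=40, padding='post', value=0)  # -> (4000, 40)
x_train = padded.reshape((len(padded), 40, 1)).astype('float32')       # -> (4000, 40, 1)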

Now I am wondering how I can use such an autoencoder model without zero-padding each sequence, i.e. training and predicting on the actual length of each sentence.

At the moment I have standardized every sequence, so my training data (x_train) is a list of arrays, where each array represents a standardized sequence of numbers of a different length.

To feed this data into the LSTM model I am trying to reshape it into a 3D array with:

x_train=np.reshape(x_train, (len(x_train), 1, 1))

I am not sure this is correct, though.

My model looks like this (I've removed the input_shape parameter so the model can accept variable-length input):


from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(20, activation='tanh', return_sequences=True))  # encoder
model.add(LSTM(15, activation='tanh', return_sequences=True))
model.add(LSTM(5, activation='tanh', return_sequences=True))   # bottleneck
model.add(LSTM(15, activation='tanh', return_sequences=True))  # decoder
model.add(LSTM(20, activation='tanh', return_sequences=True))
model.add(Dense(1, activation='tanh'))

Then, when trying to compile and train the model:

import keras
from keras.callbacks import ModelCheckpoint

nb_epoch = 10
model.compile(optimizer='rmsprop', loss='mse')

# save the best model (by validation loss) seen during training
checkpointer = ModelCheckpoint(filepath="text_model.h5",
                               verbose=0,
                               save_best_only=True)

# stop early once the validation loss stops improving
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss')

history = model.fit(x_train, x_train,
                    epochs=nb_epoch,
                    shuffle=True,
                    validation_data=(x_test, x_test),
                    verbose=0,
                    callbacks=[checkpointer, es_callback])

I get the error: "ValueError: setting an array element with a sequence."
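
That error typically arises when NumPy (here, inside Keras) tries to turn a ragged list into a rectangular numeric array; a tiny reproduction with made-up values:

import numpy as np

ragged = [np.array([0.1, 0.5, 0.2]), np.array([0.3, 0.9])]  # two different lengths
np.array(ragged, dtype='float32')  # ValueError: setting an array element with a sequence.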

My model summary is the following:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_6 (LSTM)                (None, 1, 20)             1760      
_________________________________________________________________
lstm_7 (LSTM)                (None, 1, 15)             2160      
_________________________________________________________________
lstm_8 (LSTM)                (None, 1, 5)              420       
_________________________________________________________________
lstm_9 (LSTM)                (None, 1, 15)             1260      
_________________________________________________________________
lstm_10 (LSTM)               (None, 1, 20)             2880      
_________________________________________________________________
dense_2 (Dense)              (None, 1, 1)              21        
=================================================================
Total params: 8,501
Trainable params: 8,501
Non-trainable params: 0
_________________________________________________________________

So my question is whether it is possible to train and predict with variable-length input sequences in an LSTM autoencoder model.

And whether my thinking on text outlier detection using such a model architecture is correct.


Solution

  • Padding still has to be done so that the input can be a 3D array (tensor), but Keras provides a Masking layer that tells downstream layers to ignore the padded 0s in the input tensor, so the model is not affected by the padding.

    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Masking

    model = Sequential()
    # with maxlength = 40 and one feature per timestep, input_shape=(40, 1);
    # timesteps whose value equals mask_value (0.0) are skipped downstream
    model.add(Masking(mask_value=0.0, input_shape=(timesteps, features)))
    model.add(LSTM(20, activation='tanh', return_sequences=True))
    model.add(LSTM(15, activation='tanh', return_sequences=True))
    model.add(LSTM(5, activation='tanh', return_sequences=True))
    model.add(LSTM(15, activation='tanh', return_sequences=True))
    model.add(LSTM(20, activation='tanh', return_sequences=True))
    model.add(Dense(1, activation='tanh'))
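
  • Putting this together, training then proceeds on the zero-padded x_train exactly as before, and the reconstruction error (a candidate outlier score) can be computed over each sequence's true length only. A sketch, assuming `sequences` is the original list of variable-length encoded sentences (illustrative name, not from the original post):

    import numpy as np

    model.compile(optimizer='rmsprop', loss='mse')
    history = model.fit(x_train, x_train, epochs=10, shuffle=True, verbose=0)

    # score each sentence by its mean squared reconstruction error on the
    # unpadded prefix; unusually large errors suggest potential outliers
    recon = model.predict(x_train)
    lengths = [len(s) for s in sequences]
    errors = [np.mean((recon[i, :n, 0] - x_train[i, :n, 0]) ** 2)
              for i, n in enumerate(lengths)]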