Tags: python, tensorflow, keras, lstm, autoencoder

How to reshape text data to be suitable for an LSTM model in Keras


Update 1:

The code I'm referring to is exactly the code in the book, which you can find here.

The only thing is that I don't want to have embed_size in the decoder part. That's why I think I don't need an embedding layer at all, because if I use an embedding layer, I need to have embed_size in the decoder part (please correct me if I'm wrong).

Overall, I'm trying to adapt the same code without using the embedding layer, because I need to have vocab_size in the decoder part.

I think the suggestion provided in the comment could be correct (using one-hot encoding); however, I faced this error:

When I did the one-hot encoding:

tf.keras.backend.one_hot(indices=sent_wids, classes=vocab_size)

I received this error:

in check_num_samples: you should specify the steps_name argument

ValueError: If your data is in the form of symbolic tensors, you should specify the steps_per_epoch argument (instead of the batch_size argument, because symbolic tensors are expected to produce batches of input data)
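For context, here is a minimal sketch (with toy sizes I made up, not taken from the question) of what that backend call returns: a TensorFlow tensor rather than a NumPy array, which is why fit() then complains about steps_per_epoch when it is handed symbolic input together with batch_size. Note that the keyword in current versions is num_classes:

import numpy as np
import tensorflow as tf

# Toy stand-ins for the real data (assumed values)
vocab_size = 50
sent_wids = np.random.randint(0, vocab_size, size=(4, 6))   # (samples, sequence_len) word ids

# The result is a TF tensor, not a NumPy array, so passing it straight to
# model.fit(..., batch_size=...) is what triggers the "symbolic tensors" ValueError above.
one_hot = tf.keras.backend.one_hot(indices=sent_wids, num_classes=vocab_size)
print(type(one_hot), one_hot.shape)   # tensor of shape (4, 6, 50)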

The way that I have prepared the data is like this:

The shape of sent_lens is (87716, 200), and I want to reshape it so I can feed it into the LSTM. Here, 200 stands for the sequence length and 87716 is the number of samples I have.
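To make the target shape concrete, here is a small sketch (toy sizes and my own variable names, not the real data) of turning an array of word ids with shape (samples, sequence_len) into the (samples, sequence_len, vocab_size) one-hot array that the model below expects as Xtrain:

import numpy as np

# Toy dimensions standing in for (87716, 200) and the real vocabulary size
num_samples, sequence_len, vocab_size = 8, 10, 30
sent_wids = np.random.randint(0, vocab_size, size=(num_samples, sequence_len))

# Indexing an identity matrix one-hot encodes the integer ids: (8, 10) -> (8, 10, 30)
Xtrain = np.eye(vocab_size, dtype="float32")[sent_wids]
print(Xtrain.shape)   # (8, 10, 30) == (samples, SEQUENCE_LEN, VOCAB_SIZE)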

Below is the code for the LSTM autoencoder:

from tensorflow.keras.layers import Input, LSTM, Bidirectional, RepeatVector
from tensorflow.keras.models import Model

# Input is one-hot encoded text: (batch, SEQUENCE_LEN, VOCAB_SIZE)
inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
# Encoder: summed bidirectional LSTM -> (batch, LATENT_SIZE)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
# Repeat the latent vector for each timestep -> (batch, SEQUENCE_LEN, LATENT_SIZE)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
# Decoder LSTM reconstructs the one-hot sequence -> (batch, SEQUENCE_LEN, VOCAB_SIZE)
decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss="mse")
autoencoder.summary()
history = autoencoder.fit(Xtrain, Xtrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS)

Do I still need to do anything extra? If not, why can't I get this to work?

Please let me know which part is not clear and I will explain.

Thanks for your help:)


Solution

  • So, as said in the comments, it turns out I just needed to do one-hot encoding.

    When I did the one-hot encoding using tf.keras.backend, it threw the error that I have added as an update to my question.

    Then I tried to_categorical(sent_wids, num_classes=VOCAB_SIZE) and it fixed the problem (however, I then faced a memory error :D, which is a different story; see the batch-wise sketch after this answer)!

    I should also mention that I tried sparse_categorical_crossentropy instead of one-hot encoding, but it did not work!

    Thank you for all your help:)
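As a follow-up to the memory error mentioned above, here is a rough sketch (my own workaround idea, not part of the accepted fix) of one-hot encoding per batch with a plain Python generator, instead of materialising the full (87716, 200, VOCAB_SIZE) array at once:

import numpy as np
from tensorflow.keras.utils import to_categorical

def one_hot_batches(sent_wids, vocab_size, batch_size):
    # Yield (input, target) batches forever, as Keras expects from a generator;
    # for an autoencoder the target is the input itself.
    while True:
        for start in range(0, len(sent_wids), batch_size):
            batch = to_categorical(sent_wids[start:start + batch_size], num_classes=vocab_size)
            yield batch, batch

# Usage sketch, assuming the autoencoder from the question is already built:
# steps = int(np.ceil(len(sent_wids) / BATCH_SIZE))
# autoencoder.fit(one_hot_batches(sent_wids, VOCAB_SIZE, BATCH_SIZE),
#                 steps_per_epoch=steps, epochs=NUM_EPOCHS)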