I want to build a sequence-to-sequence autoencoder in Keras. The purpose is to get a "doc2vec"-style vector for each document.
On the Keras blog, I found an example: https://blog.keras.io/building-autoencoders-in-keras.html
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
What if I need to add an embedding layer to this? If we are dealing with a paragraph of text, I suppose we should first tokenize the text and embed it with pre-trained vectors, right?
Do I need a Dense or TimeDistributed(Dense) layer in the decoder? Do I need to reverse the order of the sequence?
Thanks in advance.
The Embedding layer can only be used as the first layer in a model, as the documentation states, so something like this:
inputs = Input(shape=(timesteps,))  # integer token ids, not feature vectors
embedded = Embedding(vocab_size, embedding_size, mask_zero=True, ...)(inputs)
encoded = LSTM(latent_dim)(embedded)
Should we first tokenize the text and embed it with pre-trained vectors? Yes, this is the default option. You only train your own embeddings if you have a sufficiently large corpus; otherwise pre-trained GloVe vectors are often used. There is a Keras example that uses GloVe together with the built-in Tokenizer to pass text into a model with an Embedding layer.
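A minimal sketch of how such an embedding matrix can be assembled. The toy corpus and hand-written vectors below stand in for real data and a GloVe file, and the hand-rolled word index stands in for Keras' Tokenizer (which reserves index 0 for padding in the same way):

```python
import numpy as np
from collections import Counter

# Toy stand-ins for a real corpus and a real GloVe file
texts = ["the cat sat on the mat", "the dog sat on the log"]
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "dog": [0.5, 0.6]}
embedding_size = 2
vocab_size = 10

# Index words by frequency, reserving 0 for padding (as the Tokenizer does)
counts = Counter(w for t in texts for w in t.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Rows line up with the integer ids fed to the Embedding layer;
# words without a pretrained vector stay all-zero
embedding_matrix = np.zeros((vocab_size, embedding_size))
for word, i in word_index.items():
    if i < vocab_size and word in pretrained:
        embedding_matrix[i] = pretrained[word]
```

The matrix can then be handed to the layer, e.g. `Embedding(vocab_size, embedding_size, weights=[embedding_matrix], trainable=False, mask_zero=True)` in Keras 2.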
For decoding, you will need a Dense layer, but using TimeDistributed is optional in Keras 2. By default, Dense will apply its kernel to every timestep of the 3D tensor you pass it:
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)
# (batch_size, timesteps, vocab_size)
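To see what "applies the kernel to every timestep" means, here is a small NumPy sketch (all shapes are made up) showing that Dense on a 3D tensor is the same matrix multiply repeated at each timestep:

```python
import numpy as np

# Toy shapes: batch_size=2, timesteps=3, features=4, vocab_size=5
x = np.random.rand(2, 3, 4)    # LSTM output: (batch, timesteps, features)
kernel = np.random.rand(4, 5)  # Dense kernel: (features, vocab_size)
bias = np.zeros(5)

# Dense on a 3D tensor: the kernel contracts the last axis only
out = x @ kernel + bias        # shape (2, 3, 5)

# Equivalent to applying the same Dense at every timestep in a loop
looped = np.stack([x[:, t, :] @ kernel + bias for t in range(3)], axis=1)
assert np.allclose(out, looped)
```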
It's worth noting that limiting the vocabulary to the top N most frequent words will speed up training; otherwise the softmax over the full vocabulary becomes very costly to compute. The Keras example also keeps a limited number of words and maps every other word to a special UNKnown token.
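A quick sketch of that UNK mapping (the ids below are made up; Keras' Tokenizer handles this for you via its num_words and oov_token arguments):

```python
vocab_size = 5   # keep only the top-N ids
UNK = vocab_size - 1  # reserve the last kept id for "unknown"
token_ids = [1, 2, 9, 3, 17, 4]

# Any id outside the kept vocabulary collapses to the shared UNK id,
# so the softmax only ever has to cover vocab_size classes
capped = [t if t < UNK else UNK for t in token_ids]
```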