I want to build a sequence-to-sequence autoencoder in Keras. The purpose is to get a "doc2vec"-style vector for each document.
On the Keras blog, I found an example: https://blog.keras.io/building-autoencoders-in-keras.html
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
What if I need to add an embedding layer to this? If we are dealing with a paragraph of text, I suppose we should first tokenize the text and embed it with pre-trained vectors, right?
Do I need a Dense or TimeDistributed(Dense) layer in the decoder? Do I need to reverse the order of the sequence?
Thanks in advance.
The Embedding layer can only be used as the first layer in a model, as the documentation states, so something like this:
inputs = Input(shape=(timesteps,))  # integer token ids, not feature vectors
embedded = Embedding(vocab_size, embedding_size, mask_zero=True, ...)(inputs)
encoded = LSTM(latent_dim)(embedded)
Should we first tokenize the text and embed it with pre-trained vectors? Yes, this is the default option. You only train your own embeddings if you have a sufficiently large corpus; otherwise pre-trained GloVe vectors are often used. There is a Keras example that uses GloVe together with the built-in Tokenizer to pass text into a model with an Embedding layer.
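A minimal sketch of how such an embedding matrix can be assembled. The toy corpus and hand-written vectors below stand in for real data and a GloVe file, and the hand-rolled word index stands in for Keras' Tokenizer (which reserves index 0 for padding in the same way):

```python
import numpy as np
from collections import Counter

# Toy stand-ins for a real corpus and a real GloVe file
texts = ["the cat sat on the mat", "the dog sat on the log"]
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "dog": [0.5, 0.6]}
embedding_size = 2
vocab_size = 10

# Index words by frequency, reserving 0 for padding (as the Tokenizer does)
counts = Counter(w for t in texts for w in t.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Rows line up with the integer ids fed to the Embedding layer;
# words without a pretrained vector stay all-zero
embedding_matrix = np.zeros((vocab_size, embedding_size))
for word, i in word_index.items():
    if i < vocab_size and word in pretrained:
        embedding_matrix[i] = pretrained[word]
```

The matrix can then be handed to the layer, e.g. `Embedding(vocab_size, embedding_size, weights=[embedding_matrix], trainable=False, mask_zero=True)` in Keras 2.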
For decoding, you will need a Dense layer, but using TimeDistributed is optional in Keras 2. By default, Dense will apply its kernel to every timestep of the 3D tensor you pass it:
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)
# (batch_size, timesteps, vocab_size)
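To see what "applies the kernel to every timestep" means, here is a small NumPy sketch (all shapes are made up) showing that Dense on a 3D tensor is the same matrix multiply repeated at each timestep:

```python
import numpy as np

# Toy shapes: batch_size=2, timesteps=3, features=4, vocab_size=5
x = np.random.rand(2, 3, 4)    # LSTM output: (batch, timesteps, features)
kernel = np.random.rand(4, 5)  # Dense kernel: (features, vocab_size)
bias = np.zeros(5)

# Dense on a 3D tensor: the kernel contracts the last axis only
out = x @ kernel + bias        # shape (2, 3, 5)

# Equivalent to applying the same Dense at every timestep in a loop
looped = np.stack([x[:, t, :] @ kernel + bias for t in range(3)], axis=1)
assert np.allclose(out, looped)
```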
It's worth noting that limiting the vocabulary to the top N most frequent words will speed up training; otherwise the softmax over the full vocabulary becomes very costly to compute. The Keras example also keeps a limited number of words and maps every other word to a special UNKnown token.
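A quick sketch of that UNK mapping (the ids below are made up; Keras' Tokenizer handles this for you via its num_words and oov_token arguments):

```python
vocab_size = 5   # keep only the top-N ids
UNK = vocab_size - 1  # reserve the last kept id for "unknown"
token_ids = [1, 2, 9, 3, 17, 4]

# Any id outside the kept vocabulary collapses to the shared UNK id,
# so the softmax only ever has to cover vocab_size classes
capped = [t if t < UNK else UNK for t in token_ids]
```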