Tags: python, keras, recurrent-neural-network, autoencoder

How to build a sequence-to-sequence autoencoder in Keras with an embedding layer?


I want to build a sequence-to-sequence autoencoder in Keras. The purpose is to obtain document vectors ("doc2vec"-style).

In the documentation on the Keras blog, I found an example: https://blog.keras.io/building-autoencoders-in-keras.html

from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

What if I need to add an embedding layer to this? If we are dealing with a paragraph of text, I suppose we should first tokenize the text and embed it with pre-trained vectors, right?

Do I need a Dense or TimeDistributed Dense layer in the decoder? Do I need to reverse the order of the sequence?

Thanks in advance.


Solution

  • The Embedding layer can only be used as the first layer in a model, as the documentation states, so something like this:

    inputs = Input(shape=(timesteps,))  # integer word indices, not feature vectors
    embedded = Embedding(vocab_size, embedding_size, mask_zero=True, ...)(inputs)
    encoded = LSTM(latent_dim)(embedded)
    

    Should we first tokenize the text and embed it with pre-trained vectors? Yes, that is the usual approach. You only train your own embeddings if you have a sufficiently large corpus; otherwise pre-trained GloVe vectors are often used. There is a Keras example that uses GloVe and the built-in Tokenizer to pass text into a model with an Embedding layer. A minimal sketch of that workflow is shown below.
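
    A rough sketch of the GloVe + Tokenizer workflow, assuming a toy corpus in `texts`, a placeholder GloVe file path, and illustrative values for `max_words`, `maxlen`, and `embedding_size` (none of these come from the original example):

    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.layers import Embedding

    texts = ["first document ...", "second document ..."]  # your corpus (assumed)
    max_words = 10000      # keep only the top-N most frequent words (illustrative)
    maxlen = 100           # pad/truncate every document to this length (illustrative)
    embedding_size = 100   # must match the dimensionality of the GloVe file

    # Turn raw text into padded sequences of integer word indices
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    data = pad_sequences(sequences, maxlen=maxlen)

    # Build an index -> GloVe vector matrix ('glove.6B.100d.txt' is a placeholder path)
    embeddings_index = {}
    with open('glove.6B.100d.txt') as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

    embedding_matrix = np.zeros((max_words, embedding_size))
    for word, i in tokenizer.word_index.items():
        if i < max_words and word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]

    # Frozen, pre-trained embedding layer for the encoder input
    embedding_layer = Embedding(max_words, embedding_size,
                                weights=[embedding_matrix],
                                mask_zero=True, trainable=False)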

    For decoding, you will need a Dense layer, but wrapping it in TimeDistributed is optional with Keras 2; by default, Dense applies its kernel to every timestep of the 3D tensor you pass:

    decoded = RepeatVector(timesteps)(encoded)
    decoded = LSTM(input_dim, return_sequences=True)(decoded)   # decoder LSTM; the unit count here is a free choice
    decoded = Dense(vocab_size, activation='softmax')(decoded)  # per-timestep distribution over the vocabulary
    # output shape: (batch_size, timesteps, vocab_size)
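
    Putting the pieces together, an end-to-end sketch could look like the following; it assumes `timesteps`, `latent_dim`, `vocab_size`, `embedding_size`, and the padded `data` array from above are defined, and the optimizer/loss choices are just one reasonable option:

    import numpy as np
    from keras.layers import Input, LSTM, RepeatVector, Dense, Embedding
    from keras.models import Model

    inputs = Input(shape=(timesteps,))            # integer word indices
    embedded = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)
    encoded = LSTM(latent_dim)(embedded)          # this is the document vector

    decoded = RepeatVector(timesteps)(encoded)
    decoded = LSTM(latent_dim, return_sequences=True)(decoded)
    decoded = Dense(vocab_size, activation='softmax')(decoded)

    sequence_autoencoder = Model(inputs, decoded)
    encoder = Model(inputs, encoded)              # use this model for "doc2vec"

    # With a sparse loss, the targets are the same index sequences, expanded
    # with a trailing axis so they match the 3D softmax output.
    sequence_autoencoder.compile(optimizer='adam',
                                 loss='sparse_categorical_crossentropy')
    # sequence_autoencoder.fit(data, np.expand_dims(data, -1),
    #                          epochs=10, batch_size=32)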
    

    It's worth noting that restricting the vocabulary to the top N most frequent words will speed up training; otherwise that softmax becomes very costly to compute. The Keras example likewise limits the number of words and maps every other word to a special UNKnown token.
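
    As a small illustration of that vocabulary capping (the size and the token string are assumptions, and `oov_token` requires a reasonably recent Keras version):

    from keras.preprocessing.text import Tokenizer

    # Keep only the 10,000 most frequent words; every other word maps to '<unk>'
    tokenizer = Tokenizer(num_words=10000, oov_token='<unk>')
    tokenizer.fit_on_texts(texts)               # texts = your corpus, as above
    sequences = tokenizer.texts_to_sequences(texts)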