Tags: tensorflow, keras, lstm, autoencoder, seq2seq

How to have an LSTM Autoencoder model predict over the whole vocabulary while representing words as embeddings


So I have been working on an LSTM Autoencoder model, and I have created various versions of it.

1. Create the model using already trained word embeddings: in this scenario, I used the weights of the pre-trained GloVe vectors as the weights of the features (text data). This is the structure:

    inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
    encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
    encoded = Lambda(rev_entropy)(encoded)
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer="sgd", loss='mse')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath='checkpoint/{epoch}.hdf5')
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
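
Here train_gen has to yield batches that are already embedded, i.e. every sample has shape (SEQUENCE_LEN, EMBED_SIZE). A minimal sketch of what such a generator might look like, assuming a glove dict that maps each word to its pre-trained vector (the dict and the sentences list are assumptions, not part of the code above):

    import numpy as np

    def embedded_sentence_generator(sentences, glove, batch_size):
        # sentences: list of token lists, each padded/truncated to SEQUENCE_LEN
        # glove:     dict mapping word -> np.ndarray of shape (EMBED_SIZE,)  [assumed]
        while True:
            for start in range(0, len(sentences), batch_size):
                batch = sentences[start:start + batch_size]
                X = np.zeros((len(batch), SEQUENCE_LEN, EMBED_SIZE))
                for i, sent in enumerate(batch):
                    for j, word in enumerate(sent[:SEQUENCE_LEN]):
                        X[i, j] = glove.get(word, np.zeros(EMBED_SIZE))
                # the autoencoder reconstructs its own input, so X is both input and target
                yield X, X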
2. In the second scenario, I implemented the word embedding layer inside the model itself:

This is the structure:

    inputs = Input(shape=(SEQUENCE_LEN, ), name="input")
    embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=SEQUENCE_LEN, trainable=False)(inputs)
    encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = LSTM(EMBED_SIZE, return_sequences=True)(decoded)
    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer="sgd", loss='categorical_crossentropy')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"))
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
3. In the third scenario, I did not use any embedding technique but used one-hot encoding for the features. This is the structure of the model:

    inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
    encoded = Bidirectional(LSTM(LATENT_SIZE, kernel_initializer="glorot_normal"), merge_mode="sum", name="encoder_lstm")(inputs)
    encoded = Lambda(score_cooccurance, name='Modified_layer')(encoded)
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)
    autoencoder = Model(inputs, decoded)
    sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
    autoencoder.compile(optimizer=sgd, loss='categorical_crossentropy')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath='checkpoint/50/{epoch}.hdf5')
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, callbacks=[checkpoint])
    

As you can see, in the first and second models Embed_size in the decoding is the number of neurons in that layer; it causes the output shape of the encoder layer to become [Latent_size, Embed_size].

In the third model, the output shape of the encoder is [Latent_size, Vocab_size].

Now my question:

Is it doable to change the structure of the model in such a way that I have embeddings to represent my words to the model, while at the same time having vocab_size in the decoder layer?

I need the output shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

I would appreciate it if you could share your ideas with me. One idea could be adding more layers; just consider that, at any cost, I don't want to have Embed_size in the last layer.


Solution

  • Your questions:

    Is it doable to change the structure of the model in such a way that I have embeddings to represent my words to the model, while at the same time having vocab_size in the decoder layer?

    I like to use the TensorFlow transformer model as a reference: https://github.com/tensorflow/models/tree/master/official/transformer

    In language translation tasks the model input tends to be a token index, which is then subject to an embedding lookup, resulting in a shape of (sequence_length, embedding_dims); the encoder itself works on this shape. The decoder output also tends to have the shape (sequence_length, embedding_dims). For instance, the model above then transforms the decoder output into logits by taking a dot product between the output and the embedding vectors. This is the transformation they use: https://github.com/tensorflow/models/blob/master/official/transformer/model/embedding_layer.py#L94
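
    In other words, that linked function is essentially a dot product between the decoder output and the (shared) embedding matrix. A rough sketch of the same idea (not the transformer code itself; shapes are annotated in the comments):

        import tensorflow as tf

        def embeddings_to_logits(decoder_output, embedding_matrix):
            # decoder_output:   (batch_size, sequence_length, embedding_dims)
            # embedding_matrix: (vocab_size, embedding_dims)
            batch_size = tf.shape(decoder_output)[0]
            length = tf.shape(decoder_output)[1]
            embedding_dims = embedding_matrix.shape[-1]
            vocab_size = embedding_matrix.shape[0]

            # Flatten the time dimension, dot every position with every embedding
            # vector, then restore (batch_size, sequence_length, vocab_size).
            flat = tf.reshape(decoder_output, [-1, embedding_dims])
            logits = tf.matmul(flat, embedding_matrix, transpose_b=True)
            return tf.reshape(logits, [batch_size, length, vocab_size])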

    I would recommend an approach similar to the language translation models:

    • pre-stage:
      • input_shape=(sequence_length, 1) [i.e. token_index in [0, vocab_size)]
    • encoder:
      • input_shape=(sequence_length, embedding_dims)
      • output_shape=(latent_dims)
    • decoder:
      • input_shape=(latent_dims)
      • output_shape=(sequence_length, embedding_dims)

    Pre-processing converts token indices into embedding_dims. This can be used to generate both the encoder input and the decoder targets.

    Post-processing converts embedding_dims back to logits (in the vocab_index space); a rough sketch of the whole layout follows below.
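
    Put together in Keras, that layout could look roughly like this. It is only a sketch under assumptions: glove_matrix is a pre-built (VOCAB_SIZE, EMBED_SIZE) matrix of pre-trained vectors, and the TimeDistributed(Dense(VOCAB_SIZE)) projection is a simpler stand-in for the embedding dot product described above; all other constants reuse the names from the question.

        from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, Dense, TimeDistributed
        from keras.models import Model

        # pre-stage: token indices in [0, VOCAB_SIZE)
        inputs = Input(shape=(SEQUENCE_LEN,), name="input")

        # token index -> embedding_dims (glove_matrix is an assumed pre-trained weight matrix)
        embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE,
                             input_length=SEQUENCE_LEN,
                             weights=[glove_matrix], trainable=False)(inputs)

        # encoder: (SEQUENCE_LEN, EMBED_SIZE) -> (LATENT_SIZE,)
        encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedded)

        # decoder: (LATENT_SIZE,) -> (SEQUENCE_LEN, EMBED_SIZE)
        decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
        decoded = LSTM(EMBED_SIZE, return_sequences=True, name="decoder_lstm")(decoded)

        # post-processing: embedding_dims -> per-step vocabulary distribution
        outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"), name="to_vocab")(decoded)

        autoencoder = Model(inputs, outputs)
        # targets are the integer token indices, so sparse_categorical_crossentropy fits
        autoencoder.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
        autoencoder.summary()

    With this layout the encoder output stays at (LATENT_SIZE,), while the decoder still predicts over the whole vocabulary at every time step, which is what the question asks for.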

    I need the output shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

    That doesn't sound right. Typically what one is trying to achieve with an autoencoder is to have an embedding vector for the sentence, so the output of the encoder is typically [latent_dims]. The output of the decoder needs to be translatable into [sequence_length, vocab_index (1)], which is typically done by converting from embedding space to logits and then taking the argmax to convert to a token index.
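
    As a small illustration of that last step (assuming the sketch above, plus an id2word lookup table and an X_batch of inputs, neither of which is part of the original code):

        import numpy as np

        predictions = autoencoder.predict(X_batch)   # (batch, SEQUENCE_LEN, VOCAB_SIZE)
        token_ids = np.argmax(predictions, axis=-1)  # (batch, SEQUENCE_LEN)

        # map every index back to a word (id2word: index -> word, assumed)
        decoded_sentences = [[id2word[i] for i in sentence] for sentence in token_ids]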