I am learning about TensorFlow and seq2seq problems for machine translation. For this, I set myself the following task:
I created an Excel file containing random dates in different formats, for example:
In my dataset, each format occurs 1000 times. These are my training (X) values. One half of my target (Y) values is always in this format:
And the other half is in this format:
To make the model able to differentiate these two Y values, an additional piece of context information (C) is given as text:
So some rows look like this:
So my goal is to create a model which is able to "translate" any date format into the German date format, e.g. 05.09.2192.
The dataset contains 34,000 pairs.
To solve this, I use a character-based tokenizer to transform the text into integers:
tokenizer = keras.preprocessing.text.Tokenizer(filters='', char_level=True, oov_token="|")
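For context, this is roughly how I fit and apply that tokenizer (continuing from the line above); x_texts and y_texts are just placeholder names for the raw date strings:
tokenizer.fit_on_texts(x_texts + y_texts)                     # x_texts / y_texts: raw input and target strings (names are illustrative)
vocab_size = len(tokenizer.word_index) + 1                    # +1 for the reserved padding index 0
x_seqs = tokenizer.texts_to_sequences(x_texts)                # strings -> lists of character ids
y_seqs = tokenizer.texts_to_sequences(y_texts)
x_pad = keras.preprocessing.sequence.pad_sequences(x_seqs)    # pads at the front (before the date) by default
y_pad = keras.preprocessing.sequence.pad_sequences(y_seqs)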
I use an LSTM encoder-decoder model and expect it to reach perfect accuracy, since the mapping between X and Y can be learned perfectly.
However, I reach a maximum of 72% accuracy. Even worse, the accuracy only gets that high because the padding is generated well: most of the Y values are pretty short and are therefore padded, so 12.02.2001 becomes e.g. ||||||||||||||||||||12.02.2001. The model learns to generate the padding tokens well, but not the expected value.
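One way to confirm this is to measure accuracy only on the non-padding positions. This is just an illustrative sketch, not the metric from my notebook, and it assumes the padding token is mapped to index 0:
import tensorflow as tf

def masked_accuracy(y_true, y_pred):
    # y_true: integer character ids, shape (batch, time)
    # y_pred: per-character probabilities, shape (batch, time, vocab_size)
    pred_ids = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
    y_true = tf.cast(y_true, tf.int32)
    matches = tf.cast(tf.equal(y_true, pred_ids), tf.float32)
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)   # ignore padding positions
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)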
This is the model structure I used in my latest test:
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from tensorflow.keras.layers import Concatenate

# encoder and decoder take integer character ids, fixed batch size of 32
encoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)
decoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)

# shared character embedding (dimension 1) for encoder and decoder inputs
embeddings = keras.layers.Embedding(vocab_size, 1)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# dense pre-net on the encoder side
encoder_0 = keras.layers.Dense(128)(encoder_embeddings)
encoder_0d = keras.layers.Dropout(0.4)(encoder_0)
encoder_0_1 = keras.layers.Dense(256)(encoder_0d)
encoder_0_1d = keras.layers.Dropout(0.2)(encoder_0_1)
encoder_0_2 = keras.layers.Dense(128)(encoder_0_1d)
encoder_0_2d = keras.layers.Dropout(0.05)(encoder_0_2)
encoder_0_3 = keras.layers.Dense(64)(encoder_0_2d)

# bidirectional LSTM encoder; forward and backward states are concatenated
encoder_1 = keras.layers.LSTM(64, return_state=True, return_sequences=True, recurrent_dropout=0.2)
encoder_lstm_bidirectional = keras.layers.Bidirectional(encoder_1)
encoder_output, state_h1, state_c1, state_h2, state_c2 = encoder_lstm_bidirectional(encoder_0_3)
encoder_state = [Concatenate()([state_h1, state_h2]), Concatenate()([state_c1, state_c2])]

# decoder: LSTM cell with 2*64 units to match the concatenated encoder state,
# driven by a TrainingSampler (teacher forcing)
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(64 * 2)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=[sequence_length], training=True)  # sequence_length is defined elsewhere in the notebook

y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[y_proba])
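For completeness, training looks roughly like this; the optimizer and epoch count are illustrative, and decoder_input_data / target_data stand for the teacher-forced (shifted) decoder input and the padded targets:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),   # targets are integer ids, outputs are probabilities
    optimizer=keras.optimizers.Adam(),
    metrics=["sparse_categorical_accuracy"],
)
# batch_size must stay 32 because the Input layers fix batch_input_shape=[32, None]
model.fit([encoder_input_data, decoder_input_data], target_data,
          batch_size=32, epochs=20)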
If needed, I can upload the whole notebook to GitHub, but maybe there is a simple solution that I just have not seen so far. Thanks for your help!
So, in case this helps anyone in the future: The model did exactly what I asked it to do.
BUT
You need to be careful that your data preprocessing does not lead to ambiguity, so you have to prevent something like:
a -> b and also a -> c
While the model improves on one equation, it gets worse on the other. That was my problem. See this example:
eq1: 26.04.1994 -> 26.04.1994
eq2: 26.04.1994 -> Tuesday, 26.04.1994
On the one hand, the model increases the accuracy for eq1; on the other hand, it increases the loss for eq2. So 74% is kind of a compromise the model found.
To solve that, I had to add another feature that describes the data more specifically. So I added an extra condition describing whether Y is written out or just written as a plain date. Now my data structure looks like the following, and my accuracy grew to 98%:
eq1: 26.04.1994, dateformat -> 26.04.1994
eq2: 26.04.1994, written_out -> Tuesday, 26.04.1994
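In code, the fix simply means appending that condition to the encoder input string before tokenizing. A rough sketch (the helper and the alphabetic-character heuristic are only for illustration):
def build_encoder_text(x_date, y_date):
    # Tag each pair with how the target is written, so identical X values
    # no longer map to two different Y values (checking for letters in Y is
    # just an illustrative way to derive the condition).
    condition = "written_out" if any(c.isalpha() for c in y_date) else "dateformat"
    return f"{x_date}, {condition}"

# build_encoder_text("26.04.1994", "26.04.1994")           -> "26.04.1994, dateformat"
# build_encoder_text("26.04.1994", "Tuesday, 26.04.1994")  -> "26.04.1994, written_out"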