I am learning about TensorFlow and seq2seq problems for machine translation. For this, I set myself the following task:
I created an Excel file containing random dates in different formats, for example:
In my dataset, each format occurs 1000 times. These are my training (X) values. One half of my target (Y) values is always in this format:
And the other half is in this format:
To make the model able to differentiate these two Y values, an additional piece of context information (C) is given as text:
So some rows look like this:
So my goal is to create a model which is able to "translate" any date format into the German date format, e.g. 05.09.2192.
The dataset contains 34,000 pairs.
To solve this, I use a character-based tokenizer to transform the text into integers:
tokenizer = keras.preprocessing.text.Tokenizer(filters='', char_level=True, oov_token="|")
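For context, this is roughly how I fit and apply that tokenizer (continuing from the line above); x_texts and y_texts are just placeholder names for the raw date strings:
tokenizer.fit_on_texts(x_texts + y_texts)                     # x_texts / y_texts: raw input and target strings (names are illustrative)
vocab_size = len(tokenizer.word_index) + 1                    # +1 for the reserved padding index 0
x_seqs = tokenizer.texts_to_sequences(x_texts)                # strings -> lists of character ids
y_seqs = tokenizer.texts_to_sequences(y_texts)
x_pad = keras.preprocessing.sequence.pad_sequences(x_seqs)    # pads at the front (before the date) by default
y_pad = keras.preprocessing.sequence.pad_sequences(y_seqs)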
I use an LSTM encoder-decoder model and expect it to reach perfect accuracy, since the mapping between X and Y can be learned perfectly.
However, I reach a maximum of 72% accuracy. Even worse, the accuracy only gets that high because the padding is generated well: most of the Y values are pretty short and are therefore padded, so 12.02.2001 becomes e.g. ||||||||||||||||||||12.02.2001. The model learns to generate the padding tokens well, but not the expected value.
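One way to confirm this is to measure accuracy only on the non-padding positions. This is just an illustrative sketch, not the metric from my notebook, and it assumes the padding token is mapped to index 0:
import tensorflow as tf

def masked_accuracy(y_true, y_pred):
    # y_true: integer character ids, shape (batch, time)
    # y_pred: per-character probabilities, shape (batch, time, vocab_size)
    pred_ids = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
    y_true = tf.cast(y_true, tf.int32)
    matches = tf.cast(tf.equal(y_true, pred_ids), tf.float32)
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)   # ignore padding positions
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)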
This is the model structure I used in my latest test:
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from tensorflow.keras.layers import Concatenate

# encoder and decoder take integer character ids, fixed batch size of 32
encoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)
decoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)

# shared character embedding (dimension 1) for encoder and decoder inputs
embeddings = keras.layers.Embedding(vocab_size, 1)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# dense pre-net on the encoder side
encoder_0 = keras.layers.Dense(128)(encoder_embeddings)
encoder_0d = keras.layers.Dropout(0.4)(encoder_0)
encoder_0_1 = keras.layers.Dense(256)(encoder_0d)
encoder_0_1d = keras.layers.Dropout(0.2)(encoder_0_1)
encoder_0_2 = keras.layers.Dense(128)(encoder_0_1d)
encoder_0_2d = keras.layers.Dropout(0.05)(encoder_0_2)
encoder_0_3 = keras.layers.Dense(64)(encoder_0_2d)

# bidirectional LSTM encoder; forward and backward states are concatenated
encoder_1 = keras.layers.LSTM(64, return_state=True, return_sequences=True, recurrent_dropout=0.2)
encoder_lstm_bidirectional = keras.layers.Bidirectional(encoder_1)
encoder_output, state_h1, state_c1, state_h2, state_c2 = encoder_lstm_bidirectional(encoder_0_3)
encoder_state = [Concatenate()([state_h1, state_h2]), Concatenate()([state_c1, state_c2])]

# decoder: LSTM cell with 2*64 units to match the concatenated encoder state,
# driven by a TrainingSampler (teacher forcing)
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(64 * 2)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=[sequence_length], training=True)  # sequence_length is defined elsewhere in the notebook

y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[y_proba])
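For completeness, training looks roughly like this; the optimizer and epoch count are illustrative, and decoder_input_data / target_data stand for the teacher-forced (shifted) decoder input and the padded targets:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),   # targets are integer ids, outputs are probabilities
    optimizer=keras.optimizers.Adam(),
    metrics=["sparse_categorical_accuracy"],
)
# batch_size must stay 32 because the Input layers fix batch_input_shape=[32, None]
model.fit([encoder_input_data, decoder_input_data], target_data,
          batch_size=32, epochs=20)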
If needed, I can upload the whole notebook to GitHub, but maybe there is a simple solution that I just have not seen so far. Thanks for your help!
So, in case this helps anyone in the future: The model did exactly what I asked it to do.
BUT
You need to be careful that your data preprocessing does not lead to ambiguity, so you have to prevent something like:
a -> b and also a -> c
While the model improves on one equation, it gets worse on the other. That was my problem. See this example:
eq1: 26.04.1994 -> 26.04.1994
eq2: 26.04.1994 -> Tuesday, 26.04.1994
On the one hand, the model increases the accuracy for eq1; on the other hand, it increases the loss for eq2. So 74% is kind of a compromise the model found.
To solve that, I had to add another feature that describes the data more specifically. So I added an extra condition describing whether Y is written out or just written as a plain date. Now my data structure looks like the following, and my accuracy grew to 98%:
eq1: 26.04.1994, dateformat -> 26.04.1994
eq2: 26.04.1994, written_out -> Tuesday, 26.04.1994
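In code, the fix simply means appending that condition to the encoder input string before tokenizing. A rough sketch (the helper and the alphabetic-character heuristic are only for illustration):
def build_encoder_text(x_date, y_date):
    # Tag each pair with how the target is written, so identical X values
    # no longer map to two different Y values (checking for letters in Y is
    # just an illustrative way to derive the condition).
    condition = "written_out" if any(c.isalpha() for c in y_date) else "dateformat"
    return f"{x_date}, {condition}"

# build_encoder_text("26.04.1994", "26.04.1994")           -> "26.04.1994, dateformat"
# build_encoder_text("26.04.1994", "Tuesday, 26.04.1994")  -> "26.04.1994, written_out"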