tensorflow, keras, translation

Transformer model keeps giving the same translation result


I'm adapting the code from the official Transformer tutorial to a translation task on my own text dataset (sadly only 500+ example pairs). The only change is the tokenizer: I fit tf.keras.preprocessing.text.Tokenizer() on my own dataset. The model trains well, and the last epoch reports:

Epoch 30 Batch 50 Loss 0.0677 Accuracy 0.9823

But when I use the trained translator, it returns the same output no matter what input text it gets. The output is actually fairly fluent and reasonable (it is clearly generated, not copied from the training set), but it is unrelated to the input text.

My parameters are:

num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
BUFFER_SIZE = 20000
BATCH_SIZE = 64
EPOCHS = 30
MAX_TOKENS = 413

I know it must have something to do with the dataset, but has anybody run into the same problem? Did it converge to a local minimum? What is the key problem?


Solution

  • This is indicative of a model that has collapsed or failed to learn, which usually means something major is amiss. Debugging neural networks is not trivial: there are no signposts and no syntax errors telling you which way to go. I typically start by inspecting the actual data going into the model (not what you think you passed in, but what the model actually receives; preprocessing errors are very common, at least in my code). There are many good tutorials on debugging neural networks, so I would point you towards working through one of them.

    Here's one such tutorial from one of my most trusted sources for data science information:

    https://towardsdatascience.com/checklist-for-debugging-neural-networks-d8b2a9434f21

    I suggest you start by working through that or a similar tutorial. As you run into more specific questions, post separate, dedicated questions with details on that debugging step (ideally with code or samples and the expected input/output). As it stands, your question is too broad for a silver-bullet answer.
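
As a starting point for the "inspect the actual data going into the model" step, here is a minimal sketch of how one might decode a batch of token ids back into text and flag obviously degenerate inputs. It is plain Python, not code from the tutorial. The names `decode_ids` and `inspect_batch` are hypothetical, and it assumes that padding uses id 0 (the default for Keras `pad_sequences`) and that you have an id-to-word mapping such as the `index_word` dict of a fitted Keras `Tokenizer`. If your batch is a tensor, convert it to nested lists first (for example with `.numpy().tolist()`).

```python
from collections import Counter

PAD_ID = 0  # assumption: 0 is the padding id (Keras pad_sequences default)


def decode_ids(ids, index_word, pad_id=PAD_ID):
    """Turn one row of token ids back into text, skipping padding."""
    # Ids missing from the mapping are shown explicitly instead of being hidden.
    return " ".join(index_word.get(i, f"<unk:{i}>") for i in ids if i != pad_id)


def inspect_batch(batch, index_word, pad_id=PAD_ID):
    """Summarise what a batch of encoded inputs really contains."""
    # Strip padding so rows that differ only in padding compare as equal.
    rows = [tuple(i for i in row if i != pad_id) for row in batch]
    counts = Counter(rows)
    total_slots = sum(len(row) for row in batch)
    pad_slots = sum(1 for row in batch for i in row if i == pad_id)
    return {
        "decoded": [decode_ids(row, index_word, pad_id) for row in batch],
        "empty_rows": sum(1 for r in rows if not r),
        "duplicate_rows": sum(c - 1 for c in counts.values() if c > 1),
        "pad_fraction": pad_slots / total_slots if total_slots else 0.0,
    }


if __name__ == "__main__":
    # Toy mapping and batch, made up for illustration only.
    index_word = {1: "[start]", 2: "[end]", 3: "hello", 4: "world"}
    batch = [
        [1, 3, 4, 2, 0, 0],
        [1, 3, 4, 2, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    report = inspect_batch(batch, index_word)
    for text in report["decoded"]:
        print(repr(text))
    print("empty rows:", report["empty_rows"])
    print("duplicate rows:", report["duplicate_rows"])
    print("pad fraction:", round(report["pad_fraction"], 3))
```

If the decoded inputs at inference time come out empty, all identical, or full of unknown ids, the encoder is effectively seeing the same thing for every sentence, and that would be worth fixing before looking at the model itself. This is only one possible check and does not rule out other causes.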