I'm currently trying to implement a transformer and have trouble understanding its loss calculation.
My encoder's input, for batch_size=1 and max_sentence_length=8, looks like:
[[Das, Wetter, ist, gut, <blank>, <blank>, <blank>, <blank>]]
My decoder's input (German to English) looks like:
[[<start>, The, weather, is, good, <end>, <blank>, <blank>]]
Let's say my transformer predicted these class probabilities (showing only the word with the highest probability at each position):
[[The, good, is, weather, <end>, <blank>, <blank>, <blank>]]
Now I calculate the loss using:
loss = categorical_crossentropy(
[[The, good, is, weather, <end>, <blank>, <blank>, <blank>]],
[[The, weather, is, good, <end>, <blank>, <blank>, <blank>]]
)
Is this the correct way to calculate the loss? My transformer always predicts the blank token for the next word, and I suspect that's because there is a mistake in my loss calculation and I need to handle the blank tokens before computing the loss.
You need to mask out the padding. (What you call <blank> is more often called <pad>.)
1. Create a mask saying where the valid tokens are (pseudocode: mask = target != '<pad>').
2. When computing the categorical cross-entropy, do not reduce the loss automatically; keep the per-token values.
3. Multiply the loss values by the mask, i.e., positions corresponding to the <blank> tokens get zeroed out, and sum the losses at the valid positions (pseudocode: loss_sum = (loss * mask).sum()).
4. Divide loss_sum by the number of valid positions, i.e., the sum of the mask (pseudocode: loss = loss_sum / mask.sum()).