Tags: machine-learning, nlp, transformer-model, language-model

How is the transformer loss calculated for blank-token predictions?


I'm currently trying to implement a transformer and have trouble understanding its loss calculation.

My encoder's input, for batch_size=1 and max_sentence_length=8, looks like:

[[Das, Wetter, ist, gut, <blank>, <blank>, <blank>, <blank>]]

My decoder's input (German to English) looks like:

[[<start>, The, weather, is, good, <end>, <blank>, <blank>]]

Let's say my transformer predicted these class probabilities (showing only the word with the highest probability at each position):

[[The, good, is, weather, <end>, <blank>, <blank>, <blank>]]

Now I calculate the loss using:

loss = categorical_crossentropy(
   [[The, good, is, weather, <end>, <blank>, <blank>, <blank>]],
   [[The, weather, is, good, <end>, <blank>, <blank>, <blank>]]
)

Is this the correct way to calculate the loss? My transformer always predicts the blank token for the next word, and I suspect that's because there is a mistake in my loss calculation and I need to do something with the blank tokens before computing the loss.


Solution

  • You need to mask out the padding. (What you call <blank> is more commonly called <pad>.) The steps are outlined below, followed by a short code sketch.

    • Create a mask saying where the valid tokens are (pseudocode: mask = target != '<pad>')

    • When computing the categorical cross-entropy, do not reduce (average) the loss automatically; keep the per-token values.

    • Multiply the loss values by the mask so that positions corresponding to the <blank> tokens are zeroed out, then sum the losses at the valid positions. (pseudocode: loss_sum = (loss * mask).sum())

    • Divide loss_sum by the number of valid positions, i.e., the sum of the mask. (pseudocode: loss = loss_sum / mask.sum())
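
For concreteness, here is a minimal sketch of those steps assuming a TensorFlow/Keras setup with integer token ids and raw decoder logits; PAD_ID and masked_loss are hypothetical names that you would adapt to your own vocabulary and training loop:

import tensorflow as tf

# Hypothetical pad id; use whatever id your vocabulary assigns to <blank>/<pad>.
PAD_ID = 0

# reduction='none' keeps one loss value per token instead of averaging right away.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def masked_loss(target_ids, logits):
    # target_ids: (batch, seq_len) integer token ids
    # logits:     (batch, seq_len, vocab_size) raw decoder outputs
    per_token_loss = loss_fn(target_ids, logits)  # shape (batch, seq_len)

    # 1.0 where the target is a real token, 0.0 where it is padding.
    mask = tf.cast(tf.not_equal(target_ids, PAD_ID), per_token_loss.dtype)

    # Zero out the loss at padded positions and average over valid positions only.
    loss_sum = tf.reduce_sum(per_token_loss * mask)
    return loss_sum / tf.reduce_sum(mask)

With this masking, the <blank> positions contribute nothing to the loss or the gradients, so the model is no longer rewarded for predicting padding everywhere.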