tensorflow · deep-learning · nlp

Loss function for Image captioning with visual attention


I am trying to understand the TensorFlow implementation of image captioning with visual attention. I understand what SparseCategoricalCrossentropy is, but what is loss_function doing? Can someone explain? (TensorFlow implementation)

import tensorflow as tf

# Per-token cross-entropy: reduction='none' keeps one loss value per position.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  # True where the target token is a real word, False where it is <pad> (id 0).
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  # Zero out the loss at padded positions, then average.
  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)

Solution

  • We need to go back to what is in real. In real, the words are encoded as numbers with tf.keras.preprocessing.text.Tokenizer. In the tutorial, the value 0 is reserved for the <pad> token.

    tokenizer.word_index['<pad>'] = 0
    

    So, the loss function simply applies a mask to discard the predictions made on the <pad> tokens, because they don't provide meaningful information for the training of the network.
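
    As a quick sanity check (a sketch, not part of the tutorial; the token ids and logits below are made up), you can see that a position whose target is the <pad> token (id 0) contributes nothing to the loss, while a real word does:

    import tensorflow as tf

    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

    # Toy batch: first target is word id 3, second is the <pad> token (id 0).
    real = tf.constant([3, 0])
    # Made-up logits over a vocabulary of 5 words for each position.
    pred = tf.constant([[0.1, 0.2, 0.3, 2.0, 0.1],
                        [1.5, 0.1, 0.1, 0.1, 0.1]])

    per_token = loss_object(real, pred)              # unreduced loss per position
    mask = tf.cast(real != 0, per_token.dtype)       # 1.0 for real words, 0.0 for <pad>
    print(per_token.numpy())                         # both positions have a non-zero loss
    print((per_token * mask).numpy())                # the <pad> position is zeroed out
    print(tf.reduce_mean(per_token * mask).numpy())  # what loss_function returns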