Tags: python, tensorflow, machine-learning, keras, loss-function

What should I use as the target vector when I use BinaryCrossentropy(from_logits=True) in tensorflow.keras?


I have a multi-label classification problem in which each target is a vector of ones and zeros that are not mutually exclusive (for the sake of clarity, my target looks like [0, 1, 0, 0, 1, 1, ...]).

My understanding so far is:

  • I should use a binary cross-entropy function (as explained in this answer).

  • Also, I understand that tf.keras.losses.BinaryCrossentropy() is a wrapper around TensorFlow's sigmoid_cross_entropy_with_logits, and that it can be used with from_logits set to either True or False (as explained in this question).

  • Since sigmoid_cross_entropy_with_logits applies the sigmoid itself, it expects its input to be in the (-inf, +inf) range, i.e. raw logits.

  • tf.keras.losses.BinaryCrossentropy() must be used with from_logits=False when the network itself applies a sigmoid activation on its last layer. It will then invert the sigmoid (i.e. recover the logits) and pass the result to sigmoid_cross_entropy_with_logits, which applies the sigmoid again. This, however, can cause numerical issues due to the asymptotes of the sigmoid/logit functions. (A small numerical check of this relation is sketched right after this list.)

  • To improve numerical stability, we can drop the final sigmoid layer and use tf.keras.losses.BinaryCrossentropy(from_logits=True).
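
To double-check this understanding, here is a small sanity check I would expect to hold (the targets and logits below are made-up numbers, not from my actual data):

    import tensorflow as tf

    y_true = tf.constant([[0., 1., 0., 0., 1., 1.]])          # multi-label 0/1 targets
    logits = tf.constant([[-1.2, 0.7, -3.0, 0.1, 2.5, 0.9]])  # raw (pre-sigmoid) outputs

    bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

    # Both calls should give (almost) the same loss: the second one receives
    # probabilities and internally goes back to sigmoid_cross_entropy_with_logits
    # (or a clipped probability formula, depending on the TF version).
    print(bce_logits(y_true, logits).numpy())
    print(bce_probs(y_true, tf.sigmoid(logits)).numpy())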

Question:

If we use tf.keras.losses.BinaryCrossentropy(from_logits=True), what target should I use? Do I need to change my target from the vector of ones and zeros to something else?

I suppose that I should then apply a sigmoid activation to the network output at inference time. Is there a way to add a sigmoid layer that is active only in inference mode and not in training mode?


Solution

  • First, let me give some notes about numerical stability:

    As mentioned in the comments section, the numerical instability when using from_logits=False comes from transforming the probability values back into logits, which involves a clipping operation (as discussed in this question and its answer). However, to the best of my knowledge, this does NOT create any serious issues for most practical applications (although there are some cases where applying the softmax/sigmoid function inside the loss function, i.e. using from_logits=True, would be more numerically stable in terms of computing gradients; see this answer for a mathematical explanation).

    In other words, if you are not concerned with the precision of the generated probability values down to a sensitivity of less than 1e-7, and you have not observed a related convergence issue in your experiments, then you should not worry too much; just use the sigmoid and binary cross-entropy as before, i.e. model.compile(loss='binary_crossentropy', ...), and it will work fine.
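
    As a rough illustration of that ~1e-7 sensitivity (the exact printed numbers depend on the TensorFlow version and its internal clipping epsilon, so treat this only as a sketch), a very confident prediction keeps more of its precision when it is passed to the loss as a logit rather than as a probability:

        import tensorflow as tf

        y_true = tf.constant([[1.0]])
        logit = tf.constant([[20.0]])  # a very confident raw output
        prob = tf.sigmoid(logit)       # saturates to 1.0 in float32

        bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
        bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

        # The logits path can still represent the tiny true loss (about 2e-9),
        # whereas the probability path is limited by float32 precision and the
        # internal clipping of probabilities to roughly [1e-7, 1 - 1e-7].
        print(bce_logits(y_true, logit).numpy())
        print(bce_probs(y_true, prob).numpy())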

    All in all, if you are really concerned with numerical stability, you can take the safest path and use from_logits=True without using any activation function on the last layer of the model.
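
    For instance, a minimal sketch of that safest setup could look like the following (the layer sizes, input shape and optimizer are just placeholders, not taken from the question). It also shows one way to address the side question about the sigmoid: keep it out of the trained model and only attach it on top for inference:

        import tensorflow as tf

        n_labels = 6  # hypothetical number of (non mutually exclusive) labels

        # Training model: no activation on the last layer, so it outputs raw logits.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
            tf.keras.layers.Dense(n_labels),  # linear output, i.e. logits
        ])
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

        # For inference, wrap the trained model with a sigmoid so that it outputs
        # probabilities; this extra layer is never used during training.
        inference_model = tf.keras.Sequential([
            model,
            tf.keras.layers.Activation('sigmoid'),
        ])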


    Now, to answer the original question: the true labels or target values (i.e. y_true) should still be only zeros and ones when using BinaryCrossentropy(from_logits=True). Rather, it is y_pred (i.e. the output of the model) that should not be a probability distribution in this case (i.e. the sigmoid function should not be used on the last layer when from_logits=True).
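
    As a small illustration (the numbers are made up), the target below is the same kind of 0/1 multi-label vector as before, while the prediction consists of unbounded logits:

        import tensorflow as tf

        bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

        y_true = tf.constant([[0., 1., 0., 0., 1., 1.]])          # still just zeros and ones
        y_pred = tf.constant([[-0.8, 1.3, -2.1, 0.4, 3.0, 0.6]])  # raw logits, not probabilities

        print(bce(y_true, y_pred).numpy())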