Tags: python, numpy, machine-learning, cross-entropy

Intuition behind categorical cross entropy


I'm trying to implement a categorical cross-entropy loss function to better understand the intuition behind it. So far my implementation looks like this:

import numpy as np

# Observations
y_true = np.array([[0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.05, 0.95, 0.05], [0.1, 0.8, 0.1]])

# Loss calculations
def categorical_loss():
  loss1 = -(0.0 * np.log(0.05) + 1.0 * np.log(0.95) + 0 * np.log(0.05))
  loss2 = -(0.0 * np.log(0.1) + 0.0 * np.log(0.8) + 1.0 * np.log(0.1))
  loss = (loss1 + loss2) / 2 # divided by 2 because there are 2 observations
  return loss

# Show loss
print(categorical_loss()) # 1.176939193690798

However, I do not understand how the function should behave to return the correct value when:

  • at least one number in y_pred is 0 or 1, because then the log function returns -inf or 0, and what the code implementation should look like in this case
  • at least one number in y_true is 0, because multiplication by 0 always returns 0 and the value of np.log(0.95) is then discarded, and what the code implementation should look like in this case as well

Solution

  • Regarding y_pred being 0 or 1, digging into the Keras backend source code for both binary_crossentropy and categorical_crossentropy, we get:

    def binary_crossentropy(target, output, from_logits=False):
        if not from_logits:
            output = np.clip(output, 1e-7, 1 - 1e-7)
            output = np.log(output / (1 - output))
        return (target * -np.log(sigmoid(output)) +
                (1 - target) * -np.log(1 - sigmoid(output)))
    
    
    def categorical_crossentropy(target, output, from_logits=False):
        if from_logits:
            output = softmax(output)
        else:
            output /= output.sum(axis=-1, keepdims=True)
        output = np.clip(output, 1e-7, 1 - 1e-7)
        return np.sum(target * -np.log(output), axis=-1, keepdims=False)
    

    from which you can clearly see that, in both functions, there is a clipping operation on the output (i.e. the predictions), in order to avoid infinities from the logarithms:

    output = np.clip(output, 1e-7, 1 - 1e-7)
    

    So, here y_pred will never be exactly 0 or 1 in the underlying calculations. The handling is similar in other frameworks.
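    As a quick sanity check (a minimal sketch with a made-up one-hot label, not taken from the Keras source), you can see the effect of the clip: without it, a prediction of exactly 0 in a position where the target is also 0 makes the loss nan, while the clipped version stays finite:

    import numpy as np

    y_true = np.array([[0, 1, 0]])
    y_pred = np.array([[0.0, 1.0, 0.0]])  # contains exact 0s and 1s

    # Without clipping: 0 * log(0) evaluates to 0 * (-inf) = nan
    print(-np.sum(y_true * np.log(y_pred), axis=-1))    # [nan], plus RuntimeWarnings

    # With the same clip as in the backend code above, the loss stays finite
    clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    print(-np.sum(y_true * np.log(clipped), axis=-1))   # [~1e-07]
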

  • Regarding y_true being 0, there is no issue involved: the respective terms contribute 0 to the sum, as they should according to the mathematical definition.
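
Putting the two points together, a vectorized NumPy sketch of the same computation (my own rewrite, not the Keras source) reproduces the hand-computed value from the question and shows that the zero entries of y_true simply contribute nothing to the sum:

import numpy as np

y_true = np.array([[0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.05, 0.95, 0.05], [0.1, 0.8, 0.1]])

# Clip predictions away from exact 0 and 1, then sum -t * log(p) over the classes
clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
per_sample = -np.sum(y_true * np.log(clipped), axis=-1)  # [~0.0513, ~2.3026]

# Average over the 2 observations, matching the hand-computed result
print(per_sample.mean())  # ~1.1769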