My understanding is that the two loss functions should give the same result if CCE's input is already softmaxed. However, as the example below shows, the two results are nearly identical when X's values are small, but become very different when X's values are multiplied by 10.
import numpy as np
import tensorflow as tf
# Small logits: both losses give nearly the same result.
X = np.array([[3.0, 1.0, 1.0], [-1.0, 2.0, 5.0]])
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction=tf.keras.losses.Reduction.NONE)
cce_res = cce(y_true=Y, y_pred=tf.math.softmax(X))
sce_res = tf.nn.softmax_cross_entropy_with_logits(logits=X, labels=Y)
# print(cce_res)
# print(sce_res)
cost1 = tf.reduce_mean(cce_res)
cost2 = tf.reduce_mean(sce_res)
print(cost1)
print(cost2)
# tf.Tensor(1.645245261490345, shape=(), dtype=float64)
# tf.Tensor(1.6452452648724416, shape=(), dtype=float64)
# Same logits scaled by 10: the two losses diverge.
X = np.array([[3.0, 1.0, 1.0], [-1.0, 2.0, 5.0]]) * 10
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction=tf.keras.losses.Reduction.NONE)
cce_res = cce(y_true=Y, y_pred=tf.math.softmax(X))
sce_res = tf.nn.softmax_cross_entropy_with_logits(logits=X, labels=Y)
cost1 = tf.reduce_mean(cce_res)
cost2 = tf.reduce_mean(sce_res)
print(cost1)
print(cost2)
# tf.Tensor(8.059047748974614, shape=(), dtype=float64)
# tf.Tensor(15.0000000020612, shape=(), dtype=float64)
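For what it's worth, here is a small sketch (using X and Y as defined above) that computes the cross-entropy directly from the log-softmax, which is effectively what tf.nn.softmax_cross_entropy_with_logits does. For the scaled logits the per-row losses come out as roughly 0 and 30, so the mean of about 15 from sce_res looks like the mathematically expected value:
# Sketch: cross-entropy from the log-softmax, avoiding the softmax -> clip -> log round trip.
log_probs = tf.nn.log_softmax(X)                  # numerically stable log-softmax of the scaled logits
per_row = -tf.reduce_sum(Y * log_probs, axis=-1)  # per-example cross-entropy
print(per_row)                                    # roughly [0.0, 30.0]
print(tf.reduce_mean(per_row))                    # ~15.0, matching sce_res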
Yes, we expect the same values when from_logits=False and tf.nn.softmax() is applied to the outputs. There is an inconsistency, though, and here's what happens under the hood:
X_softmax = tf.math.softmax(X)  # X here is the scaled version, i.e. the original logits * 10
<tf.Tensor: shape=(2, 3), dtype=float64, numpy=
array([[9.99999996e-01, 2.06115361e-09, 2.06115361e-09],
[8.75651076e-27, 9.35762297e-14, 1.00000000e+00]])>
The values are then clipped, using an epsilon of 1e-7:
X_softmax_clipped = tf.clip_by_value(X_softmax, 1e-7, 1.0 - 1e-7)
<tf.Tensor: shape=(2, 3), dtype=float64, numpy=
array([[9.999999e-01, 1.000000e-07, 1.000000e-07],
[1.000000e-07, 1.000000e-07, 9.999999e-01]])>
Calculating the cross-entropy in your example with these clipped values:
tf.reduce_mean(-tf.reduce_sum(Y * tf.math.log(X_softmax_clipped), -1))
<tf.Tensor: shape=(), dtype=float64, numpy=8.059047875479163>
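To spell out where the 8.059 comes from (a quick sketch with the tensors defined above): with one-hot labels only the true-class probabilities matter, and in the second row that probability (about 9.4e-14) has been clipped up to 1e-7, so its loss is capped at -log(1e-7) ≈ 16.12 instead of about 30. The mean over the two rows is then roughly (0 + 16.12) / 2 ≈ 8.06:
# Sketch: only the true-class probabilities contribute for one-hot labels.
true_class_probs = tf.reduce_sum(Y * X_softmax_clipped, axis=-1)
print(-tf.math.log(true_class_probs))                  # roughly [1e-7, 16.118]; -log(1e-7) ~= 16.118
print(tf.reduce_mean(-tf.math.log(true_class_probs)))  # ~8.059, matching cost1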
So the problem is the epsilon value used for clipping, 1e-7. If you lower it to a much smaller value, say 1e-30, you'll get results identical to tf.nn.softmax_cross_entropy_with_logits.
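As a rough check, you can repeat the manual clip-then-log computation above with the smaller epsilon (tf.keras.backend.set_epsilon would change the global Keras value instead, but that affects everything that uses it):
# Sketch: with epsilon = 1e-30, neither true-class probability is actually clipped here.
X_softmax_clipped_small = tf.clip_by_value(X_softmax, 1e-30, 1.0 - 1e-30)
print(tf.reduce_mean(-tf.reduce_sum(Y * tf.math.log(X_softmax_clipped_small), -1)))
# ~15.0, matching tf.nn.softmax_cross_entropy_with_logits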
That's why (I guess) it is mentioned that using from_logits=True is more numerically stable.
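A quick way to see that in this example (a small sketch, reusing the scaled X and Y) is to pass the raw logits and let the loss apply the softmax internally:
# Sketch: with from_logits=True the loss works on the raw logits directly.
cce_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
print(tf.reduce_mean(cce_logits(y_true=Y, y_pred=X)))
# ~15.0, same as tf.nn.softmax_cross_entropy_with_logits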