My understanding is that the two loss functions should give the same result if CCE's input is already softmaxed. However, as the example below shows, the two results are nearly identical when X's values are small, but become very different when X's values are multiplied by 10.
import numpy as np
import tensorflow as tf
# Small logits: both losses give nearly the same result.
X = np.array([[3.0, 1.0, 1.0], [-1.0, 2.0, 5.0]])
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction=tf.keras.losses.Reduction.NONE)
cce_res = cce(y_true=Y, y_pred=tf.math.softmax(X))
sce_res = tf.nn.softmax_cross_entropy_with_logits(logits=X, labels=Y)
# print(cce_res)
# print(sce_res)
cost1 = tf.reduce_mean(cce_res)
cost2 = tf.reduce_mean(sce_res)
print(cost1)
print(cost2)
# tf.Tensor(1.645245261490345, shape=(), dtype=float64)
# tf.Tensor(1.6452452648724416, shape=(), dtype=float64)
# Same logits scaled by 10: the two losses diverge.
X = np.array([[3.0, 1.0, 1.0], [-1.0, 2.0, 5.0]]) * 10
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction=tf.keras.losses.Reduction.NONE)
cce_res = cce(y_true=Y, y_pred=tf.math.softmax(X))
sce_res = tf.nn.softmax_cross_entropy_with_logits(logits=X, labels=Y)
cost1 = tf.reduce_mean(cce_res)
cost2 = tf.reduce_mean(sce_res)
print(cost1)
print(cost2)
# tf.Tensor(8.059047748974614, shape=(), dtype=float64)
# tf.Tensor(15.0000000020612, shape=(), dtype=float64)
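For what it's worth, here is a small sketch (using X and Y as defined above) that computes the cross-entropy directly from the log-softmax, which is effectively what tf.nn.softmax_cross_entropy_with_logits does. For the scaled logits the per-row losses come out as roughly 0 and 30, so the mean of about 15 from sce_res looks like the mathematically expected value:
# Sketch: cross-entropy from the log-softmax, avoiding the softmax -> clip -> log round trip.
log_probs = tf.nn.log_softmax(X)                  # numerically stable log-softmax of the scaled logits
per_row = -tf.reduce_sum(Y * log_probs, axis=-1)  # per-example cross-entropy
print(per_row)                                    # roughly [0.0, 30.0]
print(tf.reduce_mean(per_row))                    # ~15.0, matching sce_res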
Yes, we expect the same values when from_logits=False and tf.nn.softmax() is applied to the outputs. There is an inconsistency, though, and here's what happens under the hood:
X_softmax = tf.math.softmax(X)  # X here is the scaled version, i.e. the original logits * 10
<tf.Tensor: shape=(2, 3), dtype=float64, numpy=
array([[9.99999996e-01, 2.06115361e-09, 2.06115361e-09],
[8.75651076e-27, 9.35762297e-14, 1.00000000e+00]])>
The values are then clipped, using an epsilon of 1e-7:
X_softmax_clipped = tf.clip_by_value(X_softmax, 1e-7, 1.0 - 1e-7)
<tf.Tensor: shape=(2, 3), dtype=float64, numpy=
array([[9.999999e-01, 1.000000e-07, 1.000000e-07],
[1.000000e-07, 1.000000e-07, 9.999999e-01]])>
Calculating the cross-entropy in your example with these clipped values:
tf.reduce_mean(-tf.reduce_sum(Y * tf.math.log(X_softmax_clipped), -1))
<tf.Tensor: shape=(), dtype=float64, numpy=8.059047875479163>
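To spell out where the 8.059 comes from (a quick sketch with the tensors defined above): with one-hot labels only the true-class probabilities matter, and in the second row that probability (about 9.4e-14) has been clipped up to 1e-7, so its loss is capped at -log(1e-7) ≈ 16.12 instead of about 30. The mean over the two rows is then roughly (0 + 16.12) / 2 ≈ 8.06:
# Sketch: only the true-class probabilities contribute for one-hot labels.
true_class_probs = tf.reduce_sum(Y * X_softmax_clipped, axis=-1)
print(-tf.math.log(true_class_probs))                  # roughly [1e-7, 16.118]; -log(1e-7) ~= 16.118
print(tf.reduce_mean(-tf.math.log(true_class_probs)))  # ~8.059, matching cost1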
So the problem is the epsilon value used for clipping, 1e-7. If you lower it to a much smaller value, say 1e-30, you'll get results identical to tf.nn.softmax_cross_entropy_with_logits.
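As a rough check, you can repeat the manual clip-then-log computation above with the smaller epsilon (tf.keras.backend.set_epsilon would change the global Keras value instead, but that affects everything that uses it):
# Sketch: with epsilon = 1e-30, neither true-class probability is actually clipped here.
X_softmax_clipped_small = tf.clip_by_value(X_softmax, 1e-30, 1.0 - 1e-30)
print(tf.reduce_mean(-tf.reduce_sum(Y * tf.math.log(X_softmax_clipped_small), -1)))
# ~15.0, matching tf.nn.softmax_cross_entropy_with_logits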
That's why (I guess) it is mentioned that using from_logits=True is more numerically stable.
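A quick way to see that in this example (a small sketch, reusing the scaled X and Y) is to pass the raw logits and let the loss apply the softmax internally:
# Sketch: with from_logits=True the loss works on the raw logits directly.
cce_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
print(tf.reduce_mean(cce_logits(y_true=Y, y_pred=X)))
# ~15.0, same as tf.nn.softmax_cross_entropy_with_logits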