When I calculate Binary Crossentropy by hand, I apply sigmoid to get probabilities, then use the cross-entropy formula and take the mean of the result:
import tensorflow as tf

logits = tf.constant([-1, -1, 0, 1, 2.])
labels = tf.constant([0, 0, 1, 1, 1.])
probs = tf.nn.sigmoid(logits)
loss = labels * (-tf.math.log(probs)) + (1 - labels) * (-tf.math.log(1 - probs))
print(tf.reduce_mean(loss).numpy()) # 0.35197204
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
loss = cross_entropy(labels, logits)
print(loss.numpy()) # 0.35197204
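For reference, the same number should also come straight out of tf.nn.sigmoid_cross_entropy_with_logits, which works on the raw logits and avoids the log(sigmoid(...)) round-trip:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
print(tf.reduce_mean(loss).numpy()) # ~0.35197, same value as above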
How do I calculate Categorical Cross-Entropy when logits and labels have different shapes?
logits = tf.constant([[-3.27133679, -22.6687183, -4.15501118, -5.14916372, -5.94609261,
-6.93373299, -5.72364092, -9.75725174, -3.15748906, -4.84012318],
[-11.7642536, -45.3370094, -3.17252636, 4.34527206, -17.7164974,
-0.595088899, -17.6322937, -2.36941719, -6.82157373, -3.47369862],
[-4.55468369, -1.07379043, -3.73261762, -7.08982277, -0.0288562477,
-5.46847963, -0.979336262, -3.03667569, -3.29502845, -2.25880361]])
labels = tf.constant([2, 3, 4])
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
reduction='none')
loss = loss_object(labels, logits)
print(loss.numpy()) # [2.0077195 0.00928135 0.6800677 ]
print(tf.reduce_mean(loss).numpy()) # 0.8990229
I mean, how can I get the same result ([2.0077195 0.00928135 0.6800677]) by hand?
@OverLordGoldDragon's answer is correct. In TF 2.0 it looks like this:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True, reduction='none')
loss = loss_object(labels, logits)
print(f'{loss.numpy()}\n{tf.math.reduce_sum(loss).numpy()}')
one_hot_labels = tf.one_hot(labels, 10)
preds = tf.nn.softmax(logits)
preds /= tf.math.reduce_sum(preds, axis=-1, keepdims=True)
loss = tf.math.reduce_sum(tf.math.multiply(one_hot_labels, -tf.math.log(preds)), axis=-1)
print(f'{loss.numpy()}\n{tf.math.reduce_sum(loss).numpy()}')
# [2.0077195 0.00928135 0.6800677 ]
# 2.697068691253662
# [2.0077198 0.00928142 0.6800677 ]
# 2.697068929672241
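As a side note (not part of the quoted answer): multiplying by the one-hot labels just picks out the log-probability of the true class, so the same per-example values can be read off directly with log_softmax and gather:
log_probs = tf.nn.log_softmax(logits, axis=-1) # numerically stabler than log(softmax(...))
loss = -tf.gather(log_probs, labels, batch_dims=1) # per-example loss, shape (3,)
print(loss.numpy()) # ~[2.0077195 0.00928135 0.6800677]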
For language models:
vocab_size = 9
seq_len = 6
batch_size = 2
labels = tf.reshape(tf.range(batch_size*seq_len), (batch_size,seq_len)) # (2, 6); note: ids 9-11 exceed vocab_size-1, hence the nan below
logits = tf.random.normal((batch_size,seq_len,vocab_size)) # (2, 6, 9)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True, reduction='none')
loss = loss_object(labels, logits)
print(f'{loss.numpy()}\n{tf.math.reduce_sum(loss).numpy()}')
one_hot_labels = tf.one_hot(labels, vocab_size)
preds = tf.nn.softmax(logits)
preds /= tf.math.reduce_sum(preds, axis=-1, keepdims=True)
loss = tf.math.reduce_sum(tf.math.multiply(one_hot_labels, -tf.math.log(preds)), axis=-1)
print(f'{loss.numpy()}\n{tf.math.reduce_sum(loss).numpy()}')
# [[1.341706 3.2518263 2.6482694 3.039099 1.5835983 4.3498387]
# [2.67237 3.3978183 2.8657475 nan nan nan]]
# nan
# [[1.341706 3.2518263 2.6482694 3.039099 1.5835984 4.3498387]
# [2.67237 3.3978183 2.8657475 0. 0. 0. ]]
# 25.1502742767334
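A note on the nan values above: label ids 9-11 fall outside the 9-class vocabulary, so the library loss is undefined for those positions while the one-hot version silently yields 0. With in-range labels the two computations agree; a minimal check (labels % vocab_size is only a hypothetical fix for this toy example):
valid_labels = labels % vocab_size # keep every id inside [0, vocab_size)
lib_loss = loss_object(valid_labels, logits) # (2, 6) per-token loss from Keras
own_loss = -tf.gather(tf.nn.log_softmax(logits, axis=-1), valid_labels, batch_dims=2)
print(tf.reduce_max(tf.abs(lib_loss - own_loss)).numpy()) # ~0, the two agree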
SparseCategoricalCrossentropy is CategoricalCrossentropy that takes integer labels instead of one-hot ones. Example from the source code; the two below are equivalent:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# TF 1.x graph-mode style, as in the original answer; reduction='none' keeps the per-sample values
scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction='none')
cce  = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction='none')

labels_scce = K.variable([[0, 1, 2]])
labels_cce  = K.variable([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
preds       = K.variable([[.90, .05, .05], [.50, .89, .60], [.05, .01, .94]])

loss_cce  = cce(labels_cce, preds)
loss_scce = scce(labels_scce, preds)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([loss_cce, loss_scce])
    print(K.get_value(loss_cce))
    print(K.get_value(loss_scce))
# [0.10536055 0.8046684 0.0618754]
# [0.10536055 0.8046684 0.0618754]
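In TF 2.x eager mode the same equivalence can be checked without a session; a minimal sketch (fresh names, not part of the original answer):
scce2 = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')
cce2 = tf.keras.losses.CategoricalCrossentropy(reduction='none')
int_labels = tf.constant([0, 1, 2])
onehot_labels = tf.constant([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
probs = tf.constant([[.90, .05, .05], [.50, .89, .60], [.05, .01, .94]])
print(cce2(onehot_labels, probs).numpy()) # [0.10536055 0.8046684 0.0618754]
print(scce2(int_labels, probs).numpy()) # same three values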
As to how to do it 'by hand', we can refer to the Numpy backend:
np_labels = K.get_value(labels_cce)
np_preds = K.get_value(preds)
losses = []
for label, pred in zip(np_labels, np_preds):
    pred /= pred.sum(axis=-1, keepdims=True)
    losses.append(np.sum(label * -np.log(pred), axis=-1, keepdims=False))
print(losses)
# [0.10536055 0.8046684 0.0618754]
from_logits = True: preds is the model output before it is passed through softmax (so we pass it through softmax ourselves)

from_logits = False: preds is the model output after it has been passed through softmax (so we skip this step)

(A short check of this equivalence is sketched right after the summary list below.)

So in summary, to compute it by hand:

- pred /= ... normalizes the predictions before taking logs; this way, high-probability preds on zero-labels penalize correct predictions on one-labels. If from_logits = False, this step is skipped, since softmax already does the normalization. See this snippet. Further reading.
- take log (base e) only where label == 1
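A minimal sketch of that from_logits check (my own toy numbers, not from the answer): feeding raw logits to a from_logits=True loss gives the same value as feeding softmaxed probabilities to a from_logits=False loss.
logits_demo = tf.constant([[2.0, 1.0, 0.1]])
labels_demo = tf.constant([[1., 0., 0.]])
cce_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
cce_from_probs = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
print(cce_from_logits(labels_demo, logits_demo).numpy()) # ~0.417
print(cce_from_probs(labels_demo, tf.nn.softmax(logits_demo)).numpy()) # same value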
Lastly, the mathematical formula for categorical crossentropy is:
$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\mathbf{1}_{y_i \in C_c}\,\log p_{\text{model}}[y_i \in C_c]$$

where:

- i iterates over the N observations
- c iterates over the C classes
- 1_{y_i \in C_c} is the indicator (the one-hot label): 1 if observation i belongs to class c, else 0
- p_model[y_i \in C_c] is the predicted probability of observation i belonging to class c
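As a concrete check (my own sketch, reusing the one-hot labels and preds from the example above), the formula transcribes directly into NumPy:
y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) # one-hot labels: N=3 observations, C=3 classes
p = np.array([[.90, .05, .05], [.50, .89, .60], [.05, .01, .94]])
p = p / p.sum(axis=-1, keepdims=True) # normalize rows to probabilities (the pred /= ... step)
J = -(y * np.log(p)).sum(axis=-1).mean() # -1/N * sum_i sum_c 1{y_i in C_c} * log p_model[y_i in C_c]
print(J) # ~0.324, the mean of [0.10536055 0.8046684 0.0618754]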