I have the following model code:
model_lstm.add(Embedding(256, 250, input_length=640))
model_lstm.add(SpatialDropout1D(0.4))
model_lstm.add(LSTM(50, return_sequences=True))
model_lstm.add(LSTM(50))
model_lstm.add(Dense(12, activation='sigmoid'))
model_lstm.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy', 'AUC'])
I have seen answers saying that scikit-learn and Keras define the accuracy metric differently. What exactly does the metric show?
In the docs, the description of the metrics argument states:
When you pass the strings 'accuracy' or 'acc', we convert this to one of tf.keras.metrics.BinaryAccuracy, tf.keras.metrics.CategoricalAccuracy, tf.keras.metrics.SparseCategoricalAccuracy based on the shapes of the targets and of the model output.
I wasn't sure whether the string 'accuracy' resolves to BinaryAccuracy or CategoricalAccuracy for multi-label targets, and how these metrics deal with multi-label input. So I did a little testing (I had never worked with multi-label targets before):
import tensorflow as tf
y = [[0.0, 0.0, 0.0, 1.0, 1.0 ], [0.0, 0.0, 1.0, 1.0, 0.0]] # ground truth
x = [[0.2, 0.1, 0.05, 0.6, 0.05], [0.1, 0.1, 0.6, 0.9, 0.2]]  # predictions
x2 = [[0.2, 0.1, 0.05, 0.6, 0.05], [0.1, 0.1, 0.9, 0.6, 0.2]]  # same, but with 0.6 and 0.9 swapped
print(tf.keras.metrics.categorical_accuracy(y, x))
# tf.Tensor([1. 0.], shape=(2,), dtype=float32)
print(tf.keras.metrics.categorical_accuracy(y, x2))
# tf.Tensor([1. 1.], shape=(2,), dtype=float32)
print(tf.keras.metrics.binary_accuracy(y, x))
# tf.Tensor([0.8 1. ], shape=(2,), dtype=float32)
print(tf.keras.metrics.binary_accuracy(y, x2))
# tf.Tensor([0.8 1. ], shape=(2,), dtype=float32)
y is the ground truth for two samples with multiple labels. x and x2 are almost the same: each gets only one of the two labels right in the first sample, and both labels right in the second one. The only difference is that in the second sample I swapped the logits for the two correct classes (0.6 and 0.9).
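To make that concrete, here is a quick sanity check, thresholding the predictions at 0.5 by hand (my own check, reusing x from above, not part of any metric API):
print(tf.cast(tf.constant(x) > 0.5, tf.float32))
# [[0. 0. 0. 1. 0.]   -> one of the two true labels found (4/5 positions match y)
#  [0. 0. 1. 1. 0.]]  -> both labels found (matches y exactly)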
You can see that categorical accuracy doesn't really work here. The reason is that internally it takes the argmax of the ground truth labels and then applies sparse_categorical_accuracy. That is why the second sample gets 0.0 accuracy for x but 1.0 for x2: the metric only checks whether the position of the first 1 in the ground truth is also the position of the highest logit in the prediction.
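Here is a minimal sketch of what categorical_accuracy effectively computes (simplified, reusing y, x and x2 from above; the real implementation also handles shape and dtype conversions):
def categorical_accuracy_sketch(y_true, y_pred):
    # Index of the (first) highest ground-truth entry vs. index of the highest logit.
    return tf.cast(tf.equal(tf.argmax(y_true, axis=-1),
                            tf.argmax(y_pred, axis=-1)), tf.float32)

print(categorical_accuracy_sketch(y, x))   # tf.Tensor([1. 0.], ...) -- matches the real metric
print(categorical_accuracy_sketch(y, x2))  # tf.Tensor([1. 1.], ...)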
Binary accuracy, on the other hand, looks better. It simply compares how many 0s and 1s match after thresholding the logits at 0.5: everything above 0.5 becomes 1, everything else becomes 0.
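And a corresponding sketch for binary_accuracy (again simplified; the real metric also accepts a configurable threshold argument, defaulting to 0.5):
def binary_accuracy_sketch(y_true, y_pred, threshold=0.5):
    # Threshold the logits, then take the fraction of matching labels per sample.
    y_pred_bin = tf.cast(tf.constant(y_pred) > threshold, tf.float32)
    return tf.reduce_mean(tf.cast(tf.equal(tf.constant(y_true), y_pred_bin),
                                  tf.float32), axis=-1)

print(binary_accuracy_sketch(y, x))   # tf.Tensor([0.8 1. ], ...) -- 4/5 and 5/5 labels match
print(binary_accuracy_sketch(y, x2))  # tf.Tensor([0.8 1. ], ...)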
I did another test to see which metric the model would pick for the string 'accuracy':
model = tf.keras.Sequential([tf.keras.layers.Lambda(lambda x: x)])
model.compile(loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.BinaryAccuracy()])
pred = model.evaluate(x, y)
# 1/1 [==============================] - 0s 141ms/step - loss: 0.4936 - accuracy: 0.5000 - binary_accuracy: 0.9000
pred = model.evaluate(x2, y)
# 1/1 [==============================] - 0s 37ms/step - loss: 0.4936 - accuracy: 1.0000 - binary_accuracy: 0.9000
The model has nothing to learn; the Lambda layer just passes the data through. It seems that 'accuracy' picks categorical accuracy. This could be because the model here is so basic and oddly shaped that TF can't detect that the problem is multi-label. But in general, I'd be careful with the string 'accuracy' as a metric when dealing with multi-label classification.
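If you want to be safe, pass the metric objects explicitly instead of the ambiguous string. For the model from the question, that would look like this (only the metrics argument changes):
model_lstm.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=[tf.keras.metrics.BinaryAccuracy(),
                            tf.keras.metrics.AUC()])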