
Understanding the accuracy metric in TensorFlow Keras for multi-label classification tasks


I have the following model code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(256, 250, input_length=640))
model_lstm.add(SpatialDropout1D(0.4))
model_lstm.add(LSTM(50, return_sequences=True))
model_lstm.add(LSTM(50))
model_lstm.add(Dense(12, activation='sigmoid'))

model_lstm.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy', 'AUC'])

I have seen answers saying that scikit-learn and Keras define the accuracy metric differently. What exactly does the metric show?

  1. Does accuracy mean that, on average, n% of the examples in the test set were classified correctly?
  2. Or does it mean that, on average, n% of the labels in each example were predicted correctly? What counts as one correct "prediction" for Keras in this case: an instance for which all labels are correct, or each individual label of each instance?

Solution

  • The docs for the metrics argument state:

    When you pass the strings 'accuracy' or 'acc', we convert this to one of tf.keras.metrics.BinaryAccuracy, tf.keras.metrics.CategoricalAccuracy, tf.keras.metrics.SparseCategoricalAccuracy based on the shapes of the targets and of the model output.

    I wasn't sure whether the string 'accuracy' resolves to BinaryAccuracy or CategoricalAccuracy for multi-label targets, or how these metrics handle multi-label input. So I did a little testing (I had never worked with multi-label targets before):

    import tensorflow as tf
    
    y =  [[0.0, 0.0, 0.0,  1.0, 1.0 ], [0.0, 0.0, 1.0, 1.0, 0.0]]  # ground truth
    x =  [[0.2, 0.1, 0.05, 0.6, 0.05], [0.1, 0.1, 0.6, 0.9, 0.2]]  # predictions
    x2 = [[0.2, 0.1, 0.05, 0.6, 0.05], [0.1, 0.1, 0.9, 0.6, 0.2]]  # same, but with 0.6 and 0.9 swapped
    
    print(tf.keras.metrics.categorical_accuracy(y, x))
    # tf.Tensor([1. 0.], shape=(2,), dtype=float32)
    print(tf.keras.metrics.categorical_accuracy(y, x2))
    # tf.Tensor([1. 1.], shape=(2,), dtype=float32)
    print(tf.keras.metrics.binary_accuracy(y, x))
    # tf.Tensor([0.8 1. ], shape=(2,), dtype=float32)
    print(tf.keras.metrics.binary_accuracy(y, x2))
    # tf.Tensor([0.8 1. ], shape=(2,), dtype=float32)
    

    y is the ground truth for two samples, each with multiple labels. x and x2 are almost identical: both get only one of the two positive labels right in the first sample, and both get both positive labels right in the second. The only difference is that in the second sample I swapped the logits of the two correct classes (0.6 and 0.9).

    You can see that categorical accuracy doesn't really work here. The reason is that internally it applies argmax to the ground-truth labels and then computes sparse categorical accuracy. That is why it reports 0.0 accuracy for the second sample of x but 1.0 for x2: it only checks whether the first 1 in the ground truth is also the highest logit in the prediction.
    Binary accuracy, on the other hand, behaves as expected. It simply counts how many 0s and 1s match after thresholding the logits at 0.5: everything below 0.5 becomes 0, everything else becomes 1.
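
    To make the two behaviors concrete, here is a minimal sketch that reproduces both metric values by hand (assuming the default 0.5 threshold that binary_accuracy uses):

    import tensorflow as tf
    
    y = tf.constant([[0.0, 0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0, 0.0]])
    x = tf.constant([[0.2, 0.1, 0.05, 0.6, 0.05], [0.1, 0.1, 0.6, 0.9, 0.2]])
    
    # categorical accuracy: one verdict per sample -- does the argmax of the
    # prediction hit the (first) argmax of the ground truth?
    cat = tf.cast(tf.equal(tf.argmax(y, axis=-1), tf.argmax(x, axis=-1)), tf.float32)
    print(cat)      # [1. 0.] -- same as tf.keras.metrics.categorical_accuracy(y, x)
    
    # binary accuracy: threshold each logit at 0.5, then take the per-sample
    # fraction of labels that match the ground truth
    bin_acc = tf.reduce_mean(
        tf.cast(tf.equal(y, tf.cast(x > 0.5, y.dtype)), tf.float32), axis=-1)
    print(bin_acc)  # [0.8 1. ] -- same as tf.keras.metrics.binary_accuracy(y, x)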

    I did another test to see which metric the model would pick for the string 'accuracy':

    # identity model: the Lambda layer just passes the input through unchanged
    model = tf.keras.Sequential([tf.keras.layers.Lambda(lambda x: x)])
    model.compile(loss='binary_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.BinaryAccuracy()])
    pred = model.evaluate(x, y)
    # 1/1 [==============================] - 0s 141ms/step - loss: 0.4936 - accuracy: 0.5000 - binary_accuracy: 0.9000
    pred = model.evaluate(x2, y)
    # 1/1 [==============================] - 0s 37ms/step - loss: 0.4936 - accuracy: 1.0000 - binary_accuracy: 0.9000
    

    The model has nothing to learn; the Lambda layer just passes the data through. It seems that 'accuracy' resolves to categorical accuracy here. This could be because the model is so basic and oddly shaped that TF can't detect that the problem is multi-label. In general, though, I'd be careful with the string 'accuracy' as a metric when dealing with multi-label classification.
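
    Applied to the compile call from the question, the safer option is to request the metrics explicitly rather than relying on the string. A sketch of what that might look like (note that tf.keras.metrics.AUC accepts a multi_label flag in TF 2.x):

    model_lstm.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       # explicit metric objects: no shape-based guessing
                       metrics=[tf.keras.metrics.BinaryAccuracy(),
                                tf.keras.metrics.AUC(multi_label=True)])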