Tags: python, tensorflow, keras, nlp, kaggle

Multi-label classification implementation


So far I have used Keras/TensorFlow for image processing, NLP, and time series prediction. Whenever there were multiple categories, the task was always to predict which single class a sample belongs to. For example, given the list of possible classes [car, human, airplane, flower, building], the model would output a probability for each class, and in a very confident prediction one class had a very high probability while the others were very low.

Now I came across this Kaggle challenge, the Toxic Comment Classification Challenge, and specifically this implementation. I thought that this is a multi-label classification problem, since one sample can belong to several classes. And indeed, when I check the final prediction:

[image: example prediction output with per-class probabilities]

I can see that the first sample has a very high probability for both toxic and obscene. With my knowledge so far, a standard model would predict which one of these classes the sample belongs to, so either class 1 or 2 or .... A confident prediction would then give a high probability for toxic and low probabilities for the others, while an unconfident prediction might give 0.4x for toxic, 0.4x for obscene and small probabilities for the rest.

Now I was surprised by how the implementation was done, or rather I do not understand the following: how is multi-label classification done, as opposed to the "usual" single-label model?

When checking the code I see the following model:

from keras.layers import Input, Embedding, Bidirectional, LSTM, GlobalMaxPool1D, Dense, Dropout
from keras.models import Model

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)  # one output per label, each in [0, 1]
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

I understand that x = Dense(6, activation="sigmoid") results from having to predict 6 classes; that matches my knowledge so far. But why does this produce probabilities suitable for multi-label classification? Where is the implementation difference between multi-label classification and predicting just one label out of several choices?

Is the difference simply that binary crossentropy is used instead of (sparse) categorical crossentropy together with the 6 outputs? Does that mean the model treats each of the 6 classes as a separate binary problem, giving a probability for each class independently, so that a sample can have a high probability of belonging to several classes at once?


Solution

  • The loss function to use is indeed binary_crossentropy with a sigmoid activation.

    categorical_crossentropy is not suitable for multi-label problems, because in a multi-label problem the labels are not mutually exclusive. That bears repeating: the labels are not mutually exclusive.

    This means that a label vector such as [1,0,1,0,0,0] is perfectly valid. categorical_crossentropy with softmax will always tend to favour one specific class, but that is not what we want here; just as you saw, a comment can be both toxic and obscene. A minimal NumPy sketch of this contrast follows below.
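
    Here is a minimal sketch (plain NumPy, with made-up logits) of why softmax cannot express "both toxic and obscene" while sigmoid can:

    import numpy as np

    logits = np.array([2.0, -1.0, 1.8, -2.0, -1.5, -0.5])  # raw scores: toxic and obscene both high

    # softmax: the probabilities compete and must sum to 1 -> exactly one "winner"
    softmax = np.exp(logits) / np.exp(logits).sum()
    print(softmax.round(2))   # [0.5  0.02 0.41 0.01 0.02 0.04] -> sums to 1.0

    # sigmoid: each probability is computed independently -> several can be high
    sigmoid = 1 / (1 + np.exp(-logits))
    print(sigmoid.round(2))   # [0.88 0.27 0.86 0.12 0.18 0.38] -> no sum constraint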

    Now imagine photos with cats and dogs in them. What happens if a photo contains 2 dogs and 2 cats? Is it a dog picture or a cat picture? It is actually a "both" picture! We definitely need a way to express that multiple labels apply to one and the same photo.

    The rationale for using binary_crossentropy and sigmoid for multi-label classification lies in their mathematical properties: each output needs to be treated as an independent Bernoulli distribution.
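
    A small sketch (with hypothetical target and prediction values) of what binary_crossentropy actually computes: one Bernoulli log-loss per label, averaged over the labels, with no interaction between them:

    import numpy as np

    y_true = np.array([1, 0, 1, 0, 0, 0])              # multi-hot target: toxic + obscene
    y_pred = np.array([0.9, 0.2, 0.8, 0.1, 0.1, 0.1])  # sigmoid outputs

    # binary cross-entropy per label; each term only sees "its own" label
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    print(per_label.round(3))  # [0.105 0.223 0.223 0.105 0.105 0.105]
    print(per_label.mean())    # ~0.145, the value Keras reports as the loss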

    Therefore, the only correct solution is BCE + 'sigmoid'.
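
    At prediction time this means each of the 6 probabilities is thresholded on its own, so one comment can receive several labels. A usage sketch (assuming the trained model and the tokenized X_test from the kernel):

    labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    probs = model.predict(X_test)        # shape (n_samples, 6), each value in [0, 1]
    preds = (probs > 0.5).astype(int)    # independent per-label decision

    # e.g. preds[0] == [1, 0, 1, 0, 0, 0] -> both 'toxic' and 'obscene'
    print([label for label, p in zip(labels, preds[0]) if p])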