python, machine-learning, encoding, scikit-learn, one-hot-encoding

How to encode data with multiple class labels?


I have a classification problem with multiple classes, say A, B, C and D. My data has the following y labels:

y0 = [['A'], ['B'], ['A','D'], ['A'], ['A','C','D'], ['D'], ..., ['C'], ['A','B','C','D'] , ['B']]

I want to train a Random Forest classifier on these labels. First I need to encode the labels. I first tried LabelEncoder:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
le.fit_transform(y0)
# encoded labels: array([0, 1, 2, 0, 3, 4, ... 5, 6, 1], dtype=int64)

I also tried OneHotEncoder, but neither LabelEncoder nor OneHotEncoder works here: they cannot encode samples that carry multiple class labels (e.g. ['A','B','C']). I guess these trivial encoding methods are not the way to go, so what is the best way to encode my class labels? To clarify, I don't want to treat e.g. ['A','B'] as a completely different class from ['A'] or ['B']. I want it to be a distinct class that still inherits from both the A and B classes.


Solution

  • This kind of problem is called multilabel classification (as opposed to multiclass, where each sample has exactly one class label), and sklearn expects multilabel targets to be encoded as a binary indicator array of shape (n_samples, n_classes). You can put your data in that format with MultiLabelBinarizer, as sketched below.
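
    A minimal sketch of that approach, using the labels A–D from the question; X in the commented-out lines is a hypothetical feature matrix (not data from the question):

    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.ensemble import RandomForestClassifier

    y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'], ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y0)

    print(mlb.classes_)  # array(['A', 'B', 'C', 'D'], dtype=object)
    print(Y[:3])
    # [[1 0 0 0]
    #  [0 1 0 0]
    #  [1 0 0 1]]

    # RandomForestClassifier accepts a multilabel indicator matrix directly,
    # so it can be fit on Y with no further transformation.
    # X is a hypothetical feature matrix with one row per sample in y0.
    # clf = RandomForestClassifier().fit(X, Y)
    # predictions = mlb.inverse_transform(clf.predict(X))  # back to label lists

    With this encoding, ['A','B'] is not a separate seventh class: it is represented by two active columns, so a sample with both labels shares those output columns with samples labeled only ['A'] or only ['B'].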