python, machine-learning, encoding, scikit-learn, one-hot-encoding

How to encode data with multiple class labels?


I have a classification problem with multiple classes, say A, B, C and D. My data has the following y labels:

y0 = [['A'], ['B'], ['A','D'], ['A'], ['A','C','D'], ['D'], ..., ['C'], ['A','B','C','D'] , ['B']]

I want to train a Random Forest classifier on these labels. First I need to encode the labels. I first tried LabelEncoder:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
le.fit_transform(y0)
# encoded labels: array([0, 1, 2, 0, 3, 4, ... 5, 6, 1], dtype=int64)

I also tried OneHotEncoder, but neither LabelEncoder nor OneHotEncoder works here: they cannot encode samples that carry multiple class labels (e.g. ['A','B','C']). I guess these trivial encoding methods are not the way to go, so what is the best way to encode my class labels? To clarify, I don't want to treat e.g. ['A','B'] as a completely different class from ['A'] or ['B']. I want it to be a distinct class that still inherits from both the A and B classes.


Solution

  • This kind of problem is called multilabel classification (as opposed to multiclass, where each sample has exactly one class label), and sklearn expects multilabel targets to be encoded as a binary indicator array of shape (n_samples, n_classes). You can put your data in that format with MultiLabelBinarizer, as sketched below.
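
    A minimal sketch of that approach, using the labels A–D from the question; X in the commented-out lines is a hypothetical feature matrix (not data from the question):

    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.ensemble import RandomForestClassifier

    y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'], ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y0)

    print(mlb.classes_)  # array(['A', 'B', 'C', 'D'], dtype=object)
    print(Y[:3])
    # [[1 0 0 0]
    #  [0 1 0 0]
    #  [1 0 0 1]]

    # RandomForestClassifier accepts a multilabel indicator matrix directly,
    # so it can be fit on Y with no further transformation.
    # X is a hypothetical feature matrix with one row per sample in y0.
    # clf = RandomForestClassifier().fit(X, Y)
    # predictions = mlb.inverse_transform(clf.predict(X))  # back to label lists

    With this encoding, ['A','B'] is not a separate seventh class: it is represented by two active columns, so a sample with both labels shares those output columns with samples labeled only ['A'] or only ['B'].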