
Correct way of one-hot-encoding class labels for multi-class problem


I have a classification problem with multiple classes, let's call them A, B, C and D. My data has the following shape:

X=[#samples, #features, 1], y=[#samples,1].

To be more specific, the y looks like this:

[['A'], ['B'], ['D'], ['A'], ['C'], ...]

When I train a Random Forest classifier on these labels, everything works fine. However, I have read multiple times that class labels also need to be one-hot encoded. After one-hot encoding, y is

[[1,0,0,0], [0,1,0,0], ...]

and has the shape

[#samples, 4]

The problem arises when I try to use this as the classifier's target. The model then predicts each of the four label columns individually, meaning it can also produce an output like [0 0 0 0], which I don't want. rfc.classes_ returns

# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

How would I tell the model that the labels are one hot encoded instead of multiple labels which shall be predicted independently of each other? Do I need to change my y or do I need to alter some settings of the model?
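The symptom above can be reproduced with a minimal sketch (data and shapes here are illustrative, not the asker's actual dataset): passing a 2-D one-hot y makes sklearn treat the problem as four independent binary outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 samples, 4 features
labels = rng.choice(4, size=100)       # integer class per sample
y_onehot = np.eye(4, dtype=int)[labels]  # shape (100, 4), one-hot rows

rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_onehot)
# sklearn interprets the 2-D target as four independent binary problems:
print(rfc.classes_)
# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
```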


Solution

  • You don't need one-hot encoding when using a random forest in sklearn.

    What you need is a LabelEncoder, and your y should look like this:

    from sklearn.preprocessing import LabelEncoder
    y = ["A","B","D","A","C"]
    le = LabelEncoder()
    le.fit_transform(y)
    # array([0, 1, 3, 0, 2], dtype=int64)
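
    If you later need the original string labels back from integer-encoded predictions, the same encoder can invert the mapping (a small sketch continuing the example above):

    ```python
    from sklearn.preprocessing import LabelEncoder

    y = ["A", "B", "D", "A", "C"]
    le = LabelEncoder()
    y_enc = le.fit_transform(y)         # array([0, 1, 3, 0, 2])
    print(le.classes_)                  # ['A' 'B' 'C' 'D'] -- sorted class names
    print(le.inverse_transform(y_enc))  # ['A' 'B' 'D' 'A' 'C'] -- back to strings
    ```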
    

    I modified the sample code sklearn provides:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    import numpy as np

    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0, shuffle=False)
    # Replace the numeric target with random string class labels:
    y = np.random.choice(["A", "B", "C", "D"], 1000)
    print(y.shape)  # (1000,)

    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X, y)
    clf.classes_
    # array(['A', 'B', 'C', 'D'], dtype='<U1')
    

    Whether you encode y with LabelEncoder or keep the raw string labels, both work with RandomForestClassifier.
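
    If a downstream model genuinely requires one-hot targets (e.g. a neural network), sklearn's LabelBinarizer builds the [#samples, 4] matrix and can invert predictions back to class names. This goes beyond the original answer; a minimal sketch:

    ```python
    from sklearn.preprocessing import LabelBinarizer

    y = ["A", "B", "D", "A", "C"]
    lb = LabelBinarizer()
    y_onehot = lb.fit_transform(y)  # shape (5, 4), one column per class
    print(y_onehot[0])              # [1 0 0 0] -- row for 'A'

    # Map one-hot (or argmax-able score) rows back to string labels:
    print(lb.inverse_transform(y_onehot))  # ['A' 'B' 'D' 'A' 'C']
    ```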