
Correct way of one-hot-encoding class labels for multi-class problem


I have a classification problem with multiple classes, let's call them A, B, C and D. My data has the following shape:

X=[#samples, #features, 1], y=[#samples,1].

To be more specific, the y looks like this:

[['A'], ['B'], ['D'], ['A'], ['C'], ...]

When I train a Random Forest classifier on these labels, everything works fine. However, I have read multiple times that class labels also need to be one-hot encoded. After one-hot encoding, y is

[[1,0,0,0], [0,1,0,0], ...]

and has the shape

[#samples, 4]

The problem arises when I try to use this as the classifier's target. The model then predicts each of the four label columns individually, meaning it can also produce an output like [0 0 0 0], which I don't want. rfc.classes_ returns

# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

How would I tell the model that the labels are one hot encoded instead of multiple labels which shall be predicted independently of each other? Do I need to change my y or do I need to alter some settings of the model?
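The symptom above can be reproduced with a minimal sketch (data and shapes here are illustrative, not the asker's actual dataset): passing a 2-D one-hot y makes sklearn treat the problem as four independent binary outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 samples, 4 features
labels = rng.choice(4, size=100)       # integer class per sample
y_onehot = np.eye(4, dtype=int)[labels]  # shape (100, 4), one-hot rows

rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_onehot)
# sklearn interprets the 2-D target as four independent binary problems:
print(rfc.classes_)
# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
```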


Solution

  • You don't need one-hot encoding when using a random forest in sklearn.

    What you need is a LabelEncoder, and your y should look like this:

    from sklearn.preprocessing import LabelEncoder
    y = ["A","B","D","A","C"]
    le = LabelEncoder()
    le.fit_transform(y)
    # array([0, 1, 3, 0, 2], dtype=int64)
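
    If you later need the original string labels back from integer-encoded predictions, the same encoder can invert the mapping (a small sketch continuing the example above):

    ```python
    from sklearn.preprocessing import LabelEncoder

    y = ["A", "B", "D", "A", "C"]
    le = LabelEncoder()
    y_enc = le.fit_transform(y)         # array([0, 1, 3, 0, 2])
    print(le.classes_)                  # ['A' 'B' 'C' 'D'] -- sorted class names
    print(le.inverse_transform(y_enc))  # ['A' 'B' 'D' 'A' 'C'] -- back to strings
    ```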
    

    I modified the sample code sklearn provides:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    import numpy as np

    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0, shuffle=False)
    # Replace the numeric target with random string class labels:
    y = np.random.choice(["A", "B", "C", "D"], 1000)
    print(y.shape)  # (1000,)

    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X, y)
    clf.classes_
    # array(['A', 'B', 'C', 'D'], dtype='<U1')
    

    Whether you encode y with LabelEncoder or keep the raw string labels, both work with RandomForestClassifier.
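
    If a downstream model genuinely requires one-hot targets (e.g. a neural network), sklearn's LabelBinarizer builds the [#samples, 4] matrix and can invert predictions back to class names. This goes beyond the original answer; a minimal sketch:

    ```python
    from sklearn.preprocessing import LabelBinarizer

    y = ["A", "B", "D", "A", "C"]
    lb = LabelBinarizer()
    y_onehot = lb.fit_transform(y)  # shape (5, 4), one column per class
    print(y_onehot[0])              # [1 0 0 0] -- row for 'A'

    # Map one-hot (or argmax-able score) rows back to string labels:
    print(lb.inverse_transform(y_onehot))  # ['A' 'B' 'D' 'A' 'C']
    ```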