How to maintain natural order when label encoding with scikit-learn


I'm trying to fit a decision tree classifier with scikit-learn. I have 5 features, and one of them is categorical rather than numerical:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv()
labelEncoders = {}
for column in df.dtypes[df.dtypes == 'object'].index:
    labelEncoders[column] = LabelEncoder()
    df[column] = labelEncoders[column].fit_transform(df[column])
    print(labelEncoders[column].inverse_transform([0, 1, 2])) #['High', 'Low', 'Normal']

I'm new to ML, and I've been reading about the need to encode categorical features before feeding the data frame to the model, and about encoding variants such as label encoding and one-hot encoding.

According to most of the literature, label encoding should (or could) be used when the values of the feature have a natural order, for instance 'Low', 'Normal', 'High'. Otherwise, one should use one-hot encoding, so that the model doesn't infer a misleading order relationship between values that have no meaningful order, for example 'Brazil', 'Congo', 'Czech Republic'.

So that's where I'm at with the logic behind choosing an encoding strategy, and that's why I'm asking this:

How can I make scikit-learn's LabelEncoder keep the natural order of the values, so that it encodes like this:

Low -> 0
Normal -> 1
High -> 2

and NOT the way it's doing it now:

High -> 0
Low -> 1
Normal -> 2

Can this be done at all? Is it actually the encoder's task? Do I have to do it somewhere else before the encoding?

Thanks


Solution

  • You can use pandas' replace function, pandas.DataFrame.replace(), to explicitly pass in the encodings you want to use. As an example:

    import pandas as pd
    
    df = pd.DataFrame(data={
        "ID": [1, 2, 3, 4, 5],
        "Label": ["Low", "High", "Low", "High", "Normal"],
    })
    
    print("Original:")
    print(df)
    
    label_mapping = {"Low": 0, "Normal": 1, "High": 2}
    df = df.replace({"Label": label_mapping})
    
    print("Mapped:")
    print(df)
    

    Output:

    Original:
       ID   Label
    0   1     Low
    1   2    High
    2   3     Low
    3   4    High
    4   5  Normal
    Mapped:
       ID  Label
    0   1      0
    1   2      2
    2   3      0
    3   4      2
    4   5      1
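
  • If you would rather keep the encoding inside scikit-learn, one possible alternative (not part of the answer above) is OrdinalEncoder, which accepts an explicit category order through its categories parameter. The sketch below reuses the same toy data frame and assumes the order Low < Normal < High from the question; the variable names encoder, encoded and decoded are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data={
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})

# One list of categories per encoded column, in the desired order.
# The position in the list becomes the code: Low -> 0, Normal -> 1, High -> 2.
encoder = OrdinalEncoder(categories=[["Low", "Normal", "High"]])

# OrdinalEncoder expects 2D input, hence the double brackets.
# It returns a float array of shape (n_samples, 1), so flatten it and cast to int.
encoded = encoder.fit_transform(df[["Label"]])
df["Label"] = encoded.ravel().astype(int)

print("Mapped:")
print(df)

# inverse_transform maps the codes back to the original strings.
decoded = encoder.inverse_transform([[0], [1], [2]])
print(decoded.ravel())
```

    The scikit-learn documentation describes LabelEncoder as intended for target values (y), while OrdinalEncoder is meant for feature columns, which is why it is the one that exposes the category order. With the default handle_unknown="error", a value that is missing from the categories list raises an error, so typos in the data surface early instead of being encoded silently.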