I'm trying to fit a decision tree classifier with the scikit-learn module. I have 5 features, and one of them is categorical rather than numerical:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv()
labelEncoders = {}
# Encode every string (object-dtype) column with its own LabelEncoder
for column in df.dtypes[df.dtypes == 'object'].index:
    labelEncoders[column] = LabelEncoder()
    df[column] = labelEncoders[column].fit_transform(df[column])
    print(labelEncoders[column].inverse_transform([0, 1, 2]))  # ['High', 'Low', 'Normal']
I'm new to ML and I've been reading about the need to encode categorical features before feeding the data frame to the model, and about encoding variants such as label encoding and one-hot encoding.
Now, according to most literature, label encoding should or could be used when the values of the feature can be naturally ordered, for instance 'Low', 'Normal', 'High'. Otherwise one should use one-hot encoding, so the model doesn't establish a misleading order relationship between values that have no semantically meaningful order, for example 'Brazil', 'Congo', 'Czech Republic'.
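As a rough illustration of that distinction, the sketch below encodes an ordered column with an explicit integer mapping and an unordered column with one-hot encoding via pandas.get_dummies. The column names "Level" and "Country" and the sample rows are made up for the example.

```python
import pandas as pd

# Hypothetical data: "Level" has a natural order, "Country" does not.
df = pd.DataFrame({
    "Level": ["Low", "High", "Normal"],
    "Country": ["Brazil", "Congo", "Czech Republic"],
})

# Ordered feature: integers that follow the semantic order.
level_order = {"Low": 0, "Normal": 1, "High": 2}
df["Level"] = df["Level"].map(level_order)

# Unordered feature: one indicator column per value, so no order is implied.
encoded = pd.get_dummies(df, columns=["Country"])
print(encoded)
```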
So, that's where I'm at with the logic behind choosing an encoding strategy, and that's why I'm asking this:
how can I make scikit-learn's LabelEncoder keep the natural order of the values? That is, how can I make it encode like this:
Low -> 0
Normal -> 1
High -> 2
and NOT the way it's doing it now:
High -> 0
Low -> 1
Normal -> 2
Can this be done at all? Is it actually the encoder's task? Do I have to do it somewhere else before the encoding?
Thanks
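For context on the ordering shown in the question: LabelEncoder takes no argument for specifying an order. It stores the distinct values in sorted order in classes_ and uses each value's position there as its code, which for strings means alphabetical order. Its documentation also describes it as intended for target labels (y) rather than input features. A minimal check, using made-up values that mirror the question:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Low", "High", "Low", "High", "Normal"])

# classes_ holds the distinct values in sorted order;
# each code is the index of the value in that array.
print(encoder.classes_)  # ['High' 'Low' 'Normal']
print(codes)             # [1 0 1 0 2]
```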
You can use pandas' replace function, pandas.DataFrame.replace(), to explicitly pass in the encodings you want to use. As an example:
import pandas as pd
df = pd.DataFrame(data={
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})
print("Original:")
print(df)
label_mapping = {"Low": 0, "Normal": 1, "High": 2}
df = df.replace({"Label": label_mapping})
print("Mapped:")
print(df)
Output:
Original:
   ID   Label
0   1     Low
1   2    High
2   3     Low
3   4    High
4   5  Normal
Mapped:
   ID  Label
0   1      0
1   2      2
2   3      0
3   4      2
4   5      1
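If you would rather keep the encoding inside scikit-learn, one alternative (not what the example above uses) is OrdinalEncoder, whose categories parameter accepts the values in the order you want. The sketch below assumes the same sample frame as above and a single encoded column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})

# One list of categories per encoded column, given in the desired order.
encoder = OrdinalEncoder(categories=[["Low", "Normal", "High"]])

# OrdinalEncoder expects 2D input (hence the double brackets) and
# returns floats by default, so the result is flattened and cast to int.
df["Label"] = encoder.fit_transform(df[["Label"]]).ravel().astype(int)
print(df)
```

Because the fitted encoder remembers the order, it can be reused on new data, and encoder.inverse_transform([[0], [1], [2]]) maps the codes back to 'Low', 'Normal', 'High'.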