I have a dataframe like this:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'department': ['operations', 'operations', 'support', 'logics', 'sales'],
                   'salary': ['low', 'medium', 'medium', 'high', 'high'],
                   'tenure': [5, 6, 6, 8, 5],
                   })
df
department salary tenure
0 operations low 5
1 operations medium 6
2 support medium 6
3 logics high 8
4 sales high 5
I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I'm not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.
However, the values are not ordered correctly after applying OrdinalEncoder(): where the salary is 'high' I get 0, while it should be 2.
oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df
department salary tenure
0 operations 1.0 5
1 operations 2.0 6
2 support 2.0 6
3 logics 0.0 8
4 sales 0.0 5
I know that I can use df["salary"] = df["salary"].replace(0, 3), but I'm hoping someone can suggest a more direct way to do it. Thank you.
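If you want to perform this operation using OrdinalEncoder, you can use the categories parameter to specify the ordering. By default the encoder sorts the categories alphabetically (high, low, medium), which is why 'high' was encoded as 0.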
As follows:
OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])
Output:
array([[0.],
       [1.],
       [1.],
       [2.],
       [2.]])
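To write the result back into the dataframe and check which order the encoder actually used, something like the following should work. It is a minimal sketch built on the dataframe from the question; the salary_encoded column name is my own choice for illustration and is not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "department": ["operations", "operations", "support", "logics", "sales"],
    "salary": ["low", "medium", "medium", "high", "high"],
    "tenure": [5, 6, 6, 8, 5],
})

# categories takes one inner list per encoded column;
# the position of a value in that list becomes its code.
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])

# fit_transform returns a 2-D float array, so flatten it before assigning.
# "salary_encoded" is a hypothetical column name used for illustration.
df["salary_encoded"] = oe.fit_transform(df[["salary"]]).ravel()

# categories_ shows the order the fitted encoder uses for each column.
print(oe.categories_)
print(df)
```

If you prefer codes starting at 1, adding 1 to the encoded column afterwards is enough; the relative order of the categories stays the same.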