
Using OrdinalEncoder() to transform columns with predefined numerical values


I have a dataframe like this:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'department': ['operations','operations','support','logics', 'sales'],
                   'salary': ["low", "medium", "medium", "high", "high"],
                   'tenure': [5,6,6,8,5],
                  })
df


   department  salary  tenure
0  operations     low       5
1  operations  medium       6
2     support  medium       6
3      logics    high       8
4       sales    high       5

I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I am not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.

However, I am not getting them ordered correctly after applying OrdinalEncoder(): where the salary is 'high' I get 0, while it should be 2.

oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df

    department  salary  tenure
0   operations  1.0     5
1   operations  2.0     6
2   support     2.0     6
3   logics      0.0     8
4   sales       0.0     5

I know that I can use df["salary"] = df["salary"].replace(0, 3), but I'm hoping someone can suggest a more direct way to do it. Thank you.


Solution

  • To do this with OrdinalEncoder, pass the desired order through the categories parameter. By default the encoder sorts the categories alphabetically ('high', 'low', 'medium'), which is why 'high' came out as 0.

    For example:

    OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])
    

    Output:

    array([[0.],
           [1.],
           [1.],
           [2.],
           [2.]])
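
The call above returns a NumPy array and does not touch the dataframe. As a minimal sketch of how the result could be written back, the block below rebuilds the dataframe from the question and stores the encoded values in a new column. The names salary_order, salary_encoded, salary_map, and salary_mapped are illustrative, not part of the original answer. It also shows a pandas-only alternative using Series.map, which may be simpler when scikit-learn's fit/transform interface is not needed.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same data as in the question.
df = pd.DataFrame({
    "department": ["operations", "operations", "support", "logics", "sales"],
    "salary": ["low", "medium", "medium", "high", "high"],
    "tenure": [5, 6, 6, 8, 5],
})

# Explicit order: each category's position in this list becomes its code.
salary_order = ["low", "medium", "high"]

# categories takes one list per encoded column, hence the nested list.
oe = OrdinalEncoder(categories=[salary_order])
encoded = oe.fit_transform(df[["salary"]])  # 2-D float array, one column

# Flatten to 1-D and cast to int before storing (illustrative column name).
df["salary_encoded"] = encoded.ravel().astype(int)

# pandas-only alternative: build the mapping from the same ordered list.
salary_map = {level: code for code, level in enumerate(salary_order)}
df["salary_mapped"] = df["salary"].map(salary_map)

print(df)
```

To get the 1-based scheme from the question (low=1, medium=2, high=3), adding 1 to either column should be enough.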