python, pandas, scikit-learn, sklearn-pandas

How to order categorical string features in order of severity?


One of the features in my data set is a score stored as a categorical string, like this:

Score
X1c
X3a
X1a
X2b
X4
X1a
X1b
X4

Here X1a is the weakest, followed by X1b, X1c, X2a, X2b ... X4, with X4 being the strongest. How can I encode these as integers so that X1a gets the lowest value and X4 the highest? I plan to use a random forest classifier. The training set is a separate data set, so the encoding must stay the same for new data sets.


Solution

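  • You can try using rank with method='dense', which numbers the distinct labels in sorted (lexicographic) order. It only numbers the labels present in the column, so the integers are not guaranteed to match across data sets: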

    df['Score_int'] = df.Score.rank(method='dense')
    

    Output:

      Score  Score_int
    0   X1c        3.0
    1   X3a        5.0
    2   X1a        1.0
    3   X2b        4.0
    4    X4        6.0
    5   X1a        1.0
    6   X1b        2.0
    7    X4        6.0
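
Since the encoding has to stay the same for new data sets, one option is to fix the label order explicitly instead of deriving it from the data. The following is a minimal sketch using an ordered pandas categorical. The names SEVERITY_ORDER, severity_dtype and encode_score are made up for illustration, and the list of labels is an assumption: the question elides part of the scale ("X2b ... X4"), so replace it with the complete scale, weakest first.

```python
import pandas as pd

# ASSUMPTION: placeholder scale. The question does not list every label
# between X2b and X4, so fill in the real, complete order (weakest first).
SEVERITY_ORDER = ["X1a", "X1b", "X1c", "X2a", "X2b", "X3a", "X4"]

# An ordered dtype fixes the label -> integer mapping independently of
# which labels happen to appear in a given data set.
severity_dtype = pd.CategoricalDtype(categories=SEVERITY_ORDER, ordered=True)


def encode_score(scores: pd.Series) -> pd.Series:
    """Return each label's position in SEVERITY_ORDER (0 = weakest)."""
    return scores.astype(severity_dtype).cat.codes


train = pd.DataFrame(
    {"Score": ["X1c", "X3a", "X1a", "X2b", "X4", "X1a", "X1b", "X4"]}
)
train["Score_int"] = encode_score(train["Score"])

# A new data set containing only some of the labels gets the same integers.
new = pd.DataFrame({"Score": ["X4", "X2a"]})
new["Score_int"] = encode_score(new["Score"])

print(train)
print(new)
```

With this approach the integers start at 0 and are plain ints rather than floats. A label that is not in the list is treated as missing and receives the code -1, so it is worth checking for -1 before training.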