If one of the features in my data set is a score stored as a categorical string, like:
Score
X1c
X3a
X1a
X2b
X4
X1a
X1b
X4
Where X1a is the weakest, followed by X1b, X1c, X2a, X2b ... X4, with X4 being the strongest, how can I encode it to integers such that X1a is the lowest int and X4 is the highest int? I'm looking to use a random forest classifier. Also, the training set is a separate data set, so this encoding should be maintained for new data sets.
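You can try using rank with method='dense'. This works here because the labels happen to sort alphabetically in order of strength, but note that the ranks depend on which values are present in the data frame, so the same label may get a different integer in another data set: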
df['Score_int'] = df.Score.rank(method='dense')
Output:
Score Score_int
0 X1c 3.0
1 X3a 5.0
2 X1a 1.0
3 X2b 4.0
4 X4 6.0
5 X1a 1.0
6 X1b 2.0
7 X4 6.0
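If the encoding has to stay the same across the training set and any new data, one option is to fix the order explicitly instead of deriving it from the data. Below is a minimal sketch using an ordered pd.Categorical. The LEVELS list, the encode_score helper, and the new data frame are assumptions made for illustration: the question elides the levels between X2b and X4, so the list here only contains the labels that appear in the post and has to be replaced with the real, complete one.

```python
import pandas as pd

# ASSUMPTION: the full list of levels, weakest to strongest. The question
# elides the levels between X2b and X4, so this list is only illustrative
# and must be replaced with the real one.
LEVELS = ["X1a", "X1b", "X1c", "X2a", "X2b", "X3a", "X4"]


def encode_score(scores: pd.Series) -> pd.Series:
    """Map each label to its position in LEVELS (hypothetical helper).

    Labels that are not in LEVELS are encoded as -1.
    """
    cat = pd.Categorical(scores, categories=LEVELS, ordered=True)
    return pd.Series(cat.codes, index=scores.index, name="Score_int")


# Training data from the question.
train = pd.DataFrame(
    {"Score": ["X1c", "X3a", "X1a", "X2b", "X4", "X1a", "X1b", "X4"]}
)
# A made-up "new" data set that contains a label missing from training.
new = pd.DataFrame({"Score": ["X4", "X2a", "X1a"]})

train["Score_int"] = encode_score(train["Score"])
new["Score_int"] = encode_score(new["Score"])

print(train)
print(new)
```

Because the integers come from the position in LEVELS rather than from the values present, X4 gets the same code in both data frames, and a label that was absent from training (X2a here) still lands in its proper place. Unknown labels come back as -1, so it is worth checking for them before fitting the classifier.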