I have 2 dataframes, X_train and X_test. These 2 dataframes have the same columns.
There is 1 column called levels that needs to be changed from str to int. However, each dataframe's levels column has different unique values:
X_train has: ['Level 0', 'Level 10', 'Level 30'] as unique values.
X_test has: ['Level 20', 'Level 40'] as unique values.
The goal is to 1) combine the unique values from both X_train and X_test, and then 2) apply cat.codes to both dataframes so that they are consistent. How would I do that? Basically, the cat.codes applied to both dataframes should be as follows, even though one dataframe may not have values the other dataframe has:
{0: 'Level 0', 1: 'Level 10', 2: 'Level 20', 3: 'Level 30', 4: 'Level 40'}
Right now I only have the code below, but I'm not sure how to get the combined unique values of both dataframes into cat.codes.
X_train['levels'] = X_train['levels'].astype('category').cat.codes
X_test['levels'] = X_test['levels'].astype('category').cat.codes
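To make the inconsistency concrete, here is a minimal sketch of what the two lines above produce. The sample dataframes are hypothetical and built only from the unique values listed earlier; train_map and test_map are illustrative names.

```python
import pandas as pd

# Hypothetical sample data matching the unique values described above.
X_train = pd.DataFrame({'levels': ['Level 0', 'Level 10', 'Level 30']})
X_test = pd.DataFrame({'levels': ['Level 20', 'Level 40']})

# Each call infers its categories only from the values in its own column.
train_codes = X_train['levels'].astype('category').cat.codes
test_codes = X_test['levels'].astype('category').cat.codes

# Map each label to the code it received in its own dataframe.
train_map = dict(zip(X_train['levels'], train_codes.tolist()))
test_map = dict(zip(X_test['levels'], test_codes.tolist()))

print(train_map)  # {'Level 0': 0, 'Level 10': 1, 'Level 30': 2}
print(test_map)   # {'Level 20': 0, 'Level 40': 1}
```

Here 'Level 0' and 'Level 20' both receive code 0, so the codes cannot be compared across the two dataframes.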
Use CategoricalDtype to control the codes:
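import pandas as pd

# Build one sorted list of labels from both dataframes (NaN excluded).
lst = sorted(set(X_train['levels'].dropna().unique())
             | set(X_test['levels'].dropna().unique()))
# A shared dtype gives both columns the same label-to-code mapping.
lvl = pd.CategoricalDtype(lst, ordered=True)
X_train['codes'] = X_train['levels'].astype(lvl).cat.codes
X_test['codes'] = X_test['levels'].astype(lvl).cat.codes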
Output:
>>> X_train
     levels  codes
0   Level 0      0
1  Level 10      1
2  Level 30      3
>>> X_test
     levels  codes
0  Level 20      2
1  Level 40      4
2       NaN     -1
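One caveat: sorted() compares the labels as strings, which happens to match numeric order for the values in the question but would place a label such as 'Level 5' after 'Level 40'. If the levels can have differing digit counts, a possible variation is to sort by the numeric part. The sketch below assumes every label has the form 'Level <integer>'; the sample data and the level_number helper are hypothetical.

```python
import pandas as pd

# Hypothetical data; 'Level 5' is included to expose the ordering issue.
X_train = pd.DataFrame({'levels': ['Level 0', 'Level 10', 'Level 5']})
X_test = pd.DataFrame({'levels': ['Level 20', 'Level 40', None]})

def level_number(label):
    # Assumes every label looks like 'Level <integer>'.
    return int(label.split()[-1])

# Union of the non-missing labels from both dataframes, in numeric order.
labels = set(X_train['levels'].dropna()) | set(X_test['levels'].dropna())
lst = sorted(labels, key=level_number)
lvl = pd.CategoricalDtype(lst, ordered=True)

X_train['codes'] = X_train['levels'].astype(lvl).cat.codes
X_test['codes'] = X_test['levels'].astype(lvl).cat.codes

print(lst)                        # ['Level 0', 'Level 5', 'Level 10', 'Level 20', 'Level 40']
print(X_train['codes'].tolist())  # [0, 2, 1]
print(X_test['codes'].tolist())   # [3, 4, -1]
```

As in the output above, missing values still map to -1, and the codes now follow the numeric order of the levels.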