I have a test set and train set. They have a city columns one (train) has 290 unique and the test has 30. I am hoping there is overlap i.e. London, Bristol are in both sets, but Gloucester might be on one set and not the other.
I want to also encode these ,cities to a numerical value that correlates between both sets, so London should be encoded as 1 in test and train.
I have looked at LabelEncoder but cannot see how to get both sets to use the same numbering for cities they both share.
LabelEncoder works fine but no correlation between the two sets.
before:
df_train['City']
'London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester'
df_test['City']
'Paris', 'Rome','Rome','London', 'Gloucester'
output:
df_train['City']
1 2 3 4 1 6
df_test['City']
3 4 4 1 7
You can use a common CategoricalDtype
:
df_train = pd.DataFrame({'City': ['London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester']})
df_test = pd.DataFrame({'City': ['Paris', 'Rome','Rome','London', 'Gloucester']})
cat = pd.CategoricalDtype(pd.concat([df_train['City'], df_test['City']]).unique())
df_train['code'] = df_train['City'].astype(cat).cat.codes
df_test['code'] = df_test['City'].astype(cat).cat.codes
Output:
# df_train
City code
0 London 0
1 Bristol 1
2 Paris 2
3 Rome 3
4 London 0
5 Worcester 4
# df_test
City code
0 Paris 2
1 Rome 3
2 Rome 3
3 London 0
4 Gloucester 5
Or just convert the cities to Categorical:
df_train['City'] = df_train['City'].astype(cat)
df_test['City'] = df_test['City'].astype(cat)