pandas machine-learning encoding linear-regression categorical-data

Test and Train Data have different cities, how to find and differences and encode using the same coding system on both columns from test & train data

I have a test set and train set. They have a city columns one (train) has 290 unique and the test has 30. I am hoping there is overlap i.e. London, Bristol are in both sets, but Gloucester might be on one set and not the other.

I want to also encode these ,cities to a numerical value that correlates between both sets, so London should be encoded as 1 in test and train.

I have looked at LabelEncoder but cannot see how to get both sets to use the same numbering for cities they both share.

LabelEncoder works fine but no correlation between the two sets.

before:

df_train['City']
'London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester'

df_test['City']
'Paris', 'Rome','Rome','London', 'Gloucester'

output:

df_train['City']
1 2 3 4 1 6

df_test['City']
3 4 4 1 7

Solution

You can use a common CategoricalDtype:

df_train = pd.DataFrame({'City': ['London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester']})
df_test = pd.DataFrame({'City': ['Paris', 'Rome','Rome','London', 'Gloucester']})

cat = pd.CategoricalDtype(pd.concat([df_train['City'], df_test['City']]).unique())

df_train['code'] = df_train['City'].astype(cat).cat.codes
df_test['code'] = df_test['City'].astype(cat).cat.codes

Output:

# df_train
        City  code
0     London     0
1    Bristol     1
2      Paris     2
3       Rome     3
4     London     0
5  Worcester     4

# df_test
         City  code
0       Paris     2
1        Rome     3
2        Rome     3
3      London     0
4  Gloucester     5

Or just convert the cities to Categorical:

df_train['City'] = df_train['City'].astype(cat)
df_test['City'] = df_test['City'].astype(cat)