Search code examples
pandasmachine-learningencodinglinear-regressioncategorical-data

Test and Train Data have different cities, how to find and differences and encode using the same coding system on both columns from test & train data


I have a test set and train set. They have a city columns one (train) has 290 unique and the test has 30. I am hoping there is overlap i.e. London, Bristol are in both sets, but Gloucester might be on one set and not the other.

I want to also encode these ,cities to a numerical value that correlates between both sets, so London should be encoded as 1 in test and train.

I have looked at LabelEncoder but cannot see how to get both sets to use the same numbering for cities they both share.

LabelEncoder works fine but no correlation between the two sets.

before:

df_train['City']
'London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester'

df_test['City']
'Paris', 'Rome','Rome','London', 'Gloucester'

output:

df_train['City']
1 2 3 4 1 6

df_test['City']
3 4 4 1 7

Solution

  • You can use a common CategoricalDtype:

    df_train = pd.DataFrame({'City': ['London' , 'Bristol', 'Paris', 'Rome', 'London', 'Worcester']})
    df_test = pd.DataFrame({'City': ['Paris', 'Rome','Rome','London', 'Gloucester']})
    
    cat = pd.CategoricalDtype(pd.concat([df_train['City'], df_test['City']]).unique())
    
    df_train['code'] = df_train['City'].astype(cat).cat.codes
    df_test['code'] = df_test['City'].astype(cat).cat.codes
    

    Output:

    # df_train
            City  code
    0     London     0
    1    Bristol     1
    2      Paris     2
    3       Rome     3
    4     London     0
    5  Worcester     4
    
    # df_test
             City  code
    0       Paris     2
    1        Rome     3
    2        Rome     3
    3      London     0
    4  Gloucester     5
    

    Or just convert the cities to Categorical:

    df_train['City'] = df_train['City'].astype(cat)
    df_test['City'] = df_test['City'].astype(cat)