I am training a machine learning model to predict building prices.
One of the features is the city where the building is located, and there are many cities:
Unincorporated County 244550
Miami 91486
Miami Beach 39880
Hialeah 35439
Doral 20118
Miami Gardens 18031
Aventura 18011
Homestead 16472
Sunny Isles Beach 13587
Coral Gables 13365
North Miami 10843
Cutler Bay 10734
North Miami Beach 9592
Miami Lakes 6986
Palmetto Bay 6039
Key Biscayne 5170
Pinecrest 4575
Hialeah Gardens 4295
South Miami 2864
Sweetwater 2811
Bal Harbour 2794
North Bay Village 2767
Miami Shores 2764
Miami Springs 2689
Opa-locka 2632
Surfside 2401
Bay Harbor Islands 2031
Florida City 1924
West Miami 921
Biscayne Park 717
Medley 708
El Portal 522
Virginia Gardens 370
Golden Beach 283
Indian Creek 24
Here you can see the value_counts() of the city column; as far as I can tell, there are enough examples per city to include it in the model.
The problem comes when I split the data into train and test sets or do cross-validation. When I split the dataset using:
X_train, X_test, y_train, y_test = train_test_split(
df_x, df_y,
test_size=0.33, random_state=180
)
or do cross-validation:
score2 = cross_validate(estimator_pipeline, X=df_x, y=df_y,
                        scoring=scoring, return_train_score=False, cv=5, n_jobs=2)
I get this error:
Found unknown categories ['El Portal', 'Florida City', 'Medley'] in column 1 during transform
As I understand it, this is a problem with the one-hot encoder: it creates a new column for each city, but since the split into X_train and X_test happens before encoding, some cities that appear in the test partition never appear in the training partition, so the encoder fit on the training data doesn't recognize them.
Should I apply the one-hot encoder (or pd.get_dummies()) before the split, or is there a better way to split the dataset so that the same cities appear in both the train and test partitions?
For these cases, when you're one-hot encoding the categorical variable, you want to set handle_unknown='ignore', so that categories unseen during fit are encoded as all zeros at transform time and the output matrix keeps the same shape.
Here's a simple example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.Series(['West Miami', 'Biscayne Park', 'Medley'])
oh = OneHotEncoder(handle_unknown='ignore')
oh.fit(X_train.values[:, None])
oh.transform(X_train.values[:, None]).toarray()
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
And if we transform the following test set, which contains an unseen city, the resulting matrix keeps the same shape, with the unseen category's row encoded as all zeros:
X_test = pd.Series(['West Miami', 'Biscayne Park', 'Atlanta'])
oh.transform(X_test.values[:,None]).toarray()
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])
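To tie it back to your train_test_split/cross_validate setup: put the encoder and the estimator in a single Pipeline, so the encoder is fit only on each training fold and unseen test-fold cities are handled by handle_unknown='ignore'. A minimal sketch, assuming a column named 'city' and a RandomForestRegressor as the estimator (both are placeholders, not from your post; your real df_x/df_y replace the toy data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the real dataset
df_x = pd.DataFrame({
    'city': ['Miami', 'Hialeah', 'Doral', 'Medley', 'Miami', 'Doral',
             'Hialeah', 'Medley', 'Miami', 'Doral'],
    'sqft': [900, 1200, 1500, 800, 1100, 1600, 950, 700, 1300, 1400],
})
df_y = pd.Series([250, 310, 400, 200, 280, 420, 260, 190, 330, 390])

# The encoder lives inside the pipeline, so it is (re)fit on the
# training portion only; unknown cities become all-zero rows.
pipe = Pipeline([
    ('encode', ColumnTransformer(
        [('city', OneHotEncoder(handle_unknown='ignore'), ['city'])],
        remainder='passthrough')),
    ('model', RandomForestRegressor(random_state=0)),
])

# Plain train/test split: no error even if a city is missing from train
X_train, X_test, y_train, y_test = train_test_split(
    df_x, df_y, test_size=0.33, random_state=180)
pipe.fit(X_train, y_train)

# Cross-validation works the same way: the encoder is refit per fold
scores = cross_validate(pipe, X=df_x, y=df_y, cv=2, n_jobs=1)
print(scores['test_score'])
```

With this structure you never encode before splitting, so there is no train/test leakage, and you don't need to force the same cities into both partitions.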