I am training a machine learning model to predict building prices.
One of the features is the city where the building is located, and there are many cities:
Unincorporated County 244550
Miami 91486
Miami Beach 39880
Hialeah 35439
Doral 20118
Miami Gardens 18031
Aventura 18011
Homestead 16472
Sunny Isles Beach 13587
Coral Gables 13365
North Miami 10843
Cutler Bay 10734
North Miami Beach 9592
Miami Lakes 6986
Palmetto Bay 6039
Key Biscayne 5170
Pinecrest 4575
Hialeah Gardens 4295
South Miami 2864
Sweetwater 2811
Bal Harbour 2794
North Bay Village 2767
Miami Shores 2764
Miami Springs 2689
Opa-locka 2632
Surfside 2401
Bay Harbor Islands 2031
Florida City 1924
West Miami 921
Biscayne Park 717
Medley 708
El Portal 522
Virginia Gardens 370
Golden Beach 283
Indian Creek 24
Here you can see the value_counts() of the city column; as far as I can tell, there are enough examples per city to include it in the model.
The problem comes when I split the data into train and test sets or do cross-validation. When I split the dataset using:
X_train, X_test, y_train, y_test = train_test_split(
df_x, df_y,
test_size=0.33, random_state=180
)
or do cross-validation:
score2 = cross_validate(estimator_pipeline, X=df_x, y=df_y,
                        scoring=scoring, return_train_score=False, cv=5, n_jobs=2)
I get this error:
Found unknown categories ['El Portal', 'Florida City', 'Medley'] in column 1 during transform
As I understand it, this is a problem with the one-hot encoder: it creates a new column for each city, but since the split into X_train and X_test happens before encoding, some cities that appear in the test partition never appear in the training partition, so the encoder fit on the training data doesn't recognize them.
Should I apply the one-hot encoder (or pd.get_dummies()) before the split, or is there a better way to split the dataset so that the same cities appear in both the train and test partitions?
For these cases, when you're one-hot encoding the categorical variable, you want to set handle_unknown='ignore', so that categories unseen during fit are encoded as all zeros at transform time and the output matrix keeps the same shape.
Here's a simple example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.Series(['West Miami', 'Biscayne Park', 'Medley'])
oh = OneHotEncoder(handle_unknown='ignore')
oh.fit(X_train.values[:, None])
oh.transform(X_train.values[:, None]).toarray()
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
And if we transform the following test set, which contains an unseen city, the resulting matrix keeps the same shape, with the unseen category's row encoded as all zeros:
X_test = pd.Series(['West Miami', 'Biscayne Park', 'Atlanta'])
oh.transform(X_test.values[:,None]).toarray()
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])
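To tie it back to your train_test_split/cross_validate setup: put the encoder and the estimator in a single Pipeline, so the encoder is fit only on each training fold and unseen test-fold cities are handled by handle_unknown='ignore'. A minimal sketch, assuming a column named 'city' and a RandomForestRegressor as the estimator (both are placeholders, not from your post; your real df_x/df_y replace the toy data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the real dataset
df_x = pd.DataFrame({
    'city': ['Miami', 'Hialeah', 'Doral', 'Medley', 'Miami', 'Doral',
             'Hialeah', 'Medley', 'Miami', 'Doral'],
    'sqft': [900, 1200, 1500, 800, 1100, 1600, 950, 700, 1300, 1400],
})
df_y = pd.Series([250, 310, 400, 200, 280, 420, 260, 190, 330, 390])

# The encoder lives inside the pipeline, so it is (re)fit on the
# training portion only; unknown cities become all-zero rows.
pipe = Pipeline([
    ('encode', ColumnTransformer(
        [('city', OneHotEncoder(handle_unknown='ignore'), ['city'])],
        remainder='passthrough')),
    ('model', RandomForestRegressor(random_state=0)),
])

# Plain train/test split: no error even if a city is missing from train
X_train, X_test, y_train, y_test = train_test_split(
    df_x, df_y, test_size=0.33, random_state=180)
pipe.fit(X_train, y_train)

# Cross-validation works the same way: the encoder is refit per fold
scores = cross_validate(pipe, X=df_x, y=df_y, cv=2, n_jobs=1)
print(scores['test_score'])
```

With this structure you never encode before splitting, so there is no train/test leakage, and you don't need to force the same cities into both partitions.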