python-3.x, machine-learning, scikit-learn, random-forest, one-hot-encoding

One-hot encoding with a categorical dataset: how to deal with different (fewer) values in categorical data


Training dataset total categorical columns: 27

Test dataset total categorical columns: 27

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))

After encoding, while preparing the test data for prediction:

number of columns from test data: 115

number of columns from train data: 122

I checked the cardinality in the test data; for a few columns it is lower than in the corresponding train data columns.

Train_data.column#1: 2
Test_data.column#1: 1

Train_data.column#2: 5
Test_data.column#2: 3
and more...

So one-hot encoding the test data on its own automatically produces fewer columns. Is there a better way to prepare the test dataset without any data loss?
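
The mismatch can be reproduced on a small scale. The sketch below uses hypothetical toy data (the `color` and `size` columns are not from the real dataset) and fits a separate encoder on each set, as in the code above. It calls `.toarray()` on the default sparse output instead of passing `sparse=False`, which is only a choice made here to avoid depending on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set has fewer distinct values per column.
X_train = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "S"]})
X_test = pd.DataFrame({"color": ["red", "red"], "size": ["M", "S"]})

# Each encoder learns its own category list from the data it is fitted on.
train_encoded = OneHotEncoder().fit_transform(X_train).toarray()
test_encoded = OneHotEncoder().fit_transform(X_test).toarray()

print(train_encoded.shape)  # (3, 5): 3 colors + 2 sizes
print(test_encoded.shape)   # (2, 3): 1 color + 2 sizes
```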


Solution

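  • The ideal procedure is to fit the OneHotEncoder on the training data and then only transform the test data. The encoder emits one column for every category it saw during fitting, so this way you get the same number of columns in train and test data. With handle_unknown='ignore', a category that appears only in the test data is encoded as all zeros.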

    Something like the following:

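    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Note: in scikit-learn 1.2 and later, `sparse` is named `sparse_output`.
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    # Learn the categories from the training data only.
    OH_encoder.fit(X_train)

    # Reuse the fitted encoder so the test output has the same columns.
    OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))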
    

    To see the column names of the OneHotEncoder output, use the get_feature_names method (named get_feature_names_out in scikit-learn 1.0 and later). This answer might also help.