python-3.x machine-learning scikit-learn random-forest one-hot-encoding

One-hot encoding a categorical dataset: how to deal with fewer distinct values in the test data

Training dataset total categorical columns: 27

Test dataset total categorical columns: 27

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))

After encoding, while preparing the test data for prediction:

number of columns from test data: 115

number of columns from train data: 122

I checked the cardinality in the test data; for a few columns it is lower than in the corresponding training data columns.

Train_data.column#1: 2
Test_data.column#1: 1

Train_data.column#2: 5
Test_data.column#2: 3
and so on.

So the number of columns is automatically reduced during one-hot encoding. Is there a better way to prepare the test data set without any data loss?

Solution

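The ideal procedure is to fit the OneHotEncoder on the training data and then call transform on the test data. The encoder learns its categories during fit, so a category that is missing from the test data still gets its own (all-zero) column, and the train and test data end up with the same number of columns.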

Something like the following:

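import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on the training data only (scikit-learn >= 1.2 renames `sparse` to `sparse_output`)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train)

OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))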
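To see the effect on column counts, here is a minimal sketch on a toy dataset. The column names and values are made up for illustration and are not from the question. It leaves out the `sparse` / `sparse_output` argument and calls `toarray()` instead, so it should not depend on which of the two names the installed scikit-learn version accepts.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set has fewer distinct values per column.
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "green"],
    "size": ["S", "M", "S", "M", "S"],
})
X_test = pd.DataFrame({
    "color": ["red", "red", "blue"],
    "size": ["S", "S", "S"],
})

# Fit on the training data only; the categories are learned from X_train.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
train_encoded = pd.DataFrame(encoder.transform(X_train).toarray(), index=X_train.index)
test_encoded = pd.DataFrame(encoder.transform(X_test).toarray(), index=X_test.index)

# For contrast: fitting a separate encoder on the test data loses columns.
refit_on_test = OneHotEncoder(handle_unknown="ignore").fit_transform(X_test).toarray()

print(train_encoded.shape)  # (5, 5)
print(test_encoded.shape)   # (3, 5)
print(refit_on_test.shape)  # (3, 3)
```

The encoder fitted on the training data produces five columns for both sets (three colors plus two sizes), while the encoder refitted on the test data produces only three, which is the mismatch described in the question.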

To recover the column names of the OneHotEncoder output, use its get_feature_names method (called get_feature_names_out in scikit-learn 1.0 and later). This answer might also help.