Training dataset total categorical columns: 27
Test dataset total categorical columns: 27
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))
After encoding, while preparing the test data for prediction:
number of columns from test data: 115
number of columns from train data: 122
I checked the cardinality in the test data; it is lower for a few columns compared to the train data. For example:
Train_data column #1: 2 categories, Test_data column #1: 1 category; Train_data column #2: 5 categories, Test_data column #2: 3 categories; and so on.
So one-hot encoding automatically produces fewer columns for the test data. Is there a better way to prepare the test data set without any data loss?
The ideal procedure is to fit the OneHotEncoder
on the training data and then call transform
on the test data. That way, you get a consistent number of columns in train and test data, because the encoder reuses the categories it learned from the training set.
Something like the following:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train)  # learn the categories from the training data only
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))  # reuse those categories on the test data
To get the column names of the OneHotEncoder output,
use the get_feature_names
method (get_feature_names_out in newer scikit-learn versions). This answer might also help.