I have two files, test.csv and train.csv. The attribute values are categorical, and I am trying to convert them into numerical values. I am doing the following:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['column_name'], return_df=True)
x_train_data = encoder.fit_transform(x_train_data)
This resulted in a new table with a total of 13 columns.
After that, I am training my DecisionTreeClassifier on x_train_data and y_train_data. Finally, I want to predict the labels in test.csv.
If I repeat the binary-encoding procedure on test.csv, it results in fewer than 13 features, which I think is due to it having fewer rows. Because of this mismatch in the number of columns, the decision tree classifier won't work.
So, is there a way to predict? And if not, then what is the point of BinaryEncoder? I assume we train a model so that we can predict on an unseen dataset.
You just call transform() on the test data (and do not fit the encoder again). Values that don't occur in the training dataset will be encoded as 0 in all of the binary columns (as long as you don't change the handle_unknown parameter). For example:
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2": ["A", "A", "A", "A", "B"]})
encoder = ce.BinaryEncoder(cols=['var1', 'var2'], return_df=True)
x_train_data = encoder.fit_transform(train)
#    var1_0  var1_1  var2_0  var2_1
# 0       0       1       0       1
# 1       1       0       0       1
# 2       0       1       0       1
# 3       1       0       0       1
# 4       1       1       1       0
test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)
#    var1_0  var1_1  var2_0  var2_1
# 0       1       1       0       1
# 1       0       0       0       0
# 2       1       0       0       0
'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.