python machine-learning scikit-learn prediction categorical-data

How to use Binary Encoding of Categorical Columns to predict labels in Python?

I have two files, test.csv and train.csv. The attribute values are categorical, and I am trying to convert them into numerical values.

I am doing the following:

import category_encoders as ce
encoder = ce.BinaryEncoder(cols = 'column_name' , return_df = True)
x_train_data = encoder.fit_transform(x_train_data)

This resulted in a new table with a total of 13 columns. After that, I train my DecisionTreeClassifier on x_train_data and y_train_data.

Finally, I want to predict the labels in test.csv. If I repeat the binary encoding procedure on test.csv, it produces fewer than 13 features, which I think is because the test file has fewer rows. Because the number of columns differs, the decision tree classifier won't work.

So, is there a way to predict? And if not, what is the point of BinaryEncoder? I assume we train a model so that we can predict on an unknown dataset.

Solution

Call only transform() on the test data; do not fit the encoder again. The encoder fitted on the training data then produces the same columns for the test data. Values that do not occur in the training data are encoded as 0 in every binary column (as long as you leave the handle_unknown parameter at its default). For example:

import category_encoders as ce
import pandas as pd

train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2":["A", "A", "A", "A", "B"]})

encoder = ce.BinaryEncoder(cols = ['var1', 'var2'] , return_df = True)
x_train_data = encoder.fit_transform(train)

#   var1_0  var1_1  var2_0  var2_1
#0  0       1       0       1
#1  1       0       0       1
#2  0       1       0       1
#3  1       0       0       1
#4  1       1       1       0

test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)

#   var1_0  var1_1  var2_0  var2_1
#0  1       1       0       1
#1  0       0       0       0
#2  1       0       0       0

'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
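
To see why calling only transform() keeps the column count stable, here is a rough pure-Python sketch of the idea. TinyBinaryEncoder is a hypothetical toy class written for illustration; it is not how category_encoders is implemented, and the rule it uses to pick the number of bits is an assumption. The point it shows is that the category-to-code mapping and the output width are both fixed at fit time, so refitting on different data can change the width.

```python
class TinyBinaryEncoder:
    """Toy binary encoder (hypothetical; not the category_encoders implementation)."""

    def fit(self, values):
        # Give each distinct category an ordinal code 1..k in order of first
        # appearance. Code 0 is reserved for categories never seen during fit.
        self.codes_ = {}
        for value in values:
            if value not in self.codes_:
                self.codes_[value] = len(self.codes_) + 1
        # The output width is decided here, from the training data only.
        self.n_bits_ = max(1, len(self.codes_).bit_length())
        return self

    def transform(self, values):
        rows = []
        for value in values:
            code = self.codes_.get(value, 0)  # unseen category -> all zeros
            bits = [(code >> shift) & 1 for shift in reversed(range(self.n_bits_))]
            rows.append(bits)
        return rows


train_var1 = ["A", "B", "A", "B", "C"]
test_var1 = ["C", "D", "B"]

# Fit once on the training column, then reuse the same encoder for the test column.
encoder = TinyBinaryEncoder().fit(train_var1)
train_rows = encoder.transform(train_var1)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 1]]
test_rows = encoder.transform(test_var1)    # [[1, 1], [0, 0], [1, 0]]

# Refitting on data with fewer distinct categories changes the width,
# which is the column mismatch described in the question.
refit_rows = TinyBinaryEncoder().fit(["C"]).transform(["C"])  # [[1]]
```

With the real library the pattern is the same: fit_transform() on the training frame, transform() on the test frame, and then pass the transformed test frame to the classifier's predict().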