python, machine-learning, scikit-learn, prediction, categorical-data

How to use Binary Encoding of Categorical Columns to predict labels in Python?


I have two files, test.csv and train.csv. The attribute values are categorical, and I am trying to convert them into numerical values.

I am doing the following:

    import category_encoders as ce

    # cols takes a list of column names
    encoder = ce.BinaryEncoder(cols=['column_name'], return_df=True)
    x_train_data = encoder.fit_transform(x_train_data)

This resulted in a new table with a total of 13 columns. After that, I train my DecisionTreeClassifier on x_train_data and y_train_data.

Finally, I want to predict the labels in test.csv. If I repeat the BinaryEncoder procedure on test.csv, it produces fewer than 13 features, which I think is due to the smaller number of rows. Because the number of columns differs, the decision tree classifier won't work.

So, is there a way to predict? And if not, what is the point of BinaryEncoder? I assume we train a model so that we can predict on an unseen dataset.


Solution

  • Call only transform() on the test data; do not fit the encoder again. The encoder fitted on the training data then produces the same columns for the test data. Values that do not occur in the training data are encoded as 0 in every column of that variable (as long as you leave the handle_unknown parameter at its default). For example:

    import category_encoders as ce
    import pandas as pd
    
    train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2":["A", "A", "A", "A", "B"]})
    
    encoder = ce.BinaryEncoder(cols = ['var1', 'var2'] , return_df = True)
    x_train_data = encoder.fit_transform(train)
    
    #   var1_0  var1_1  var2_0  var2_1
    #0  0       1       0       1
    #1  1       0       0       1
    #2  0       1       0       1
    #3  1       0       0       1
    #4  1       1       1       0
    
    test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
    # transform only: reuse the mapping learned from the training data
    x_test_data = encoder.transform(test)
    
    #   var1_0  var1_1  var2_0  var2_1
    #0  1       1       0       1
    #1  0       0       0       0
    #2  1       0       0       0
    

    'D' does not occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' do not occur in var2 in the training data, so both were encoded as 0 0.
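
    To see why transform() keeps the column count stable, here is a minimal pure-Python sketch of the fit/transform split. TinyBinaryEncoder is a hypothetical stand-in written only for illustration; it is not how category_encoders is implemented, and it assumes ordinals are assigned in order of first appearance with 0 reserved for unseen values. It reproduces the var1 columns from the example above.

```python
class TinyBinaryEncoder:
    """Hypothetical illustration of binary encoding, not the library's code."""

    def fit(self, values):
        # Assign ordinals 1, 2, 3, ... in order of first appearance.
        self.mapping = {}
        for v in values:
            if v not in self.mapping:
                self.mapping[v] = len(self.mapping) + 1
        # The number of output columns is fixed here, from training data only.
        self.width = max(len(self.mapping), 1).bit_length()
        return self

    def transform(self, values):
        rows = []
        for v in values:
            code = self.mapping.get(v, 0)  # unseen category -> 0 -> all-zero bits
            bits = format(code, "0{}b".format(self.width))
            rows.append([int(b) for b in bits])
        return rows


train_var1 = ["A", "B", "A", "B", "C"]
test_var1 = ["C", "D", "B"]

encoder = TinyBinaryEncoder().fit(train_var1)
train_rows = encoder.transform(train_var1)
test_rows = encoder.transform(test_var1)  # reuses the training mapping

print(train_rows)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 1]]
print(test_rows)   # [[1, 1], [0, 0], [1, 0]]

# Refitting on data with fewer distinct categories changes the width,
# which is the column mismatch described in the question.
refit_width = TinyBinaryEncoder().fit(["C", "C"]).width
print(encoder.width, refit_width)  # 2 1
```

    The column count is decided once, in fit(), so any later transform() call returns rows of that same width regardless of which categories the new data contains.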