python scikit-learn prediction

How to retain the columns from training data for prediction in Python


I have a dataset that looks like below:

| Amount   | Source | y |
| -------- | ------ | - |
| 285      | a      | 1 |
| 556      | b      | 0 | 
| 883      | c      | 0 |
| 156      | c      | 1 |
| 374      | a      | 1 |
| 1520     | d      | 0 |

'Source' is a categorical variable with the categories 'a', 'b', 'c' and 'd', so the one-hot encoded columns are 'source_a', 'source_b', 'source_c' and 'source_d'. I am training a model on this data to predict y. The new data for prediction does not contain every category seen in training; it only has 'a', 'c' and 'd'. When I one-hot encode this dataset, the column 'source_b' is missing. How do I transform the new data so that it has the same columns as the training data?

PS: I am using XGBClassifier() for prediction.


Solution

  • Use the same encoder instance for training and prediction. Assuming you use scikit-learn's OneHotEncoder, fit it on the training data and save it with pickle so you can load it later for inference.

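    import pickle

    from sklearn.preprocessing import OneHotEncoder

    # handle_unknown='ignore' encodes categories not seen during fit as all zeros
    enc = OneHotEncoder(handle_unknown='ignore')
    # X_train is assumed to be the 'Source' column as 2-D input, e.g. df[['Source']]
    X_train = enc.fit_transform(X_train)
    # Save the fitted encoder so the same instance can be reused at inference time
    with open('onehot.pickle', 'wb') as f:
        pickle.dump(enc, f)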
    

    Then load it for inference:

    import pickle

    with open('onehot.pickle', 'rb') as f:
        loaded_enc = pickle.load(f)
    

    Then transform the new data with the loaded encoder:

    # X_test is the 'Source' column of the new data, in the same 2-D shape used for fitting
    X_test = loaded_enc.transform(X_test)
    

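    In general, once the encoder has been fitted on X_train, you only call transform (never fit_transform) on the test or prediction data. The output then always has the columns learned during training, including an all-zero column for a category such as 'b' that is absent from the new data: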

    X_test = loaded_enc.transform(X_test)
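
To make this concrete, here is a minimal end-to-end sketch using the table from the question. The new rows in `new`, the helper `encode`, and the `Source_<category>` column names are my own assumptions for illustration, not something from the original post. The pickle round trip is done in memory instead of through a file, and `.toarray()` is used so the sketch does not depend on which sparse-output option your scikit-learn version supports.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data copied from the table in the question.
train = pd.DataFrame({
    "Amount": [285, 556, 883, 156, 374, 1520],
    "Source": ["a", "b", "c", "c", "a", "d"],
    "y": [1, 0, 0, 1, 1, 0],
})
# Hypothetical new data for prediction: category 'b' never appears.
new = pd.DataFrame({"Amount": [300, 910, 1200], "Source": ["a", "c", "d"]})

# Fit on the training column only; double brackets keep the input 2-D.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Source"]])

# In-memory pickle round trip, standing in for saving and loading a file.
loaded_enc = pickle.loads(pickle.dumps(enc))


def encode(df, encoder):
    """Return Amount plus one column per category seen during fit (hypothetical helper)."""
    cols = [f"Source_{c}" for c in encoder.categories_[0]]
    onehot = pd.DataFrame(
        encoder.transform(df[["Source"]]).toarray(),
        columns=cols,
        index=df.index,
    )
    return pd.concat([df[["Amount"]], onehot], axis=1)


X_train = encode(train, loaded_enc)
X_new = encode(new, loaded_enc)

print(X_new.columns.tolist())
# ['Amount', 'Source_a', 'Source_b', 'Source_c', 'Source_d']
```

`X_new` has the same columns in the same order as `X_train`, with `Source_b` filled with zeros, so it should be usable with a model fitted on `X_train` (for example the XGBClassifier mentioned in the question).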