How to retain the columns from training data for prediction in Python

I have a dataset that looks like below:

| Amount   | Source | y |
| -------- | ------ | - |
| 285      | a      | 1 |
| 556      | b      | 0 | 
| 883      | c      | 0 |
| 156      | c      | 1 |
| 374      | a      | 1 |
| 1520     | d      | 0 |

'Source' is the categorical variable. Its categories are 'a', 'b', 'c' and 'd', so the one-hot encoded columns are 'source_a', 'source_b', 'source_c' and 'source_d'. I am using this model to predict values for y. The new data for prediction does not contain all the categories used in training; it only has 'a', 'c' and 'd'. When I one-hot encode this dataset, the column 'source_b' is missing. How do I transform this data so that it has the same columns as the training data?

PS: I am using XGBClassifier() for prediction.

Solution

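Use the same encoder instance for training and prediction. A fitted encoder remembers the categories it saw during fit, so transform always produces the same columns: 'source_b' is still there (all zeros) even when 'b' never appears in the new data. Assuming you opted for sklearn's OneHotEncoder, all you have to do is fit it on the training data and save it as a pickle so you can reuse it later for inference.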

from sklearn.preprocessing import OneHotEncoder
import pickle

enc = OneHotEncoder(handle_unknown='ignore')
# Assume X_train holds only the 'Source' column, as 2D input (e.g. df[['Source']])
X_train = enc.fit_transform(X_train)
# Save the fitted encoder so it can be reused at inference time
with open('onehot.pickle', 'wb') as f:
    pickle.dump(enc, f)

And then load it for inference:

import pickle
# Load the encoder that was fitted on the training data
with open('onehot.pickle', 'rb') as f:
    loaded_enc = pickle.load(f)

Then transform the new data with the loaded encoder:

# X_test is the 'Source' column of your test data, in the same 2D shape as X_train
X_test = loaded_enc.transform(X_test)

In general, fit the encoder on X_train only. For the test set (or any new data), call transform, not fit_transform, so the columns stay the same as in training:

X_test = loaded_enc.transform(X_test)
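To check the result on data shaped like the table in the question, here is a minimal end-to-end sketch. The rows in `new`, the helper `encode`, and the `source_` column prefix are made up for illustration, and the pickle step is left out to keep it short. The sparse output is converted with `.toarray()` so it can be put back into a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data from the question (without the target y)
train = pd.DataFrame({
    "Amount": [285, 556, 883, 156, 374, 1520],
    "Source": ["a", "b", "c", "c", "a", "d"],
})
# Hypothetical new data: category 'b' never appears
new = pd.DataFrame({
    "Amount": [410, 95, 1300],
    "Source": ["a", "c", "d"],
})

# Fit once, on the training data only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Source"]])

def encode(df, encoder):
    """Return Amount plus one column per category the encoder learned."""
    onehot = encoder.transform(df[["Source"]]).toarray()
    cols = ["source_" + str(c) for c in encoder.categories_[0]]
    onehot_df = pd.DataFrame(onehot, columns=cols, index=df.index)
    return pd.concat([df[["Amount"]], onehot_df], axis=1)

X_train = encode(train, enc)
X_new = encode(new, enc)

print(list(X_new.columns))
# ['Amount', 'source_a', 'source_b', 'source_c', 'source_d']
```

Here `X_new` has the same columns, in the same order, as `X_train`, and `source_b` is zero in every row, so it can be passed to the fitted XGBClassifier's `predict` as is.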