I have a dataset that looks like this:
| Amount | Source | y |
| -------- | ------ | - |
| 285 | a | 1 |
| 556 | b | 0 |
| 883 | c | 0 |
| 156 | c | 1 |
| 374 | a | 1 |
| 1520 | d | 0 |
'Source' is the categorical variable, with categories 'a', 'b', 'c' and 'd', so the one-hot encoded columns are 'source_a', 'source_b', 'source_c' and 'source_d'. I am training a model on this data to predict y. The new data for prediction does not contain all the categories used in training; it only has 'a', 'c' and 'd'. When I one-hot encode this dataset, the column 'source_b' is missing. How do I transform the new data so that it has the same columns as the training data?
PS: I am using XGBClassifier() for prediction.
Use the same encoder instance for training and prediction. Assuming you used sklearn's OneHotEncoder, fit it once on the training data and save it as a pickle so you can load it again at inference time:
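from sklearn.preprocessing import OneHotEncoder
import pickle
# handle_unknown='ignore' encodes categories not seen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# assume X_train is the 'Source' column as a 2-D object, e.g. df[['Source']]
X_train = enc.fit_transform(X_train)
# save the fitted encoder; the with-block closes the file afterwards
with open('onehot.pickle', 'wb') as f:
    pickle.dump(enc, f)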
Then load it for inference:
import pickle
with open('onehot.pickle', 'rb') as f:
    loaded_enc = pickle.load(f)
Then all you have to do is call transform:
# X_test is the 'Source' column of your new data, in the same 2-D shape used for fitting
X_test = loaded_enc.transform(X_test)
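In general, once the encoder has been fitted on X_train, you only call transform (never fit_transform) on the test set. The output has one column per category learned during fit, so the 'b' column is still there and is simply all zeros when 'b' does not appear. So: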
X_test = loaded_enc.transform(X_test)
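To make this concrete, here is a minimal end-to-end sketch using the 'Source' values from the question. The `new` DataFrame, its 'Amount' values, and the `source_` column prefix are made up for illustration, and the pickle step is left out to keep it short. It assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data from the question (only the columns needed for encoding)
train = pd.DataFrame({
    "Amount": [285, 556, 883, 156, 374, 1520],
    "Source": ["a", "b", "c", "c", "a", "d"],
})

# Hypothetical new data: category 'b' never appears
new = pd.DataFrame({
    "Amount": [300, 900, 1200],
    "Source": ["a", "c", "d"],
})

# Fit on the training column only; double brackets keep the input 2-D
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Source"]])

# transform() returns a sparse matrix by default, so convert it to a dense array
encoded_new = enc.transform(new[["Source"]]).toarray()

# Build column names from the categories learned during fit
columns = [f"source_{c}" for c in enc.categories_[0]]

# Put the encoded columns back next to the numeric feature
new_encoded = pd.concat(
    [new[["Amount"]], pd.DataFrame(encoded_new, columns=columns, index=new.index)],
    axis=1,
)
print(new_encoded)
```

Because the column names come from `enc.categories_`, which was learned on the training data, `new_encoded` has the same layout as the encoded training set and can be passed to the model's `predict`.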