python machine-learning logistic-regression one-hot-encoding dummy-variable

Dummy variable levels not present in unseen data


I have trained a logistic regression model with 5 levels of a categorical variable and all the levels are significant for the model.

However, in the unseen data the categorical variable has only 3 levels. As a result, the trained model fails to predict on the unseen data because it cannot find some of the levels.

I used one-hot encoding to convert the categorical variable. How can this issue be resolved?

Code used to convert to dummy variables in the train set:

   metadata_employeegroup = pd.get_dummies(df['metadata_employeegroup'], prefix='metadata_employeegroup', drop_first=True)
   df = pd.concat([df, metadata_employeegroup], axis=1)

Based on recursive feature elimination (RFE), only some factor levels are significant for the model, so when training I subset the training set to those columns:

   logsk.fit(X_train[col], y_train)
   y_pred = logsk.predict_proba(X_test[col])

Here col contains only 3 levels of metadata_employeegroup, say L1, L2, and L3.

On the unseen data, I follow the same approach to create the dummy variables. However, the only levels of metadata_employeegroup present there are L1 and L2, so the trained model cannot find the L3 column and throws an error.


Solution

  • For each level of the categorical variable that is missing from the unseen data, add a corresponding dummy column to that data and set its value to 0 for every record, so that the unseen data has the same columns the model was trained on.

    I was able to solve it using this One Hot Encoding Tutorial.
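
One way to add the missing levels is sketched below. It is only an illustration, not necessarily what the linked tutorial does: the toy data and the level names L1 to L4 are made up, and col stands in for the list of selected training columns from the question. The idea is to reindex the unseen dummy columns against the training columns, which creates any missing column filled with 0.

```python
import pandas as pd

# Hypothetical toy data; the level names L1..L4 are illustrative only.
train = pd.DataFrame({"metadata_employeegroup": ["L1", "L2", "L3", "L4", "L1"]})
unseen = pd.DataFrame({"metadata_employeegroup": ["L1", "L2", "L2"]})

prefix = "metadata_employeegroup"

# Encode the unseen data without drop_first, so that which level gets
# dropped does not depend on the levels that happen to be present.
unseen_dummies = pd.get_dummies(unseen[prefix], prefix=prefix, dtype=int)

# Stand-in for `col` in the question: the columns the model was trained on.
col = [f"{prefix}_L1", f"{prefix}_L2", f"{prefix}_L3"]

# reindex adds every column in `col` that is missing from the unseen data
# (filled with 0), drops columns not in `col`, and matches the column order.
X_unseen = unseen_dummies.reindex(columns=col, fill_value=0)

print(X_unseen)
```

Assuming logsk is the fitted model from the question, X_unseen could then be passed to logsk.predict_proba(X_unseen). Note that the question's code uses drop_first=True; applying that to unseen data with fewer levels can drop a different level than it did in training, so selecting columns by name as above is the safer choice.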