python, machine-learning, classification, missing-data, imputation

Handling missing categorical values in ML


I have gone through the question "replace missing values in categorical data" on handling missing values in categorical data.

My dataset has about 6 categorical columns with missing values. This is for a binary classification problem.

I see different approaches: one is to just leave the missing values in the categorical columns as their own category, another is to impute them using `from sklearn.preprocessing import Imputer`, but I am unsure which is the better option.

In case imputing is the better option, which libraries could I use before applying a model like Logistic Regression, Decision Tree, or Random Forest?

Thanks!


Solution

  • There are multiple ways to handle missing data:

    • Some models handle it natively (XGBoost and LightGBM, for example).
    • You can try to impute them with a model. Split your data into a train and test set, and try different models to measure which one works best. But more often than not, it doesn't work very well. There is a KNNImputer implemented in sklearn.
    • You can also define rules: set missing values to 0, the mean, the median, or whatever works for your dataset. There is a SimpleImputer implemented in sklearn.
    • If none of the above works for you, you can also drop the rows with missing values.

    More details on imputing values in sklearn: https://scikit-learn.org/stable/modules/impute.html
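The SimpleImputer option above can be sketched as follows for categorical columns, using `strategy="most_frequent"` (the column names `color` and `size` and the data are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing categorical values (columns are hypothetical)
df = pd.DataFrame({
    "color": ["red", np.nan, "blue", "red"],
    "size": ["S", "M", np.nan, "M"],
})

# Fill each missing cell with the most frequent value of its column
imputer = SimpleImputer(strategy="most_frequent")
df[["color", "size"]] = imputer.fit_transform(df[["color", "size"]])

print(df)
```

Fit the imputer on the training set only and reuse it on the test set with `transform`, so no information leaks from test to train.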
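The other option raised in the question, keeping missing values as a category of their own, can be sketched like this. The data and labels are made up; the categorical columns are one-hot encoded so that a model like Logistic Regression can consume them:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: categorical features with missing values, binary target
df = pd.DataFrame({
    "color": ["red", np.nan, "blue", "red", np.nan, "blue"],
    "size": ["S", "M", np.nan, "M", "S", "S"],
})
y = [0, 1, 0, 1, 0, 1]

# Replace NaN with an explicit "missing" label so the encoder
# treats missingness as just another category
X = df.fillna("missing")

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # ignore unseen categories at predict time
    LogisticRegression(),
)
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

The same pipeline works with a Decision Tree or Random Forest by swapping the final estimator.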