python, pandas, machine-learning, scikit-learn, data-processing

There are fewer rows in the dataframe after one hot encoding


I have a dataset which I'd like to one-hot encode using sklearn.preprocessing.OneHotEncoder. My problem is that after the encoding, the result contains fewer rows than the original dataset (5 rows are missing). Here is my code:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_cat = pd.DataFrame(
            one_hot_encoder.fit_transform(X[categorical_vars]),
            columns=one_hot_encoder.get_feature_names(categorical_vars)
)

Thanks for any advice in advance. :)


Solution

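  • I think you get fewer output columns because some of your categorical values are unknown to the encoder. Because you set the keyword 'handle_unknown' to 'ignore', these categories are skipped: a value that was not seen during fitting gets no column of its own and is encoded as all zeros instead of raising an error.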

    If you give me some of your sample data, I can test it for you and give you a better explanation. Otherwise, I can advise you to read this post. It explains well the purpose of the 'handle_unknown' keyword and why and when to use it.
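
    To make the effect of 'handle_unknown' concrete, here is a minimal sketch on made-up data. The frames train and new and the column name "color" are hypothetical and not taken from the question. It assumes the encoder is fitted on one frame and applied to another that contains an unseen value, and it uses the default sparse output with .toarray() so it does not depend on the sparse keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column name "color" is made up for illustration.
train = pd.DataFrame({"color": ["red", "green", "red"]})
new = pd.DataFrame({"color": ["green", "blue"]})  # "blue" was not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() converts it to a dense array.
encoded = encoder.transform(new).toarray()

print(encoder.categories_)  # categories learned during fit
print(encoded)              # one row per input row; the "blue" row is all zeros
print(encoded.shape[0] == len(new))  # compares output rows with input rows
```

    In this sketch the unknown value only produces an all-zero row, and the output has as many rows as the input, so it may be worth comparing X_cat.shape with X.shape directly on your own data.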